[
https://issues.apache.org/jira/browse/ACCUMULO-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225307#comment-13225307
]
Adam Fuchs commented on ACCUMULO-452:
-------------------------------------
I had another thought on this: locality groups are good for features that are
in a relatively constant, low cardinality set with a fairly dense distribution
across the primary partitioning dimension. Also, queries must be aligned with
the locality group frequently enough to amortize the cost of that partitioning.
This means that the current column family-based locality groups only really
help when cells in sorted order frequently oscillate between locality groups. I
want to say that this type of feature tends to be something that is explicitly
modeled based on how the user wants to query their data. If the user decides to
put this information in the row or the column qualifier, could they just as
easily put it into the column family? By the way, expressions like the ones
John mentions in ACCUMULO-164 help to group high cardinality features into a
low cardinality set of groups, so I think we're on the same page there.
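
To illustrate what I mean by collapsing a high cardinality feature into a few
groups, here is a rough sketch in plain Java. It is not the ACCUMULO-164
proposal and not Accumulo code: ExpressionGrouping, rule, and groupFor are
names made up for this example, and the choice of regular expressions with
first-match-wins ordering is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: maps many column family names onto a few named groups.
class ExpressionGrouping {

    // One hypothetical rule: families matching the pattern go to the group.
    private static final class Rule {
        final Pattern pattern;
        final String group;

        Rule(Pattern pattern, String group) {
            this.pattern = pattern;
            this.group = group;
        }
    }

    private final List<Rule> rules = new ArrayList<>();
    private final String defaultGroup;

    ExpressionGrouping(String defaultGroup) {
        this.defaultGroup = defaultGroup;
    }

    // Adds a rule; rules are checked in the order they were added.
    ExpressionGrouping rule(String regex, String group) {
        rules.add(new Rule(Pattern.compile(regex), group));
        return this;
    }

    // Returns the group of the first rule whose pattern matches the whole
    // column family name, or the default group if no rule matches.
    String groupFor(String columnFamily) {
        for (Rule r : rules) {
            if (r.pattern.matcher(columnFamily).matches()) {
                return r.group;
            }
        }
        return defaultGroup;
    }
}
```

With rules such as "user_\\d+" -> "users", any number of distinct families
ends up in a fixed, small set of groups, which is the property locality groups
need.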
Partitioning based on the timestamp is an interesting consideration. In this
case, you would want a small number of ranges of timestamps to be "active" (not
aged off yet) at any one time. Timestamps are a bit special, though, because
they tend to be inserted in increasing order. Instead of using the locality
group mechanism, we might achieve better performance by modifying the major
compaction selection algorithm to avoid merging files that have very different
timestamp ranges. Keeping track of timestamps on a per-file or per-block basis
would support bulk filtering, and would be as efficient as (or more efficient
than) locality groups. Might this be another approach to consider?
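
To make the per-file idea a bit more concrete, here is a minimal sketch in
plain Java. It is not Accumulo code: FileInfo, filesToScan, compactionGroups,
and the maxSpan threshold are all hypothetical, and the greedy grouping is just
one simple way to avoid merging files with very different timestamp ranges.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative only: none of these names exist in Accumulo.
class TimestampAwareSelection {

    // Hypothetical per-file metadata: smallest and largest cell timestamp.
    static final class FileInfo {
        final String name;
        final long minTs;
        final long maxTs;

        FileInfo(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        // True if the file's range intersects the inclusive window [start, end].
        boolean overlaps(long start, long end) {
            return minTs <= end && maxTs >= start;
        }
    }

    // Bulk filtering: drop whole files whose timestamp range misses the window.
    static List<FileInfo> filesToScan(List<FileInfo> files, long start, long end) {
        List<FileInfo> out = new ArrayList<>();
        for (FileInfo f : files) {
            if (f.overlaps(start, end)) {
                out.add(f);
            }
        }
        return out;
    }

    // Compaction selection: sort by minTs, then greedily extend a group only
    // while the group's combined timestamp span stays within maxSpan.
    static List<List<FileInfo>> compactionGroups(List<FileInfo> files, long maxSpan) {
        List<FileInfo> sorted = new ArrayList<>(files);
        sorted.sort(Comparator.comparingLong((FileInfo f) -> f.minTs));

        List<List<FileInfo>> groups = new ArrayList<>();
        List<FileInfo> current = new ArrayList<>();
        long groupMin = 0;
        long groupMax = 0;
        for (FileInfo f : sorted) {
            long newMax = Math.max(groupMax, f.maxTs);
            if (!current.isEmpty() && newMax - groupMin <= maxSpan) {
                current.add(f);
                groupMax = newMax;
            } else {
                if (!current.isEmpty()) {
                    groups.add(current);
                }
                current = new ArrayList<>();
                current.add(f);
                groupMin = f.minTs;
                groupMax = f.maxTs;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }
}
```

Files that never get merged across distant time ranges keep narrow timestamp
ranges, so a scan over recent data can skip most old files outright, and
age-off could in principle drop an entire file once its maxTs has expired.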
Like Aaron, I think we need some more details on envisioned scenarios in which
more generic locality groups would be useful before we jump too deeply into
implementing them.
> Generalize locality groups
> --------------------------
>
> Key: ACCUMULO-452
> URL: https://issues.apache.org/jira/browse/ACCUMULO-452
> Project: Accumulo
> Issue Type: New Feature
> Reporter: Keith Turner
> Fix For: 1.5.0
>
>
> Locality groups are a neat feature, but there is no reason to limit
> partitioning to column families. Data could be partitioned based on any
> criteria. For example, if a user is interested in querying recent data and
> ageing off old data, partitioning locality groups based on timestamp would be
> useful. This could be accomplished by letting users specify a partitioner
> plugin that is used at compaction and scan time. Scans would need an ability
> to pass options to the partitioner.