[ 
https://issues.apache.org/jira/browse/ACCUMULO-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225307#comment-13225307
 ] 

Adam Fuchs commented on ACCUMULO-452:
-------------------------------------

I had another thought on this: locality groups are good for features that are 
in a relatively constant, low cardinality set with a fairly dense distribution 
across the primary partitioning dimension. Also, queries must be aligned with 
the locality group frequently enough to amortize the cost of that partitioning. 
This means that the current column family-based locality groups only really 
help when cells in sorted order frequently oscillate between locality groups. I 
want to say that this type of feature tends to be something that is explicitly 
modeled based on how the user wants to query their data. If the user decides to 
put this information in the row or the column qualifier, could they just as 
easily put it into the column family? By the way, expressions like the ones 
John mentions in ACCUMULO-164 help to group high-cardinality features into a 
low-cardinality set of groups, so I think we're on the same page there.
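
To illustrate the idea of collapsing many column families into a few groups, here is a minimal sketch. It assumes the expressions are regex-like rules, which the comment above does not specify; the rule table, the group names, and `locality_group_for` are hypothetical and are not part of Accumulo's API.

```python
import re

# Hypothetical rules: each maps a pattern over column family names
# to one of a small, fixed set of locality group names.
GROUP_RULES = [
    ("metadata", re.compile(r"^meta_.*")),
    ("content", re.compile(r"^doc_\d+$")),
]


def locality_group_for(column_family, rules=GROUP_RULES, default="default"):
    """Return the first group whose pattern matches the column family.

    Column families that match no rule fall into the default group, so the
    number of groups stays small however many distinct families exist.
    """
    for group_name, pattern in rules:
        if pattern.match(column_family):
            return group_name
    return default
```

With rules like these, thousands of distinct `doc_<id>` families would still land in a single group, which keeps the set of groups low in cardinality.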

Partitioning based on the timestamp is an interesting consideration. In this 
case, you would want a small number of ranges of timestamps to be "active" (not 
aged off yet) at any one time. Timestamps are a bit special, though, because 
they tend to be inserted in increasing order. Instead of using the locality 
group mechanism, we might achieve better performance by modifying the major 
compaction selection algorithm to avoid merging files that have very different 
timestamp ranges. Keeping track of timestamps on a per-file or per-block basis 
would support bulk filtering, and would be at least as efficient as locality 
groups. Might this be another approach to consider?
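
The two ideas in the paragraph above (per-file timestamp tracking for bulk filtering, and compaction selection that avoids mixing distant timestamp ranges) can be sketched roughly as follows. This is only an illustration under stated assumptions: `FileMeta`, `files_overlapping`, `group_for_compaction`, and the `max_gap` threshold are hypothetical names, not Accumulo's actual file metadata or compaction code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical per-file summary: the min and max timestamp it contains."""
    name: str
    min_ts: int
    max_ts: int


def files_overlapping(files, start_ts, end_ts):
    """Bulk filtering: keep only files whose range intersects [start_ts, end_ts]."""
    return [f for f in files if f.max_ts >= start_ts and f.min_ts <= end_ts]


def group_for_compaction(files, max_gap):
    """Group files so that only files with nearby timestamp ranges are merged.

    Files are visited in order of min_ts. A file joins the current group if it
    starts no more than max_gap after the latest timestamp already in that
    group; otherwise it starts a new group.
    """
    groups = []
    for f in sorted(files, key=lambda meta: meta.min_ts):
        if groups and f.min_ts - max(g.max_ts for g in groups[-1]) <= max_gap:
            groups[-1].append(f)
        else:
            groups.append([f])
    return groups
```

Under this sketch, a scan for recent data skips whole files without opening them, and age-off can drop an entire file once its `max_ts` falls outside the retention window, provided compaction has not merged it with much newer data.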

Like Aaron, I think we need some more details on envisioned scenarios in which 
more generic locality groups would be useful before we jump too deeply into 
implementing them.
                
> Generalize locality groups
> --------------------------
>
>                 Key: ACCUMULO-452
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-452
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>
>
> Locality groups are a neat feature, but there is no reason to limit 
> partitioning to column families.  Data could be partitioned based on any 
> criteria.  For example, if a user is interested in querying recent data and 
> ageing off old data, partitioning locality groups based on timestamp would be 
> useful.  This could be accomplished by letting users specify a partitioner 
> plugin that is used at compaction and scan time.  Scans would need an ability 
> to pass options to the partitioner.


        
