[ https://issues.apache.org/jira/browse/ACCUMULO-452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13225415#comment-13225415 ]

Keith Turner edited comment on ACCUMULO-452 at 3/8/12 7:14 PM:
---------------------------------------------------------------

I am thinking that keeping a min/max timestamp per file may satisfy some use
cases but not all.  It would certainly be helpful.  The compaction algorithm
may need to be modified as Adam suggested to make it more effective.  The way
major compaction currently works in Accumulo, older data will eventually end up
in the largest file.  If the goal is to avoid reading this file under certain
circumstances, the user has no explicit control over that.  Also, if you want
to age off older data, you will probably still need to read this entire file to
do that.

If a user wants to scan the last 6 months of data, for example, and the largest
file overlaps this time range but only 10% of the data in the file matches the
range, then a lot of data needs to be filtered.  Does HBase do anything special
to deal with this case?
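
To make the per-file min/max timestamp idea concrete, here is a minimal sketch
of the two checks described above.  FileTimeBounds and the method names are
invented for illustration; they are not existing Accumulo APIs.

{code:java}
// Hypothetical sketch only: FileTimeBounds and these checks are invented
// for illustration, not existing Accumulo APIs.
public class FileTimeBoundsSketch {

    // Per-file metadata that would have to be tracked somewhere, for example
    // in the file's metadata block.
    public record FileTimeBounds(String fileName, long minTs, long maxTs) {}

    // Age-off: if every key in the file is older than the cutoff, the whole
    // file could be dropped without reading it.  The largest file produced by
    // major compaction will usually mix old and new data, so this check
    // rarely applies to it.
    public static boolean canDropEntireFile(FileTimeBounds f, long ageOffCutoff) {
        return f.maxTs() < ageOffCutoff;
    }

    // Scan pruning: a file has to be read only if its [minTs, maxTs] range
    // overlaps the scan's time range.  A file that overlaps the range but
    // where only 10% of its keys match still has to be read and filtered key
    // by key, which is the case described above.
    public static boolean mustReadFile(FileTimeBounds f, long scanMinTs, long scanMaxTs) {
        return f.maxTs() >= scanMinTs && f.minTs() <= scanMaxTs;
    }
}
{code}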

Why limit locality groups to only column families? 
 * Increases model complexity.  I think this is true, though not for the
locality group model itself: if you understand partitioning on column families,
you will easily understand the concept of partitioning on any part of the key
(a rough sketch of such a partitioner follows this list).  It does increase the
complexity of the BigTable model as a whole, though, and it would certainly
give users more rope to hang themselves.  Personally I am not opposed to this.
 * Increases code complexity.  I do not think this is true.  This would
actually simplify the code and make this functionality much easier to test in
isolation.  I have found this with iterators: they dramatically decreased the
complexity of the scan code, which was starting to get fairly complex when
iterators were first introduced.  This seems a lot cleaner than customizing the
current code to meet each new need.  Of course, end users may not care about
the complexity of the Accumulo source code; they just want it to solve their
problems.
 * There are no compelling use cases.  These must exist.  I think the original
time-based locality group is one; is there a better, simpler way to achieve
this?  If so, that would remove this use case.  The HBase design is simpler in
terms of the model, but the code sounds more complex.  Also, this model does
not give the user explicit control without allowing them to configure the
compaction process in some complex way.
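
To make the idea of partitioning on any part of the key concrete, below is one
hypothetical shape a partitioner plugin could take.  The interface name and
methods are invented for illustration; the actual proposal is in the attached
PartitionerDesign.txt.

{code:java}
import java.util.Map;
import java.util.Set;

import org.apache.accumulo.core.data.Key;

// Hypothetical plugin interface, invented for illustration only.
public interface KeyPartitioner {

    // Configure the partitioner from per-table options, e.g. a time bucket width.
    void init(Map<String, String> options);

    // Assign a key to a named partition (a generalized locality group), for
    // example by timestamp bucket instead of by column family.  Used at
    // compaction time to decide which partition a key is written to.
    String getPartition(Key key);

    // Given scan-time options, return the partitions the scan must read,
    // allowing whole partitions to be skipped.
    Set<String> partitionsForScan(Map<String, String> scanOptions);
}
{code}

With something like this, a time-based locality group would just be a
partitioner that buckets keys by timestamp, and a scan over the last 6 months
would only open the matching partitions.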

                
> Generalize locality groups
> --------------------------
>
>                 Key: ACCUMULO-452
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-452
>             Project: Accumulo
>          Issue Type: New Feature
>            Reporter: Keith Turner
>             Fix For: 1.5.0
>
>         Attachments: PartitionerDesign.txt
>
>
> Locality groups are a neat feature, but there is no reason to limit 
> partitioning to column families.  Data could be partitioned based on any 
> criteria.  For example, if a user is interested in querying recent data and 
> ageing off old data, partitioning locality groups based on timestamp would be 
> useful.  This could be accomplished by letting users specify a partitioner 
> plugin that is used at compaction and scan time.  Scans would need an ability 
> to pass options to the partitioner.
