Github user keith-turner commented on the issue:
https://github.com/apache/accumulo/pull/180
> For example, we would want to avoid storing 1M CVs if a user had that
> many in a table (for some reason).
I think we should address this issue in some way while considering the
following.
* Fetching summaries should be relatively fast. Gigantic summaries will
stymie this goal.
* When a user's summarizer does produce a gigantic summary, it would be
nice if we helped them debug it.
I am thinking one way to accomplish these goals is to store gigantic
summaries, but only read summaries under a certain size. The size of the
serialized summary could be written first, so that when a summary is read
this size is the first bit of info available. If the summary is over a
certain size, an error could be logged and that file would be treated as if
it had no summary. We could also add an enum that indicates gigantic
summaries were present. Since the summary is still stored, the user would
have a chance to use rfile-info to look at what's in the summary for
debugging.
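A minimal sketch of the length-prefix idea described above (the names `MAX_SUMMARY_SIZE`, `writeSummary`, and `readSummary` are hypothetical illustrations, not Accumulo's actual API): the writer records the serialized size first, so the reader can cheaply skip a gigantic summary and treat the file as having none.

```java
import java.io.*;
import java.util.Optional;

public class SummaryIo {
    // Assumed threshold for illustration; a real limit would be configurable.
    static final int MAX_SUMMARY_SIZE = 1 << 20; // 1 MiB

    // Write the summary's serialized size first, then its bytes.
    static void writeSummary(DataOutputStream out, byte[] summary) throws IOException {
        out.writeInt(summary.length);
        out.write(summary);
    }

    // Read the size prefix; if it exceeds the limit, log an error and skip
    // the bytes, so the file is treated as if it had no summary.
    static Optional<byte[]> readSummary(DataInputStream in) throws IOException {
        int len = in.readInt();
        if (len > MAX_SUMMARY_SIZE) {
            System.err.println("Summary too large (" + len + " bytes); ignoring.");
            in.skipBytes(len); // position the stream past the oversized summary
            return Optional.empty();
        }
        byte[] data = new byte[len];
        in.readFully(data);
        return Optional.of(data);
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeSummary(out, new byte[] {1, 2, 3});           // small: readable
        writeSummary(out, new byte[MAX_SUMMARY_SIZE + 1]); // gigantic: skipped

        DataInputStream in =
            new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        System.out.println(readSummary(in).isPresent()); // small summary read
        System.out.println(readSummary(in).isPresent()); // gigantic one ignored
    }
}
```

Since the size is written before the summary, the reader never has to deserialize the gigantic payload to decide to skip it, which keeps summary fetches fast even when a misbehaving summarizer is present.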
We also need to stress in the javadoc that summaries are intended to be
small.