Re: micro compaction

David Medinets Tue, 09 Jun 2015 12:47:39 -0700

Consider using Storm, Pig, Spark, or your own framework to handle the
in-memory aggregation before giving the data to the BatchWriter. Why would
any part of Accumulo code be responsible for this kind of
application-specific data handling?
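
A minimal sketch of that kind of client-side aggregation, assuming the
per-key aggregate is a plain sum (as it would be with a summing combiner on
the table). The class name ClientSideCombiner, the maxKeys bound, and the
sink callback are hypothetical names used for illustration only. In a real
ingest job the sink would build a Mutation from each combined entry and hand
it to BatchWriter.addMutation.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

/**
 * Buffers (key, delta) updates in memory and sums them per key, so that one
 * combined update per distinct key reaches the sink on each flush.
 */
final class ClientSideCombiner {
    // Sorted map so that flushed entries come out in key order.
    private final Map<String, Long> buffer = new TreeMap<>();
    private final int maxKeys;
    private final BiConsumer<String, Long> sink;
    private long updatesSeen;
    private long updatesEmitted;

    ClientSideCombiner(int maxKeys, BiConsumer<String, Long> sink) {
        if (maxKeys < 1) {
            throw new IllegalArgumentException("maxKeys must be at least 1");
        }
        this.maxKeys = maxKeys;
        this.sink = sink;
    }

    /** Adds one update; flushes when the buffer holds maxKeys distinct keys. */
    void add(String key, long delta) {
        updatesSeen++;
        buffer.merge(key, delta, Long::sum);
        if (buffer.size() >= maxKeys) {
            flush();
        }
    }

    /** Sends every buffered (key, sum) pair to the sink and empties the buffer. */
    void flush() {
        for (Map.Entry<String, Long> e : buffer.entrySet()) {
            sink.accept(e.getKey(), e.getValue());
            updatesEmitted++;
        }
        buffer.clear();
    }

    long updatesSeen() { return updatesSeen; }

    long updatesEmitted() { return updatesEmitted; }
}
```

The caller has to invoke flush() once more when the input is exhausted (for
example in the Mapper's cleanup step). Because the buffer is bounded, the
same key can be emitted by more than one flush, so the partial sums still
need to be combined on the server side.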


On Tue, Jun 9, 2015 at 3:17 PM, [email protected] <
[email protected]> wrote:

>  Just to clarify the origin of my question.
>
>
>
> I had to do some performance tests to compare different storage types of
> “raw” data against each other.
>
>
>
> Hopefully the picture below is visible on the mailing list. If not, I will
> put it somewhere else.
>
>
>
> 6 million “original” records, 1.3GB data, 233 bytes per record
>
> Each original record has 40 tab-delimited fields, of which 19 on average
> are non-null
>
> BatchWriter, single Java program
>
>
>
> Bars 1-3: a single “heavy” mutation that inserts the whole tabular line /
> serialized object.
>
> Bars 4-7: composite mutations (all qualifiers for the same rowid in one
> mutation).
>
> Bars 8-11: individual mutations (each qualifier for the same rowid in a
> separate mutation), ~19 mutations per original record.
>
>
>
> On average, single “heavy” mutations are 7-10 times faster than anything
> else, and composite mutations are 10%-35% faster than individual ones.
>
>
>
> I am not an expert on how Accumulo is implemented internally, but it
> looks like a composite mutation is treated more or less the same way as a
> set of individual mutations. The largest overhead is probably added by the
> WAL.
>
>
>
>
>
> Data utilization before and after manual compaction of test table and all
> system tables:
>
>
>
>
>
> It’s not clear why “accumulo du” reports half as much data in use as
> “hdfs du”.
>
>
>
> All these tests made us think that we can improve performance by doing
> some calculations in memory (our use case fits this very well) and so
> reducing the number of mutations. Now I am trying to understand whether
> there is a relatively easy way to do this with Accumulo or whether it’s
> time to look more closely at something like Spark.
>
>
>
> Thanks
>
> Roman
>
> *From:* Adam Fuchs [mailto:[email protected]]
> *Sent:* 09 June 2015 19:08
>
> *To:* [email protected]
> *Subject:* Re: micro compaction
>
>
>
> I think this might be the same concept as in-mapper combining, but applied
> to data being sent to a BatchWriter rather than an OutputCollector. See
> [1], section 3.1.1. A similar performance analysis and probably a lot of
> the same code should apply here.
>
>
>
> Cheers,
>
> Adam
>
>
>
> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
>
>
> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <[email protected]>
> wrote:
>
> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
>
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
>
>
> -Russ
>
>
>
> On Tue, Jun 9, 2015 at 9:08 AM [email protected] <
> [email protected]> wrote:
>
> The aggregated output is tiny, so if I do the same calculations in memory
> (instead of sending mutations to Accumulo), I can reduce the overall number
> of mutations by 1000x or so.
>
>
>
> -----Original Message-----
> From: Josh Elser [mailto:[email protected]]
> Sent: 09 June 2015 16:54
> To: [email protected]
> Subject: Re: micro compaction
>
> Well, you win the prize for new terminology. I haven't ever heard the term
> "micro compaction" before.
>
> Can you clarify, though? You say hundreds of millions of mutations that
> result in megabytes of data. Is that an increase or a decrease in size?
> Comparing apples to oranges :)
>
> [email protected] wrote:
> > Hi guys,
> >
> > While doing pre-analytics we generate hundreds of millions of
> > mutations that result in 1-100 megabytes of useful data after major
> > compaction. We ingest into Accumulo using MapReduce, from the Mapper. We
> > found that performance degrades significantly as the number of
> > mutations increases.
> >
> > The obvious improvement is to do some calculations in-memory before
> > sending mutations to Accumulo.
> >
> > Of course, at the same time we are looking for a solution to minimize
> > development effort.
> >
> > I guess I am asking about micro compaction/ingest-time iterators on
> > the client side (before data is sent to Accumulo).
> >
> > To my understanding, Accumulo does not support them. Is that correct?
> > And if so, are there any plans to support this functionality in the
> > future?
> >
> > Thanks
> >
> > Roman
> >
> > Please consider the environment before printing this email. This
> > message should be regarded as confidential. If you have received this
> > email in error please notify the sender and destroy it immediately.
> > Statements of intent shall only become binding when confirmed in hard
> > copy by an authorised signatory. The contents of this email may relate
> > to dealings with other companies under the control of BAE Systems
> > Applied Intelligence Limited, details of which can be found at
> > http://www.baesystems.com/Businesses/index.htm.
