Re: micro compaction

Russ Weeks Tue, 09 Jun 2015 12:55:09 -0700

For consistency and ease of implementation. Say I've written a stack of
combiners that do statistical aggregation, sampling etc. on my table.
Rather than port that logic to a Storm topology or to the DStream API I'd
just like to turn that stack on in my BatchWriter.


On Tue, Jun 9, 2015 at 12:47 PM David Medinets <[email protected]>
wrote:

> Consider using Storm, Pig, Spark, or your own framework to handle the
> in-memory aggregation before giving the data to the BatchWriter. Why would
> any part of Accumulo code be responsible for this kind of
> application-specific data handling?
>
> On Tue, Jun 9, 2015 at 3:17 PM, [email protected] <
> [email protected]> wrote:
>
>>  Just to clarify the origin of my question.
>>
>>
>>
>> I had to do some performance tests to compare different storage types of
>> “raw” data against each other.
>>
>>
>>
>> Hopefully the picture below is visible on the mailing list. If not, I will
>> put it somewhere else.
>>
>>
>>
>> 6 million “original” records, 1.3GB data, 233 bytes per record
>>
>> Each original record has 40 tab-delimited fields, of which 19 on average
>> are non-null
>>
>> BatchWriter, single Java program
>>
>>
>>
>> The first three bars represent a single “heavy” mutation that inserts the
>> whole tabular line / serialized object.
>>
>> Bars 4, 5, 6, 7 – composite mutations (all qualifiers for the same rowid in
>> one mutation)
>>
>> Bars 8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid
>> in separate mutations), ~19 mutations per original record
>>
>>
>>
>> On average, single “heavy” mutations are 7-10 times faster than anything
>> else; composite mutations are 10%-35% faster than individual ones.
>>
>>
>>
>> I am not an expert on how Accumulo is implemented internally, but it
>> looks like a composite mutation is treated more or less the same way as a
>> set of individual mutations. The largest overhead is probably added by the
>> WAL.
>>
>>
>>
>>
>>
>> Data utilization before and after manual compaction of test table and all
>> system tables:
>>
>>
>>
>>
>>
>> It’s not clear why “accumulo du” shows half as much data used as
>> “hdfs du”.
>>
>>
>>
>> All these tests made us think that we can improve performance by doing
>> some calculations in memory (our use case fits this very well) and reducing
>> the number of mutations. Now I am trying to understand whether there is a
>> relatively easy way to do this with Accumulo or whether it’s time to look
>> more closely at something like Spark.
>>
>>
>>
>> Thanks
>>
>> Roman
>>
>> *From:* Adam Fuchs [mailto:[email protected]]
>> *Sent:* 09 June 2015 19:08
>>
>> *To:* [email protected]
>> *Subject:* Re: micro compaction
>>
>>
>>
>> I think this might be the same concept as in-mapper combining, but
>> applied to data being sent to a BatchWriter rather than an OutputCollector.
>> See [1], section 3.1.1. A similar performance analysis and probably a lot
>> of the same code should apply here.
>>
>>
>>
>> Cheers,
>>
>> Adam
>>
>>
>>
>> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>>
>>
>>
>> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <[email protected]>
>> wrote:
>>
>> Having a combiner stack (more generally an iterator stack) run on the
>> client-side seems to be the second most popular request on this list. The
>> most popular being, "How do I write to Accumulo from inside an iterator?"
>>
>>
>>
>> Such a thing would be very useful for me, too. I have some cycles to help
>> out, if somebody can give me an idea of where to get started and where the
>> potential land-mines are.
>>
>>
>>
>> -Russ
>>
>>
>>
>> On Tue, Jun 9, 2015 at 9:08 AM [email protected] <
>> [email protected]> wrote:
>>
>> The aggregated output is tiny, so if I do the same calculations in memory
>> (instead of sending mutations to Accumulo), I can reduce the overall number
>> of mutations by 1000x or so
>>
>>
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:[email protected]]
>> Sent: 09 June 2015 16:54
>> To: [email protected]
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology. I haven't ever heard the
>> term "micro compaction" before.
>>
>> Can you clarify, though? You say hundreds of millions of mutations
>> result in megabytes of data. Is that an increase or a decrease in size?
>> Comparing apples to oranges :)
>>
>> [email protected] wrote:
>> > Hi guys,
>> >
>> > While doing pre-analytics we generate hundreds of millions of
>> > mutations that result in 1-100 megabytes of useful data after major
>> > compaction. We ingest into Accumulo using MR, from a Mapper job. We
>> > found that performance degrades badly as the number of mutations
>> > increases.
>> >
>> > The obvious improvement is to do some calculations in-memory before
>> > sending mutations to Accumulo.
>> >
>> > Of course, at the same time we are looking for a solution to minimize
>> > development effort.
>> >
>> > I guess I am asking about micro compaction/ingest-time iterators on
>> > the client side (before data is sent to Accumulo).
>> >
>> > To my understanding, Accumulo does not support them. Is that correct?
>> > And if so, are there any plans to support this functionality in the
>> > future?
>> >
>> > Thanks
>> >
>> > Roman
>> >
>> > Please consider the environment before printing this email. This
>> > message should be regarded as confidential. If you have received this
>> > email in error please notify the sender and destroy it immediately.
>> > Statements of intent shall only become binding when confirmed in hard
>> > copy by an authorised signatory. The contents of this email may relate
>> > to dealings with other companies under the control of BAE Systems
>> > Applied Intelligence Limited, details of which can be found at
>> > http://www.baesystems.com/Businesses/index.htm.
>
>
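
For readers of the archive, here is a minimal sketch of the client-side pre-aggregation discussed in this thread: in-mapper combining applied to whatever feeds a BatchWriter. It is not an Accumulo API. The class name CombiningBuffer, the maxKeys limit and the sink callback are all hypothetical, and summing is assumed as the aggregation. In a real client the sink would build a Mutation per key and hand it to BatchWriter.addMutation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Hypothetical client-side combining buffer (not part of Accumulo).
 * Sums values per key in memory and emits one partial sum per key on flush.
 */
class CombiningBuffer implements AutoCloseable {
    private final Map<String, Long> sums = new HashMap<>();
    private final int maxKeys;
    private final BiConsumer<String, Long> sink;

    CombiningBuffer(int maxKeys, BiConsumer<String, Long> sink) {
        if (maxKeys < 1) {
            throw new IllegalArgumentException("maxKeys must be >= 1");
        }
        this.maxKeys = maxKeys;
        this.sink = sink;
    }

    /** Fold one update into the in-memory sum for its key. */
    void add(String key, long value) {
        sums.merge(key, value, Long::sum);
        // Bound memory: once more than maxKeys distinct keys are buffered,
        // emit everything and start over.
        if (sums.size() > maxKeys) {
            flush();
        }
    }

    /** Emit one (key, partial sum) pair per buffered key, then clear the buffer. */
    void flush() {
        sums.forEach(sink);
        sums.clear();
    }

    @Override
    public void close() {
        flush();
    }

    public static void main(String[] args) {
        // This sink only prints. A real one would create a Mutation for the key
        // and pass it to BatchWriter.addMutation (assumption for illustration).
        try (CombiningBuffer buffer = new CombiningBuffer(
                1000, (key, sum) -> System.out.println(key + " -> " + sum))) {
            for (int i = 0; i < 6; i++) {
                buffer.add(i % 2 == 0 ? "even" : "odd", 1L);
            }
        }
    }
}
```

Because the buffer flushes whenever it fills up, the same key can be emitted more than once with different partial sums. A combiner configured on the table would still be needed to merge those partial sums on the server side; the buffer only reduces how many mutations are sent.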
