Consider using Storm, Pig, Spark, or your own framework to handle the in-memory aggregation before giving the data to the BatchWriter. Why would any part of Accumulo code be responsible for this kind of application-specific data handling?
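To make the "your own framework" option a bit more concrete, here is a rough sketch of client-side pre-aggregation. It is only an illustration under stated assumptions: the class name `ClientSideCombiner`, the plain `String` key (standing in for a row/column key), the `maxEntries` flush threshold, and the `sink` callback are all hypothetical names of mine, and the aggregation is assumed to be a simple sum. In a real ingest job the sink is where you would build a Mutation and pass it to BatchWriter.addMutation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Hypothetical client-side combiner (not part of Accumulo): sums values
 * per key in memory and hands each key's partial total to a sink only
 * when the buffer is flushed.
 */
class ClientSideCombiner {
    private final Map<String, Long> pending = new HashMap<>();
    private final int maxEntries;
    private final BiConsumer<String, Long> sink;

    ClientSideCombiner(int maxEntries, BiConsumer<String, Long> sink) {
        this.maxEntries = maxEntries;
        // Assumption: in real code the sink would create one Mutation per
        // key and give it to a BatchWriter.
        this.sink = sink;
    }

    /** Fold one update into the buffer; flush once it holds maxEntries distinct keys. */
    void add(String key, long value) {
        pending.merge(key, value, Long::sum);
        if (pending.size() >= maxEntries) {
            flush();
        }
    }

    /** Emit one aggregated entry per buffered key, then clear the buffer. */
    void flush() {
        pending.forEach(sink);
        pending.clear();
    }
}
```

Call flush() once more when the input is exhausted (for example from a mapper's cleanup method). This only works for aggregations that are associative and commutative (sums, counts, min/max), because partial totals flushed at different times still have to be combined again on the server side.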
On Tue, Jun 9, 2015 at 3:17 PM, [email protected] <[email protected]> wrote:

> Just to clarify the origin of my question.
>
> I had to do some performance tests to compare different storage types of
> “raw” data against each other.
>
> Hopefully, the picture below is visible in the mailing list. If not, I
> will put it somewhere else.
>
> 6 million “original” records, 1.3GB data, 233 bytes per record
> Each original record is 40 fields delimited by tab, on average 19 are
> not null
> BatchWriter, single Java program
>
> The first three bars represent a single “heavy” mutation that inserts the
> whole tabular line / serialized object.
> Bars 4, 5, 6, 7: composite mutation (all qualifiers for the same rowid in
> one mutation)
> Bars 8, 9, 10, 11: individual mutations (all qualifiers for the same rowid
> in separate mutations), ~19 mutations per original record
>
> On average, single “heavy” mutations are 7-10 times faster than anything
> else, and composite mutations are 10%-35% faster than individual ones.
>
> I am not an expert on how Accumulo is implemented internally, however it
> looks like a composite mutation is treated more or less in the same way
> as a set of individual mutations. Probably the largest overhead is added
> by the WAL.
>
> Data utilization before and after manual compaction of the test table and
> all system tables:
>
> It’s not clear why “accumulo du” shows half as much data used compared to
> “hdfs du”.
>
> All these tests made us think that we can improve performance by doing
> some calculations in-memory (and our use-case fits very well) and reducing
> the number of mutations. Now I am trying to understand whether there is a
> relatively easy way to do this with Accumulo or whether it’s time to look
> closer into something like Spark.
>
> Thanks
> Roman
>
> *From:* Adam Fuchs [mailto:[email protected]]
> *Sent:* 09 June 2015 19:08
> *To:* [email protected]
> *Subject:* Re: micro compaction
>
> I think this might be the same concept as in-mapper combining, but applied
> to data being sent to a BatchWriter rather than an OutputCollector. See
> [1], section 3.1.1. A similar performance analysis and probably a lot of
> the same code should apply here.
>
> Cheers,
> Adam
>
> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>
> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <[email protected]> wrote:
>
> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
> -Russ
>
> On Tue, Jun 9, 2015 at 9:08 AM [email protected] <[email protected]> wrote:
>
> Aggregated output is tiny, so if I do the same calculations in memory
> (instead of sending mutations to Accumulo), I can reduce the overall
> number of mutations by 1000x or so
>
> -----Original Message-----
> From: Josh Elser [mailto:[email protected]]
> Sent: 09 June 2015 16:54
> To: [email protected]
> Subject: Re: micro compaction
>
> Well, you win the prize for new terminology. I haven't ever heard the term
> "micro compaction" before.
>
> Can you clarify though, you say hundreds of millions of mutations that
> result in megabytes of data. Is that an increase or decrease in size?
> Comparing apples to oranges :)
>
> [email protected] wrote:
> > Hi guys,
> >
> > While doing pre-analytics we generate hundreds of millions of
> > mutations that result in 1-100 megabytes of useful data after major
> > compaction.
> > We ingest into Accumulo using MR from a Mapper job. We identified that
> > performance really degrades as the number of mutations increases.
> >
> > The obvious improvement is to do some calculations in-memory before
> > sending mutations to Accumulo.
> >
> > Of course, at the same time we are looking for a solution to minimize
> > development effort.
> >
> > I guess I am asking about micro compaction/ingest-time iterators on
> > the client side (before data is sent to Accumulo).
> >
> > To my understanding, Accumulo does not support them, is that correct?
> > And if so, are there any plans to support this functionality in the
> > future?
> >
> > Thanks
> >
> > Roman
> >
> > Please consider the environment before printing this email. This
> > message should be regarded as confidential. If you have received this
> > email in error please notify the sender and destroy it immediately.
> > Statements of intent shall only become binding when confirmed in hard
> > copy by an authorised signatory. The contents of this email may relate
> > to dealings with other companies under the control of BAE Systems
> > Applied Intelligence Limited, details of which can be found at
> > http://www.baesystems.com/Businesses/index.htm.
