For consistency and ease of implementation. Say I've written a stack of combiners that do statistical aggregation, sampling, etc. on my table. Rather than port that logic to a Storm topology or to the DStream API, I'd just like to turn that stack on in my BatchWriter.
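
A minimal sketch of what "turning the stack on in the BatchWriter" could look like, written in Python for brevity and assuming a simple summing combiner. `CombiningBuffer` and `sink` are hypothetical names, not part of the Accumulo API; `sink` stands in for whatever code would turn the combined values into mutations and hand them to a real BatchWriter.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

# (row, column family, column qualifier); visibility and timestamp are omitted
Key = Tuple[str, str, str]


class CombiningBuffer:
    """Hypothetical client-side buffer that sums values per key before writing.

    `sink` is an assumption: any callable that accepts a dict of combined
    values, for example a wrapper that builds mutations for a BatchWriter.
    """

    def __init__(self, sink: Callable[[Dict[Key, int]], None], max_keys: int = 10_000):
        self._sink = sink
        self._max_keys = max_keys
        self._pending: Dict[Key, int] = defaultdict(int)

    def add(self, row: str, family: str, qualifier: str, value: int) -> None:
        # Combine in memory instead of emitting one mutation per update.
        self._pending[(row, family, qualifier)] += value
        # Bound memory use: flush once enough distinct keys are buffered.
        if len(self._pending) >= self._max_keys:
            self.flush()

    def flush(self) -> None:
        # Hand the combined values to the sink, then start a fresh buffer.
        if self._pending:
            self._sink(dict(self._pending))
            self._pending.clear()
```

Calling `add` many times for the same key results in a single combined value reaching `sink`, which is where the reduction in mutation count would come from. A real version would need to run the same combiner classes that are configured on the table, which this sketch does not attempt.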
On Tue, Jun 9, 2015 at 12:47 PM David Medinets <[email protected]> wrote:

> Consider using Storm, Pig, Spark, or your own framework to handle the
> in-memory aggregation before giving the data to the BatchWriter. Why would
> any part of Accumulo code be responsible for this kind of
> application-specific data handling?
>
> On Tue, Jun 9, 2015 at 3:17 PM, [email protected] <[email protected]> wrote:
>
>> Just to clarify the origin of my question.
>>
>> I had to do some performance tests to compare different storage types of
>> “raw” data against each other.
>>
>> Hopefully, picture below is visible in the mailing list. If not, I will
>> put it somewhere else.
>>
>> 6 million “original” records, 1.3GB data, 233 bytes per record
>>
>> Each original record is 40 fields delimited by tab, on average 19 – not
>> null
>>
>> Batchwriter, single java program
>>
>> First three bars represent single “heavy” mutation to insert the whole
>> tabular line / serialized object.
>>
>> 4,5,6,7 bars – composite mutation (all qualifiers for the same rowid in
>> one mutation)
>>
>> 8, 9, 10, 11 – individual mutations (all qualifiers for the same rowid in
>> separate mutations) - ~19 mutations per original record
>>
>> On average, single “heavy” mutations are 7-10 times faster than anything
>> else, composite are 10%-35% faster than individual.
>>
>> I am not an expert how Accumulo is implemented internally, however it
>> looks like composite mutation is treated more or less in the same way as a
>> set of individual mutations. Probably, largest overhead is added by WAL.
>>
>> Data utilization before and after manual compaction of test table and all
>> system tables:
>>
>> It’s not clear why “accumulo du” shows twice less data used comparing to
>> “hdfs du”.
>>
>> All these tests made us think that we can improve performance by doing
>> some calculations in-memory (and our use-case fits very well) and reducing
>> number of mutations. Now I am trying to understand whether there is a
>> relatively easy way to do this with Accumulo or whether it’s time to look
>> closer into something like Spark.
>>
>> Thanks
>>
>> Roman
>>
>> *From:* Adam Fuchs [mailto:[email protected]]
>> *Sent:* 09 June 2015 19:08
>> *To:* [email protected]
>> *Subject:* Re: micro compaction
>>
>> I think this might be the same concept as in-mapper combining, but
>> applied to data being sent to a BatchWriter rather than an OutputCollector.
>> See [1], section 3.1.1. A similar performance analysis and probably a lot
>> of the same code should apply here.
>>
>> Cheers,
>>
>> Adam
>>
>> [1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf
>>
>> On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <[email protected]> wrote:
>>
>> Having a combiner stack (more generally an iterator stack) run on the
>> client-side seems to be the second most popular request on this list. The
>> most popular being, "How do I write to Accumulo from inside an iterator?"
>>
>> Such a thing would be very useful for me, too. I have some cycles to help
>> out, if somebody can give me an idea of where to get started and where the
>> potential land-mines are.
>>
>> -Russ
>>
>> On Tue, Jun 9, 2015 at 9:08 AM [email protected] <[email protected]> wrote:
>>
>> Aggregated output is tiny, so if I do same calculations in memory
>> (instead of sending mutations to Accumulo) , I can reduce overall number of
>> mutations by 1000x or so
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:[email protected]]
>> Sent: 09 June 2015 16:54
>> To: [email protected]
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology.
>> I haven't ever heard the term "micro compaction" before.
>>
>> Can you clarify though, you say hundreds of millions of mutations that
>> result in megabytes of data. Is that an increase or decrease in size.
>> Comparing apples to oranges :)
>>
>> [email protected] wrote:
>> > Hi guys,
>> >
>> > While doing pre-analytics we generate hundreds of millions of
>> > mutations that result in 1-100 megabytes of useful data after major
>> > compaction. We ingest into Accumulo using MR from Mapper job. We
>> > identified that performance really degrades while increasing a number
>> > of mutations.
>> >
>> > The obvious improvement is to do some calculations in-memory before
>> > sending mutations to Accumulo.
>> >
>> > Of course, at the same time we are looking for a solution to minimize
>> > development effort.
>> >
>> > I guess I am asking about micro compaction/ingest-time iterators on
>> > the client side (before data is sent to Accumulo).
>> >
>> > To my understanding, Accumulo does not support them, is it correct?
>> > And if so, are there any plans to support this functionality in the
>> > future?
>> >
>> > Thanks
>> >
>> > Roman
>> >
>> > Please consider the environment before printing this email. This
>> > message should be regarded as confidential. If you have received this
>> > email in error please notify the sender and destroy it immediately.
>> > Statements of intent shall only become binding when confirmed in hard
>> > copy by an authorised signatory. The contents of this email may relate
>> > to dealings with other companies under the control of BAE Systems
>> > Applied Intelligence Limited, details of which can be found at
>> > http://www.baesystems.com/Businesses/index.htm.
