I think this might be the same concept as in-mapper combining, but applied to data being sent to a BatchWriter rather than an OutputCollector. See [1], section 3.1.1. A similar performance analysis and probably a lot of the same code should apply here.
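To make the idea concrete, here is a minimal sketch of client-side combining. It is not code from [1] or from Accumulo; the class name `InMemoryCombiner`, the `maxEntries` threshold and the `sink` callback are all hypothetical. It sums values per key in memory and only hands the aggregated entries to a sink, which in a real job would presumably build a Mutation and pass it to the BatchWriter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/**
 * Hypothetical client-side combiner: sums values per key in memory and
 * emits one aggregated entry per key when flushed.
 */
public class InMemoryCombiner {
    private final Map<String, Long> buffer = new HashMap<>();
    private final int maxEntries;
    // Stand-in for "build a Mutation and call BatchWriter.addMutation" (assumption).
    private final BiConsumer<String, Long> sink;

    public InMemoryCombiner(int maxEntries, BiConsumer<String, Long> sink) {
        this.maxEntries = maxEntries;
        this.sink = sink;
    }

    /** Adds a value to the running sum for the key; flushes when the buffer is full. */
    public void add(String key, long value) {
        buffer.merge(key, value, Long::sum);
        if (buffer.size() >= maxEntries) {
            flush();
        }
    }

    /** Emits every buffered partial sum to the sink and clears the buffer. */
    public void flush() {
        buffer.forEach(sink);
        buffer.clear();
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        InMemoryCombiner combiner =
            new InMemoryCombiner(100, (k, v) -> emitted.add(k + "=" + v));

        // 1000 updates spread over 3 keys collapse into 3 emitted entries.
        for (int i = 0; i < 1000; i++) {
            combiner.add("row" + (i % 3), 1L);
        }
        combiner.flush(); // in a Mapper this would go in cleanup()

        System.out.println(emitted.size() + " entries emitted for 1000 updates");
    }
}
```

Because the buffer can be flushed before all input is seen, a key may be emitted more than once as partial sums, so the table would still need a server-side combiner (for example a SummingCombiner) to merge them. The buffer size trades memory in the mapper against the number of mutations sent.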

Cheers,
Adam

[1] http://lintool.github.io/MapReduceAlgorithms/MapReduce-book-final.pdf

On Tue, Jun 9, 2015 at 1:02 PM, Russ Weeks <[email protected]> wrote:

> Having a combiner stack (more generally an iterator stack) run on the
> client-side seems to be the second most popular request on this list. The
> most popular being, "How do I write to Accumulo from inside an iterator?"
>
> Such a thing would be very useful for me, too. I have some cycles to help
> out, if somebody can give me an idea of where to get started and where the
> potential land-mines are.
>
> -Russ
>
> On Tue, Jun 9, 2015 at 9:08 AM [email protected] <
> [email protected]> wrote:
>
>> Aggregated output is tiny, so if I do same calculations in memory
>> (instead of sending mutations to Accumulo) , I can reduce overall number of
>> mutations by 1000x or so
>>
>> -----Original Message-----
>> From: Josh Elser [mailto:[email protected]]
>> Sent: 09 June 2015 16:54
>> To: [email protected]
>> Subject: Re: micro compaction
>>
>> Well, you win the prize for new terminology. I haven't ever heard the
>> term "micro compaction" before.
>>
>> Can you clarify though, you say hundreds of millions of mutations that
>> result in megabytes of data. Is that an increase or decrease in size.
>> Comparing apples to oranges :)
>>
>> [email protected] wrote:
>> > Hi guys,
>> >
>> > While doing pre-analytics we generate hundreds of millions of
>> > mutations that result in 1-100 megabytes of useful data after major
>> > compaction. We ingest into Accumulo using MR from Mapper job. We
>> > identified that performance really degrades while increasing a number
>> > of mutations.
>> >
>> > The obvious improvement is to do some calculations in-memory before
>> > sending mutations to Accumulo.
>> >
>> > Of course, at the same time we are looking for a solution to minimize
>> > development effort.
>> >
>> > I guess I am asking about micro compaction/ingest-time iterators on
>> > the client side (before data is sent to Accumulo).
>> >
>> > To my understanding, Accumulo does not support them, is it correct?
>> > And if so, are there any plans to support this functionality in the
>> > future?
>> >
>> > Thanks
>> >
>> > Roman
>> >
>> > Please consider the environment before printing this email. This
>> > message should be regarded as confidential. If you have received this
>> > email in error please notify the sender and destroy it immediately.
>> > Statements of intent shall only become binding when confirmed in hard
>> > copy by an authorised signatory. The contents of this email may relate
>> > to dealings with other companies under the control of BAE Systems
>> > Applied Intelligence Limited, details of which can be found at
>> > http://www.baesystems.com/Businesses/index.htm.
