Re: Write data to a file from inside of an iterator

Dylan Hutchison Sat, 05 Nov 2016 11:04:14 -0700

Ah, this is the use case for Graphulo <http://graphulo.mit.edu/>'s OneTable
<https://github.com/Accla/graphulo/blob/master/src/main/java/edu/mit/ll/graphulo/Graphulo.java#L807>
call.
Internally the OneTable call sets up a special iterator
(RemoteWriteIterator) that does open a BatchWriter.  The main trick that
allows it to write entries safely is pushing row/column filters into the
iterator, so that the iterator controls re-seeking rather than Accumulo.
This allows the iterator to write all its entries and close() without
having to worry about Accumulo tearing it down.  See the docs
<https://github.com/Accla/graphulo/blob/master/docs/START_HERE_2016-03-28-Graphulo-UseDesign.pdf>
for a starter.
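
To make the idea concrete, here is a minimal sketch of the "write everything
in one pass, then close" pattern.  It is plain Java and deliberately uses no
Accumulo or Graphulo classes: Sink is a hypothetical stand-in for a
BatchWriter, the TreeMap stands in for the tablet's sorted data, and
writeMatching is an illustrative name, not Graphulo's API.  The point is only
that the filter lives inside the scan loop and close() runs before anything
is handed back to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

/** Plain-Java model of the pattern; no Accumulo classes are used. */
class WriteInOnePassSketch {

    /** Hypothetical stand-in for a BatchWriter: buffers entries until close(). */
    static class Sink implements AutoCloseable {
        final List<String> flushed = new ArrayList<>();
        private final List<String> buffer = new ArrayList<>();
        boolean closed = false;

        void add(String row, String value) {
            if (closed) {
                throw new IllegalStateException("sink already closed");
            }
            buffer.add(row + "=" + value);
        }

        @Override
        public void close() {
            // Flush everything that was buffered, then mark the sink closed.
            flushed.addAll(buffer);
            buffer.clear();
            closed = true;
        }
    }

    /**
     * Scans the whole source itself, applies the row filter internally,
     * writes the matching entries to the sink, closes the sink, and returns
     * only a count of what was written.
     */
    static int writeMatching(TreeMap<String, String> source,
                             Predicate<String> rowFilter,
                             Sink sink) {
        int written = 0;
        try (Sink s = sink) {
            for (Map.Entry<String, String> e : source.entrySet()) {
                if (rowFilter.test(e.getKey())) {
                    s.add(e.getKey(), e.getValue());
                    written++;
                }
            }
        } // close() runs here, before the count is returned
        return written;
    }

    public static void main(String[] args) {
        TreeMap<String, String> source = new TreeMap<>();
        source.put("row1", "a");
        source.put("row2", "b");
        source.put("other", "c");

        Sink sink = new Sink();
        int written = writeMatching(source, row -> row.startsWith("row"), sink);
        System.out.println(written + " entries written, closed=" + sink.closed);
    }
}
```

Because the caller only ever receives the final count, there is no point in
this model at which it could stop consuming results while the writer is still
open.  A real iterator has more to handle than this sketch shows.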

*cue Josh to warn against the evils of re-purposing tablet servers for
MapReduce cycles* =)

Really, this is advanced stuff.  Graphulo's iterators were shown to
scale up to 16 nodes for matrix multiply at the last HPEC conference, but
your use case could still break Accumulo, in the worst case causing
deadlock if the iterators are not used properly.  You're also free to write
your own code using Graphulo's code as a starting point, if you're more
comfortable with that.  You may also decide on another approach, such as
launching a MapReduce job against Accumulo's RFiles, which could be better
or worse depending on your use case.

On Sat, Nov 5, 2016 at 10:28 AM, Yamini Joshi <[email protected]> wrote:

> Hello all
>
> As per
> https://github.com/apache/accumulo/blob/master/docs/src/main/asciidoc/chapters/iterator_design.txt
> "
> Implementations of Iterator might be tempted to open BatchWriters inside
> of an Iterator as a means
> to implement triggers for writing additional data outside of their client
> application. The lifecycle of an Iterator
> is *not* managed in such a way that guarantees that this is safe nor
> efficient. Specifically, there
> is no way to guarantee that the internal ThreadPool inside of the
> BatchWriter is closed (and the thread(s)
> are reaped) without calling the close() method. `close`'ing and recreating
> a `BatchWriter` after every
> Key-Value pair is also prohibitively performance limiting to be considered
> an option."
>
> If I need to write a subset of records generated from an iterator to a
> file/table, I can't use a batch writer inside of an iterator? Is there any
> other way to go about it?
>
> Best regards,
> Yamini Joshi
>
