Re: Design-for-comment: Accumulo Server-Side Computation: Stored Procedure Tables starring SpGEMM

Dylan Hutchison Thu, 26 Feb 2015 12:07:33 -0800

Hi Christopher, responses after yours:

> It's a clever way to leverage existing Accumulo behaviors (full major
> compaction) to act as clients in order to perform a parallel operation to
> populate a new table. Have you tried this method in practice at all, yet?
> What pitfalls have you run into, perhaps regarding client-side static state
> in the JVM, or resource management issues within the tablet servers?

So far I have had success implementing a variant of the design doc:
everything except the RemoteWriteIterator.  I ran two RemoteSourceIterators
<https://github.com/Accla/d4m_api_java/blob/master/src/main/java/edu/mit/ll/graphulo/RemoteSourceIterator.java>
and a DotMultIterator
<https://github.com/Accla/d4m_api_java/blob/master/src/main/java/edu/mit/ll/graphulo/DotRemoteSourceIterator.java>
on a new Accumulo table via a one-time manual major compaction.  It works
as expected.  Test code is here for RemoteSource
<https://github.com/Accla/d4m_api_java/blob/master/src/test/java/edu/mit/ll/graphulo/RemoteIteratorTest.java>
and here for the Dot
<https://github.com/Accla/d4m_api_java/blob/master/src/test/java/edu/mit/ll/graphulo/DotRemoteIteratorTest.java>.
That said, the tests are very small.  I expect unforeseen issues,
maybe the ones you describe, will arise when we scale up.
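
For readers who have not opened the linked code, here is a minimal sketch of
one plausible reading of a dot-multiply over two sorted sources: a merge join
that multiplies values wherever both streams hold the same key.  This is an
assumption made for illustration, written in plain Python rather than against
the Accumulo iterator API, and it is not a description of the linked
DotMultIterator.  The name dot_multiply and the sample data are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Entry = Tuple[str, float]  # (key, value); each stream is assumed sorted by key


def dot_multiply(a: Iterable[Entry], b: Iterable[Entry]) -> Iterator[Entry]:
    """Merge-join two key-sorted streams, multiplying values at matching keys.

    Hypothetical helper for illustration only; not part of Accumulo or Graphulo.
    """
    it_a, it_b = iter(a), iter(b)
    ea, eb = next(it_a, None), next(it_b, None)
    while ea is not None and eb is not None:
        if ea[0] < eb[0]:
            ea = next(it_a, None)      # key only in a: skip it
        elif ea[0] > eb[0]:
            eb = next(it_b, None)      # key only in b: skip it
        else:
            yield ea[0], ea[1] * eb[1]  # key in both: emit the product
            ea, eb = next(it_a, None), next(it_b, None)


# Made-up sample data standing in for two remote, key-sorted sources.
left = [("r1", 2.0), ("r2", 3.0), ("r4", 5.0)]
right = [("r2", 10.0), ("r3", 7.0), ("r4", 0.5)]
result = list(dot_multiply(left, right))
```

Because both inputs are read strictly in key order, the sketch keeps only one
entry per source in memory at a time.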

Later this week I will prototype the design doc as written, using a stored
procedure table that has table splits and major-compaction iterators but
does *not* store any data.

> What do you think the similarities and differences are with other parallel
> execution methods that one could use to achieve the same results (like
> Map/Reduce)?

Like YARN, MapReduce is great when we want to run computations over
an entire table, i.e. when we value high throughput over low latency.  The
design doc addresses the reverse use case: we want to select subsets
of a table and value low latency over high throughput.  This is where
Accumulo lends us strength.  There is also a chance we can stream results
if our computation preserves sorted order; see the temporary table section
<https://github.com/Accla/accumulo_stored_procedure_design#temporary-tables>.
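
To illustrate the sorted-order point, here is a minimal sketch in plain Python
(not Accumulo API code).  It assumes entries keyed by a hypothetical
(row, column) pair and sorted by that key.  A computation that only changes
values keeps the key order and can emit each result as soon as it is computed,
while a transpose rewrites the keys and has to buffer and re-sort before
anything can be returned in order.  The function names and data are invented
for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Key = Tuple[str, str]      # (row, column), hypothetical key layout
Entry = Tuple[Key, int]    # (key, value)


def scale_values(entries: Iterable[Entry], factor: int) -> Iterator[Entry]:
    """Order-preserving: keys are untouched, so results can be streamed."""
    for key, value in entries:
        yield key, value * factor


def transpose(entries: Iterable[Entry]) -> List[Entry]:
    """Not order-preserving: swapping row and column forces a full sort."""
    swapped = [((col, row), value) for (row, col), value in entries]
    return sorted(swapped)  # must see every entry before emitting the first


# Made-up input, already sorted by (row, column).
source = [(("a", "y"), 1), (("b", "x"), 2)]

scaled = list(scale_values(source, 3))   # same key order as the input
transposed = transpose(source)           # key order differs from the input
```

In the first case a consumer could start reading results immediately; in the
second, the intermediate buffer plays the role that a temporary table would
play.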

> Also, do you have any code available for an example RemoteSourceIterator,
> which we might be able to try? The Transpose one seemed simple enough, but
> any others would be neat to try, also.

See above and the links in the design doc.  Here is a RemoteMergeIterator
<https://github.com/Accla/d4m_api_java/blob/master/src/main/java/edu/mit/ll/graphulo/RemoteMergeIterator.java>.
The code is somewhat messy because I shifted between designs.

> Do you have any thoughts on whether there should be some abstract base
> class available in Accumulo (vs. as part of the contrib) to support these
> iterators and handle the boiler-plate stuff of setting up/serializing the
> client configuration when the procedure executes, or utilities to help
> create a stored procedure table?

Not sure yet.  I can report back to the Accumulo community after one pass
implementing the design.  For now, I'd like to hear the Accumulo community's
opinion on the design and its merit.  The temporary table concept may be
worth Accumulo core consideration.

I wonder if one result of this project is writing a guideline /
best-practices document on *where to place iterators*, or in other words,
where to place computation.

Regards,
Dylan Hutchison

On Thu, Feb 26, 2015 at 2:34 PM, Christopher <ctubb...@apache.org> wrote:

> Hi Dylan,
>
> It's a clever way to leverage existing Accumulo behaviors (full major
> compaction) to act as clients in order to perform a parallel operation to
> populate a new table. Have you tried this method in practice at all, yet?
> What pitfalls have you run into, perhaps regarding client-side static state
> in the JVM, or resource management issues within the tablet servers? What
> do you think the similarities and differences are with other parallel
> execution methods that one could use to achieve the same results (like
> Map/Reduce)?
>
> Also, do you have any code available for an example RemoteSourceIterator,
> which we might be able to try? The Transpose one seemed simple enough, but
> any others would be neat to try, also.
>
> Do you have any thoughts on whether there should be some abstract base
> class available in Accumulo (vs. as part of the contrib) to support these
> iterators and handle the boiler-plate stuff of setting up/serializing the
> client configuration when the procedure executes, or utilities to help
> create a stored procedure table?
>
>
> --
> Christopher L Tubbs II
> http://gravatar.com/ctubbsii
>
> On Thu, Feb 26, 2015 at 12:42 AM, Dylan Hutchison <dhutc...@mit.edu>
> wrote:
>
>> Hello all,
>>
>> As promised
>> <https://mail-archives.apache.org/mod_mbox/accumulo-user/201502.mbox/%3CCAPx%3DJkakO3ice7vbH%2BeUo%2B6AP1JPebVbTDu%2Bg71KV8SvQ4J9WA%40mail.gmail.com%3E>,
>> here is a design doc open for comments on implementing server-side
>> computation in Accumulo.
>>
>> https://github.com/Accla/accumulo_stored_procedure_design
>>
>> Would love to hear your opinion, especially if the proposed design
>> pattern matches one of *your use cases*.
>>
>> Regards,
>> Dylan Hutchison
>>
>>
>
