Re: Apache Accumulo integrated with Presto

Dylan Hutchison Sun, 12 Jun 2016 22:37:00 -0700

Thanks for sharing, Sean.  Here are some notes I wrote after reading the
article on Presto-Accumulo design.  I have a research interest in the
relationship between relational (SQL) and non-relational (Accumulo)
systems, so I couldn't resist reading the post in detail.

   - Places the primary key in the Accumulo row.
   - Performs row-at-a-time processing (each tuple is one row in Accumulo)
   using WholeRowIterator behavior.
   - Relational table metadata is stored in the Presto infrastructure (as
   opposed to an Accumulo table).
   - Supports the creation of index tables for any attributes. These index
   tables speed up queries that filter on indexed attributes.  It is standard
   secondary indexing, which provides speedups when the query selects
   roughly <10% of the rows in the original table.
   - Only database->client querying is supported.  You cannot run "select
   ... into result_table".
   - As far as I can see, Presto only has one join strategy: *broadcast
   join*.  The right table of every join is scanned into Presto worker
   memory.  Consequently, the size of the right table is limited by
   worker memory.
   - There is one Presto worker for each Accumulo tablet, which enables
   good scaling.
   - The Presto bridge classes track internal Accumulo information such as
   the assignment of tablets to tablet servers by reading Accumulo's Metadata
   table. Presto uses tablet locations to provide better locality.
   - The Presto bridge comes with several Accumulo server-side iterators
   for filtering and aggregating.
   - The code is quite nice and clean.
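
To make the broadcast join point concrete, here is a minimal Python sketch
of the general technique.  It is not Presto's code; the function name and
the sample tables are made up for illustration.  The right table is loaded
whole into an in-memory hash table, and each partition of the left table
probes it, which is why the right table has to fit in memory.

```python
from collections import defaultdict


def broadcast_hash_join(left_partitions, right_rows, left_key, right_key):
    """Join every left partition against an in-memory copy of the right table."""
    # Build phase: the entire right table goes into a hash table.
    # This is the step that bounds the right table by available memory.
    build = defaultdict(list)
    for row in right_rows:
        build[row[right_key]].append(row)

    # Probe phase: each left partition (think: one per worker) streams
    # through and looks up matches; the left table is never fully held.
    joined = []
    for partition in left_partitions:
        for row in partition:
            for match in build.get(row[left_key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical sample data: orders split into two partitions, customers small.
orders = [
    [{"orderkey": 1, "custkey": 10}, {"orderkey": 2, "custkey": 20}],
    [{"orderkey": 3, "custkey": 10}, {"orderkey": 4, "custkey": 99}],
]
customers = [{"custkey": 10, "name": "alice"}, {"custkey": 20, "name": "bob"}]

result = broadcast_hash_join(orders, customers, "custkey", "custkey")
```

Order 4 has no matching customer, so an inner join drops it and `result`
holds three rows.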

The image below shows Presto's architecture.  Accumulo takes the role of
the DB icon in the bottom-right corner.

[image: Inline image 2]

Bloomberg ran 13 out of the 22 TPC-H queries.  There is no fundamental
reason why they cannot run all the queries; they just have not implemented
everything required ('exists' clauses, non-equi joins, etc.).

The interface looks like this, though they use a compiled Java jar to
insert entries from a CSV file (it wraps around a BatchWriter).

[image: Inline image 3]

Here are performance results.  They don't say what hardware or data sizes
they use.  Whatever it is, the smaller table of every join must fit into
memory, because Presto's broadcast join strategy requires it.  The strong
scaling looks very nice.

[image: Inline image 4]

They have one other plot that shows how secondary indexing speeds up some
queries that select only a small fraction of the table.
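
For anyone who has not used secondary indexing, here is a rough Python
sketch of the idea.  It is only an illustration under my own assumptions:
plain dicts stand in for the data table and the index table, and the
function names and sample rows are hypothetical, not the connector's
layout.  The index maps an attribute value to the primary keys of the rows
that hold it, so a filter becomes one index read plus a few point lookups
instead of a full scan.

```python
def build_index(table, column):
    """Map each value of `column` to the primary keys of the rows holding it."""
    index = {}
    for row_id, row in table.items():
        index.setdefault(row[column], []).append(row_id)
    return index


def scan_filter(table, column, value):
    """Baseline: read every row and keep the ones that match."""
    return {rid: row for rid, row in table.items() if row[column] == value}


def index_lookup(table, index, value):
    """Indexed path: one index read, then one point lookup per matching row."""
    return {rid: table[rid] for rid in index.get(value, [])}


# Hypothetical data table keyed by primary key (the Accumulo row).
table = {
    "row1": {"nation": "FRANCE", "balance": 10},
    "row2": {"nation": "JAPAN", "balance": 25},
    "row3": {"nation": "FRANCE", "balance": 7},
    "row4": {"nation": "KENYA", "balance": 3},
}

nation_index = build_index(table, "nation")
via_index = index_lookup(table, nation_index, "FRANCE")
via_scan = scan_filter(table, "nation", "FRANCE")
```

Both paths return the same rows; the indexed one just touches fewer of
them.  Once a query matches a large share of the table, the per-row point
lookups cost more than a sequential scan, which is where the rough 10%
rule of thumb comes from.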

Cheers, Dylan

On Sun, Jun 12, 2016 at 7:06 PM, Sean Busbey <bus...@cloudera.com> wrote:

> Bloomberg have a post about a connector they made to query Accumulo from
> Presto:
>
>
> http://www.bloomberg.com/company/announcements/open-source-at-bloomberg-reducing-application-development-time-via-presto-accumulo/
>
> --
> Sean Busbey
>
