I think the generic hash-join strategy is, for some small set A, we can
send the whole set to partitions of a larger set B and do the join in
parallel. In this case, whichever is the smaller set would be consumed on
some worker, and then distributed out to each worker participating in the
hash join. Outside of Presto, this is often done in an iterator where the
smaller set is an argument to an iterator.

On Mon, Jun 13, 2016 at 4:03 PM, Dylan Hutchison <dhutc...@cs.washington.edu
> wrote:

> Thanks for clarifying Adam.
>
> I am interested in learning more about the hash-join strategy, in case
> you're familiar with them.  Suppose we want to join the PartSupp and
> Supplier table on ps_suppkey = s_suppkey.  The s_suppkey is a primary key
> of the Supplier table and it is stored in the Accumulo row.  The ps_suppkey
> is neither a key nor stored in the row of the PartSupp table.  (The
> PartSupp table's row is a UUID.)
>
> Is the hash-join strategy to (1) scan tuples (whole rows) from PartSupp to
> a Presto worker, (2) for a batch of PartSupp tuples fetch the
> matching Supplier tuples, (3) repeat until all tuples are read from
> PartSupp?
>
> Regards, Dylan
>
> On Mon, Jun 13, 2016 at 8:24 AM, Adam J. Shook <adamjsh...@gmail.com>
> wrote:
>
> > A few clarifications:
> >
> > - Presto supports hash-based distributed joins as well as broadcast joins
> >
> > - Presto metadata is stored in ZooKeeper, but metadata storage is
> pluggable
> > and could be stored in Accumulo instead
> >
> > - The connector does use tablet locality when scanning Accumulo, but our
> > testing has shown you get better performance by giving Accumulo and
> Presto
> > their own dedicated machines, making locality a moot point.  This will
> > certainly change based on types of queries, data sizes, network quality,
> > etc.
> >
> > - You can insert the results of a query into a Presto table using INSERT
> > INTO foo SELECT ..., as well as create a table from the results of a
> query
> > (CTAS).  Though, for large inserts, it is typically best to bypass the
> > Presto layer and insert directly into the Accumulo tables using the
> > PrestoBatchWriter API
> >
> > Cheers,
> > --Adam
> >
> > On Mon, Jun 13, 2016 at 7:20 AM, Christopher <ctubb...@apache.org>
> wrote:
> >
> > > Thanks for that summary, Dylan! Very helpful.
> > >
> > > On Mon, Jun 13, 2016, 01:36 Dylan Hutchison <
> dhutc...@cs.washington.edu>
> > > wrote:
> > >
> > > > Thanks for sharing Sean.  Here are some notes I wrote after reading
> the
> > > > article on Presto-Accumulo design.  I have a research interest in the
> > > > relationship between relational (SQL) and non-relational (Accumulo)
> > > > systems, so I couldn't resist reading the post in detail.
> > > >
> > > >    - Places the primary key in the Accumulo row.
> > > >    - Performs row-at-a-time processing (each tuple is one row in
> > > >    Accumulo) using WholeRowIterator behavior.
> > > >    - Relational table metadata is stored in the Presto infrastructure
> > (as
> > > >    opposed to an Accumulo table).
> > > >    - Supports the creation of index tables for any attributes. These
> > > >    index tables speed up queries that filter on indexed attributes.
> It
> > > is
> > > >    standard secondary indexing, which provides speedups when the
> > > selectivity
> > > >    of the query is roughly <10% of the original table.
> > > >    - Only database->client querying is supported.  You cannot run
> > "select
> > > >    ... into result_table".
> > > >    - As far as I can see, Presto only has one join strategy:
> *broadcast
> > > >    join*.  The right table of every join is scanned into one of the
> > > >    Presto worker's memory.  Subsequently the size of the right table
> is
> > > >    limited by worker memory.
> > > >    - There is one Presto worker for each Accumulo tablet, which
> enables
> > > >    good scaling.
> > > >    - The Presto bridge classes track internal Accumulo information
> such
> > > >    as the assignment of tablets to tablet servers by reading
> Accumulo's
> > > >    Metadata table. Presto uses tablet locations to provide better
> > > locality.
> > > >    - The Presto bridge comes with several Accumulo server-side
> > iterators
> > > >    for filtering and aggregating.
> > > >    - The code is quite nice and clean.
> > > >
> > > > This image below gives Presto's architecture.  Accumulo takes the
> role
> > of
> > > > the DB icon in the bottom-right corner.
> > > >
> > > > [image: Inline image 2]
> > > >
> > > > Bloomberg ran 13 out of the 22 TPC-H queries.  There is no
> fundamental
> > > > reason why they cannot run all the queries; they just have not
> > > implemented
> > > > everything required ('exists' clauses, non-equi join, etc.).
> > > >
> > > > The interface looks like this, though they use a compiled java jar to
> > > > insert entries from a csv file (it wraps around a BatchWriter).
> > > >
> > > > [image: Inline image 3]
> > > >
> > > > Here are performance results.  They don't say what hardware or data
> > sizes
> > > > they use.  Whatever it is, they must have the ability to fit the
> > smaller
> > > > table of any join into memory as a result of Presto's broadcast join
> > > > strategy.  The strong scaling looks very nice.
> > > >
> > > > [image: Inline image 4]
> > > >
> > > > They have one other plot that shows how secondary indexing speeds up
> > some
> > > > queries with low selectivity.
> > > >
> > > > Cheers, Dylan
> > > >
> > > >
> > > >
> > > > On Sun, Jun 12, 2016 at 7:06 PM, Sean Busbey <bus...@cloudera.com>
> > > wrote:
> > > >
> > > >> Bloomberg have a post about a connector they made to query Accumulo
> > from
> > > >> Presto:
> > > >>
> > > >>
> > > >>
> > >
> >
> http://www.bloomberg.com/company/announcements/open-source-at-bloomberg-reducing-application-development-time-via-presto-accumulo/
> > > >>
> > > >> --
> > > >> Sean Busbey
> > > >>
> > > >
> > > >
> > >
> >
>

Reply via email to