Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

Stefán Baxter Sun, 17 Jan 2016 11:37:50 -0800

Hi,

All the numbers are smaller, but the outcome is similar to the one in your
hypothetical.


A left-side query, without a join, takes <1 second (it returns a few
thousand records from a collection of a few million).
A right-side query, Lucene without a join, looking up a list of values
(using IN), returns a few hundred records in ~300 ms (a lookup on
low-cardinality fields in 9+ million metadata entries).

The same query run using joins takes ~18 seconds to execute as that is the
time it takes to iterate through all the records on the Lucene side.

The raw iteration speed of Lucene/Elastic/Solr is not really an issue when
executing a single query, but it will become one for us if all joins require
a full scan and usage increases.
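
To illustrate the difference, here is a minimal Python sketch. It is not
Drill or Lucene code: the data, the function names (scan_join, lookup_join)
and the dictionary standing in for an existing index are all hypothetical.
It only compares how many right-side rows each strategy has to read.

```python
from collections import defaultdict

# Hypothetical stand-in data: 1,000 "metadata" rows on the right side and
# three rows on the left side.
right_rows = [{"id": i, "label": f"meta-{i}"} for i in range(1000)]
left_rows = [
    {"id": 7, "event": "a"},
    {"id": 42, "event": "b"},
    {"id": 999, "event": "c"},
]

# Stand-in for an index that already exists on the right side
# (join key -> matching rows), as it would in a Lucene-backed store.
right_index = defaultdict(list)
for row in right_rows:
    right_index[row["id"]].append(row)


def scan_join(left, right, key):
    """Hash join that reads every right-side row (full scan)."""
    table = defaultdict(list)
    rows_read = 0
    for row in right:
        rows_read += 1
        table[row[key]].append(row)
    joined = [{**l, **r} for l in left for r in table.get(l[key], [])]
    return joined, rows_read


def lookup_join(left, index, key):
    """Join that pushes the left-side keys down as an IN-style lookup."""
    keys = {l[key] for l in left}
    fetched = {k: index.get(k, []) for k in keys}
    rows_read = sum(len(rows) for rows in fetched.values())
    joined = [{**l, **r} for l in left for r in fetched[l[key]]]
    return joined, rows_read


scan_result, scan_read = scan_join(left_rows, right_rows, "id")
lookup_result, lookup_read = lookup_join(left_rows, right_index, "id")
print(scan_read, lookup_read)
```

Both functions produce the same joined rows; the scan variant reads every
right-side row, while the lookup variant reads only the rows that match the
left-side keys.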

Where in the JDBC connector should I start looking for the filter pushdown
support for joins?

- Stefan






On Sun, Jan 17, 2016 at 7:20 PM, Jacques Nadeau <[email protected]> wrote:

> Can you give more detail about the join stats themselves? You also state
> 20x slower but I'm trying to understand what that means. 20x slower than
> what? Are you parallelizing the Lucene read or is this a single reader?
>
> For example:
>
> I have a join.
> The left side has a billion rows.
> The right side has 10 million rows.
> When applying the join condition, only 10k rows are needed from the right
> side.
>
> How long does it take to read a few million records from Lucene? (Recently
> with Elastic we've been seeing ~50-100k/second per thread when only
> retrieving a single stored field.)
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <[email protected]>
> wrote:
>
> > Hi Jacques,
> >
> > Thank you for taking the time, it's appreciated.
> >
> > I'm trying to contribute to the Lucene reader for Drill (started by Rahul
> > Challapalli). We would like to use it for storage of metadata used in our
> > Drill setup.
> > This is perfectly suited to our needs as the metadata is already available
> > in Lucene document+indexes and it's tenant-specific (so this is not the
> > global metadata that should reside in Postgres/HBase or something similar).
> >
> > I think it's best that I confess that I'm not sure what I'm looking for or
> > how to ask for it, at least not in proper Drill terms.
> >
> > The Lucene reader is working, but the joins currently rely on a full scan,
> > which leads to ~20 times longer execution time on simple data sets (a few
> > million records), so I need to get the index-based joins going but I don't
> > know how.
> >
> > We have resources to do this now, but our knowledge of Drill is limited and
> > I could not, in my initial scan of the project, find any use
> > of DrillJoinRel that indicated indexes were involved (please forgive me if
> > this is a false assumption).
> >
> > Can you please clarify things for me a bit:
> >
> >    - Is the JDBC connector already doing proper pushdown of filters for
> >    joins? (If so then I must really get my reading glasses on)
> >    - What will change with this new approach?
> >
> > I'm not really sure what you need from me now, but I'm more than happy to
> > share everything except the data itself :).
> >
> > The fork is located here:
> > https://github.com/activitystream/drill/tree/lucene-work but no test files
> > are included in the repo, sorry, and this is all very immature.
> >
> > Regards,
> >  -Stefán
> >
> >
> >
> >
> > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > The closest things done to date are the join pushdown in the JDBC
> > > connector and the prototype code someone built a while back to do a join
> > > using HBase as a hash table. Aman and I have an ongoing thread discussing
> > > using elastic indexing and sideband communication to accelerate joins. It
> > > would be great if you could cover exactly what you're doing (including
> > > relevant stats); that would give us a better idea of how to point you in
> > > the right direction.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <[email protected]>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Can anyone point me to an implementation where joins are implemented with
> > > > full support for filters and efficient handling of joins based on indexes?
> > > >
> > > > The only code I have come across all seems to rely on a complete scan of
> > > > the related table, and that is not acceptable for the use case we are
> > > > working on (Lucene reader).
> > > >
> > > > Regards,
> > > >  -Stefán
> > > >
> > >
> >
>
