Re: Efficient joins in Drill - avoiding the massive overhead of scan based joins

Jacques Nadeau Sun, 17 Jan 2016 11:21:06 -0800

Can you give more detail about the join stats themselves? You also state
20x slower but I'm trying to understand what that means. 20x slower than
what? Are you parallelizing the Lucene read or is this a single reader?


For example:

I have a join.
The left side has a billion rows.
The right side has 10 million rows.
When applying the join condition, only 10k rows are needed from the right
side.

How long does it take to read a few million records from Lucene? (Recently
with Elastic we've been seeing ~50-100k records/second per thread when only
retrieving a single stored field.)
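
A rough back-of-envelope sketch of why the numbers in the example matter. It assumes the ~50-100k records/second per-thread rate seen with Elastic carries over to a single-threaded Lucene reader, which is only an assumption until measured; the function and variable names are made up for illustration:

```python
# Back-of-envelope read times for the example join above.
# Assumption: a Lucene reader sustains roughly the Elastic rate quoted
# above (50k-100k records/second per thread). Measure before relying on it.

def read_seconds(rows: int, rows_per_second: int, threads: int = 1) -> float:
    """Seconds needed to read `rows` records at a per-thread rate."""
    return rows / (rows_per_second * threads)

RIGHT_SIDE_ROWS = 10_000_000   # full right side of the join
NEEDED_ROWS = 10_000           # rows that actually satisfy the join condition
RATES = (50_000, 100_000)      # assumed records/second per thread

# Scan-based join: every right-side row is read.
full_scan = [read_seconds(RIGHT_SIDE_ROWS, rate) for rate in RATES]

# Index-based join: only the matching rows are read.
needed_only = [read_seconds(NEEDED_ROWS, rate) for rate in RATES]

for rate, scan, needed in zip(RATES, full_scan, needed_only):
    print(f"{rate} rec/s: full scan {scan:.1f}s, needed rows only {needed:.1f}s")
```

With one thread, this puts a full read of the right side in the range of minutes and a read of only the needed rows at well under a second, so whether the read is parallelized changes the picture considerably.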

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <[email protected]>
wrote:

> Hi Jacques,
>
> Thank you for taking the time, it's appreciated.
>
> I'm trying to contribute to the Lucene reader for Drill (Started by Rahul
> Challapalli). We would like to use it for storage of metadata used in our
> Drill setup.
> This is perfectly suited for our needs as the metadata is already available
> in Lucene document+indexes and it's tenant-specific (so this is not the
> global metadata that should reside in Postgres/HBase or something similar).
>
> I think it's best that I confess that I'm not sure what I'm looking for or
> how to ask for it, at least not in proper Drill terms.
>
> The Lucene reader is working, but the joins currently rely on a full scan,
> which makes execution ~20 times longer on simple data sets (a few million
> records), so I need to get index-based joins going, but I don't know how.
>
> We have resources to do this now, but our knowledge of Drill is limited and
> I could not, in my initial scan of the project, find any use of
> DrillJoinRel that indicated indexes were involved (please forgive me if
> this is a false assumption).
>
> Can you please clarify things for me a bit:
>
>    - Is the JDBC connector already doing proper pushdown of filters for
>    joins? (If so then I must really get my reading glasses on)
>    - What will change with this new approach?
>
> I'm not really sure what you need from me now but I'm more than happy to
> share everything except the data itself :).
>
> The fork is located here:
> https://github.com/activitystream/drill/tree/lucene-work but no test
> files are included in the repo, sorry, and this is all very immature.
>
> Regards,
>  -Stefán
>
>
>
>
> On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <[email protected]>
> wrote:
>
> > The closest things done to date are the join pushdown in the JDBC
> > connector and the prototype code someone built a while back to do a join
> > using HBase as a hash table. Aman and I have an ongoing thread discussing
> > using elastic indexing and sideband communication to accelerate joins. It
> > would be great if you could cover exactly what you're doing (including
> > relevant stats); that would give us a better idea of how to point you in
> > the right direction.
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > Can anyone point me to an implementation where joins are implemented
> > > with full support for filters and efficient handling of joins based on
> > > indexes?
> > >
> > > The only code I have come across all seems to rely on a complete scan
> > > of the related table, and that is not acceptable for the use case we
> > > are working on (Lucene reader).
> > >
> > > Regards,
> > >  -Stefán
> > >
> >
>
