Hi Rahul,

I'm aware of the segment parallelization and the option of rewriting the
queries, but I don't think that is the best option.

Since Drill supports pushdown of join filters, I think our best option is
to implement that in the Lucene reader.
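
To make this concrete, here is a rough sketch of what I have in mind on the
Lucene side (class, method and field names are mine and purely illustrative,
not taken from the plugin): once the planner pushes the join keys into the
group scan, the reader can turn them into a term lookup instead of scanning
every document:

    import java.io.IOException;
    import java.util.List;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;

    public class JoinKeyLookup {
      // joinKeys would come from the join filter the planner pushes into the group scan
      static TopDocs lookup(Directory dir, String field, List<String> joinKeys, int limit)
          throws IOException {
        BooleanQuery.Builder disjunction = new BooleanQuery.Builder();
        for (String key : joinKeys) {
          // one TermQuery per pushed-down join key; a document matching any key qualifies
          disjunction.add(new TermQuery(new Term(field, key)), BooleanClause.Occur.SHOULD);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          // index lookup over the key field instead of a full segment scan
          return searcher.search(disjunction.build(), limit);
        }
      }
    }

(One caveat with this naive sketch: BooleanQuery has a default limit of 1024
clauses, so for large key sets the reader would need something like TermsQuery
or batching of the keys.)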

Rewriting the queries may be a temporary option, but we are already using
subqueries for more complex things and I really need these simple lookup joins
to be both simple and effective.

- Stefan

On Sun, Jan 17, 2016 at 7:44 PM, rahul challapalli <
challapallira...@gmail.com> wrote:

> The level of parallelization in the lucene plugin is a segment.
>
> Stefan,
>
> I think it would be more accurate if you rewrote your join query so that we
> push the join keys into the Lucene group scan and then compare the numbers.
> Something like the following:
>
>    select * from tbl1 a left join (select * from tbl2 where tbl2.col1 in
> (select col1 from tbl1)) b on a.col1 = b.col1;
>
> - Rahul
>
> On Sun, Jan 17, 2016 at 11:20 AM, Jacques Nadeau <jacq...@dremio.com>
> wrote:
>
> > Can you give more detail about the join stats themselves? You also state
> > 20x slower but I'm trying to understand what that means. 20x slower than
> > what? Are you parallelizing the Lucene read or is this a single reader?
> >
> > For example:
> >
> > I have a join.
> > The left side has a billion rows.
> > The right side has 10 million rows.
> > When applying the join condition, only 10k rows are needed from the right
> > side.
> >
> > How long does it take to read a few million records from Lucene? (Recently
> > with Elastic we've been seeing ~50-100k/second per thread when only
> > retrieving a single stored field.)
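> >
> > (For scale, using those numbers: reading 10 million rows at ~50-100k/second
> > works out to roughly 100-200 seconds per thread, whereas an index lookup of
> > just the 10k rows actually needed should finish in a small fraction of
> > that.)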
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <ste...@activitystream.com>
> > wrote:
> >
> > > Hi Jacques,
> > >
> > > Thank you for taking the time, it's appreciated.
> > >
> > > I'm trying to contribute to the Lucene reader for Drill (started by Rahul
> > > Challapalli). We would like to use it for storage of metadata used in our
> > > Drill setup.
> > > This is perfectly suited for our needs as the metadata is already available
> > > in Lucene documents+indexes and it is tenant specific (so this is not the
> > > global metadata that should reside in Postgres/HBase or something similar).
> > >
> > > I think it's best that I confess that I'm not sure what I'm looking for or
> > > how to ask for it, at least not in proper Drill terms.
> > >
> > > The Lucene reader is working, but the joins currently rely on a full scan,
> > > which introduces ~20 times longer execution time on simple data sets (a few
> > > million records), so I need to get the index-based joins going but I don't
> > > know how.
> > >
> > > We have resources to do this now, but our knowledge of Drill is limited and
> > > I could not, in my initial scan of the project, find any use
> > > of DrillJoinRel that indicated indexes were involved (please forgive me if
> > > this is a false assumption).
> > >
> > > Can you please clarify things for me a bit:
> > >
> > >    - Is the JDBC connector already doing proper pushdown of filters for
> > >    joins? (If so then I must really get my reading glasses on)
> > >    - What will change with this new approach?
> > >
> > > I'm not really sure what you need from me now, but I'm more than happy to
> > > share everything except the data itself :).
> > >
> > > The fork is located here:
> > > https://github.com/activitystream/drill/tree/lucene-work but no test files
> > > are included in the repo, sorry, and this is all very immature.
> > >
> > > Regards,
> > >  -Stefán
> > >
> > >
> > >
> > >
> > > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <jacq...@dremio.com>
> > > wrote:
> > >
> > > > The closest things done to date are the join pushdown in the JDBC
> > > > connector and the prototype code someone built a while back to do a join
> > > > using HBase as a hash table. Aman and I have an ongoing thread discussing
> > > > using Elastic indexing and sideband communication to accelerate joins. It
> > > > would be great if you could cover exactly what you're doing (including
> > > > relevant stats); that would give us a better idea of how to point you in
> > > > the right direction.
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <ste...@activitystream.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Can anyone point me to an implementation where joins are implemented
> > > > > with full support for filters and efficient handling of joins based on
> > > > > indexes?
> > > > >
> > > > > The only code I have come across seems to rely on a complete scan of the
> > > > > related table, and that is not acceptable for the use case we are working
> > > > > on (the Lucene reader).
> > > > >
> > > > > Regards,
> > > > >  -Stefán
> > > > >
> > > >
> > >
> >
>
