Hi Rahul, I'm aware of the segment parallelization and the option of rewriting the queries but I disagree with that being the best option.
Since Drill supports push down of join filters, I think our best option is to implement that in the Lucene reader. Rewriting the queries may be a temporary option, but we are already using subqueries for more complex things and I really need these simple lookup joins to be both simple and effective.

- Stefan

On Sun, Jan 17, 2016 at 7:44 PM, rahul challapalli <challapallira...@gmail.com> wrote:

> The level of parallelization in the lucene plugin is a segment.
>
> Stefan,
>
> I think it would be more accurate if you rewrite your join query so that we
> push the join keys into the lucene group scan and then compare the numbers.
> Something like the below
>
> select * from tbl1 a left join (select * from tbl2 where tbl2.col1 in
> (select col1 from tbl1)) b on a.col1 = b.col1;
>
> - Rahul
>
> On Sun, Jan 17, 2016 at 11:20 AM, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> > Can you give more detail about the join stats themselves? You also state
> > 20x slower but I'm trying to understand what that means. 20x slower than
> > what? Are you parallelizing the Lucene read or is this a single reader?
> >
> > For example:
> >
> > I have a join.
> > The left side has a billion rows.
> > The right side has 10 million rows.
> > When applying the join condition, only 10k rows are needed from the right
> > side.
> >
> > How long does it take to read a few million records from Lucene? (Recently
> > with Elastic we've been seeing ~50-100k/second per thread when only
> > retrieving a single stored field.)
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> >
> > > Hi Jacques,
> > >
> > > Thank you for taking the time, it's appreciated.
> > >
> > > I'm trying to contribute to the Lucene reader for Drill (started by Rahul
> > > Challapalli). We would like to use it for storage of metadata used in our
> > > Drill setup.
> > > This is perfectly suited for our needs as the metadata is already available
> > > in Lucene document+indexes and it's tenant specific (so this is not the
> > > global metadata that should reside in Postgres/HBase or something similar).
> > >
> > > I think it's best that I confess that I'm not sure what I'm looking for or
> > > how to ask for it, at least not in proper Drill terms.
> > >
> > > The Lucene reader is working but the joins currently rely on a full scan,
> > > which makes execution ~20 times longer on simple data sets (a few
> > > million records), so I need to get the index-based joins going but I don't
> > > know how.
> > >
> > > We have resources to do this now but our knowledge of Drill is limited and
> > > I could not, in my initial scan of the project, find any use
> > > of DrillJoinRel that indicated indexes were involved (please forgive me if
> > > this is a false assumption).
> > >
> > > Can you please clarify things for me a bit:
> > >
> > > - Is the JDBC connector already doing proper pushdown of filters for
> > > joins? (If so then I must really get my reading glasses on)
> > > - What will change with this new approach?
> > >
> > > I'm not really sure what you need from me now but I'm more than happy to
> > > share everything except the data itself :).
> > >
> > > The fork is placed here:
> > > https://github.com/activitystream/drill/tree/lucene-work but no test files
> > > are included in the repo, sorry, and this is all very immature.
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> > >
> > > > The closest things done to date are the join pushdown in the jdbc
> > > > connector and the prototype code someone built a while back to do a join
> > > > using HBase as a hash table.
> > > > Aman and I have an ongoing thread discussing
> > > > using elastic indexing and sideband communication to accelerate joins. It
> > > > would be great if you could cover exactly what you're doing (including
> > > > relevant stats); that would give us a better idea of how to point you in
> > > > the right direction.
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Can anyone point me to an implementation where joins are implemented with
> > > > > full support for filters and efficient handling of joins based on indexes?
> > > > >
> > > > > The only code I have come across all seems to rely on a complete scan of
> > > > > the related table and that is not acceptable for the use case we are
> > > > > working on (Lucene reader).
> > > > >
> > > > > Regards,
> > > > > -Stefán
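
For readers following the thread, the difference between a full-scan join and a join that pushes the join keys down to an index can be sketched roughly as below. This is only an illustration in plain Python, not Drill or Lucene code: `build_index`, `full_scan_left_join`, and `lookup_left_join` are hypothetical names, and the in-memory dictionary merely stands in for a Lucene term index.

```python
from collections import defaultdict


def build_index(rows, key):
    """Map each key value to the rows holding it (a stand-in for a term index)."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def full_scan_left_join(left, right, key):
    """Left join that reads every right-side row, whether or not it matches."""
    rows_read = len(right)
    matches = defaultdict(list)
    for row in right:
        matches[row[key]].append(row)
    joined = [(l, r) for l in left for r in (matches.get(l[key]) or [None])]
    return joined, rows_read


def lookup_left_join(left, right_index, key):
    """Left join that pushes the distinct left-side keys down as index lookups."""
    wanted = {l[key] for l in left}
    fetched = {k: right_index[k] for k in wanted if k in right_index}
    rows_read = sum(len(v) for v in fetched.values())
    joined = [(l, r) for l in left for r in (fetched.get(l[key]) or [None])]
    return joined, rows_read


# Hypothetical data: a small left side joined against a larger metadata table.
left = [{"col1": k} for k in (1, 2, 2, 5000)]
right = [{"col1": k, "name": f"meta-{k}"} for k in range(1000)]
index = build_index(right, "col1")

scan_result, scan_reads = full_scan_left_join(left, right, "col1")
lookup_result, lookup_reads = lookup_left_join(left, index, "col1")
```

Both functions return the same joined rows; only the number of right-side rows read differs, which is the cost the thread is concerned with.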