Hi,

All the numbers are smaller, but the outcome is similar to the one in your hypothetical.
A left-side query, without a join, takes <1 second (it returns a few thousand records from a collection of a few million).

A right-side query (Lucene, without a join) that looks up a list of values using IN returns a few hundred records in ~300 ms. (This is a lookup on low-cardinality fields in 9+ million metadata entries.)

The same query run using joins takes ~18 seconds to execute, as that is the time it takes to iterate through all the records on the Lucene side.

The raw iteration speed of Lucene/Elastic/Solr is not really an issue when executing a single query, but it will become one for us if all joins require a full scan and usage increases.

Where in the JDBC connector should I start looking for the filter pushdown support for joins?

- Stefan

On Sun, Jan 17, 2016 at 7:20 PM, Jacques Nadeau <jacq...@dremio.com> wrote:

> Can you give more detail about the join stats themselves? You also state
> 20x slower but I'm trying to understand what that means. 20x slower than
> what? Are you parallelizing the Lucene read or is this a single reader?
>
> For example:
>
> I have a join.
> The left side has a billion rows.
> The right side has 10 million rows.
> When applying the join condition, only 10k rows are needed from the right
> side.
>
> How long does it take to read a few million records from Lucene? (Recently
> with Elastic we've been seeing ~50-100k/second per thread when only
> retrieving a single stored field.)
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Sat, Jan 16, 2016 at 12:11 PM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi Jacques,
> >
> > Thank you for taking the time, it's appreciated.
> >
> > I'm trying to contribute to the Lucene reader for Drill (started by Rahul
> > Challapalli). We would like to use it for storage of metadata used in our
> > Drill setup.
> >
> > This is perfectly suited for our needs as the metadata is already available
> > in Lucene document+indexes and it's tenant-specific (so this is not the
> > global metadata that should reside in Postgres/HBase or something similar).
> >
> > I think it's best that I confess that I'm not sure what I'm looking for or
> > how to ask for it, at least not in proper Drill terms.
> >
> > The Lucene reader is working, but the joins currently rely on a full scan,
> > which results in ~20 times longer execution time on simple data sets (a few
> > million records), so I need to get the index-based joins going but I don't
> > know how.
> >
> > We have resources to do this now, but our knowledge of Drill is limited and
> > I could not, in my initial scan of the project, find any use
> > of DrillJoinRel that indicated indexes were involved (please forgive me if
> > this is a false assumption).
> >
> > Can you please clarify things for me a bit:
> >
> > - Is the JDBC connector already doing proper pushdown of filters for
> >   joins? (If so then I must really get my reading glasses on.)
> > - What will change with this new approach?
> >
> > I'm not really sure what you need from me now, but I'm more than happy to
> > share everything except the data itself :).
> >
> > The fork is located here:
> > https://github.com/activitystream/drill/tree/lucene-work but no test files
> > are included in the repo, sorry, and this is all very immature.
> >
> > Regards,
> > -Stefán
> >
> > On Sat, Jan 16, 2016 at 7:46 PM, Jacques Nadeau <jacq...@dremio.com>
> > wrote:
> >
> > > The closest things done to date are the join pushdown in the JDBC
> > > connector and the prototype code someone built a while back to do a join
> > > using HBase as a hash table. Aman and I have an ongoing thread discussing
> > > using elastic indexing and sideband communication to accelerate joins.
> > > It would be great if you could cover exactly what you're doing (including
> > > relevant stats); that would give us a better idea of how to point you in
> > > the right direction.
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Sat, Jan 16, 2016 at 5:18 AM, Stefán Baxter <ste...@activitystream.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Can anyone point me to an implementation where joins are implemented with
> > > > full support for filters and efficient handling of joins based on
> > > > indexes?
> > > >
> > > > The only code I have come across all seems to rely on a complete scan of
> > > > the related table, and that is not acceptable for the use case we are
> > > > working on (Lucene reader).
> > > >
> > > > Regards,
> > > > -Stefán
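
For readers following the thread, the gap between the ~300 ms IN lookup and the ~18 second join can be illustrated with a toy model. The sketch below is not Drill or Lucene code; every name in it (`build_index`, `full_scan_join`, `pushdown_join`, the sample rows) is hypothetical, and a plain Python dict stands in for the Lucene index. It only counts how many right-side rows each strategy has to read.

```python
from collections import defaultdict


def build_index(rows, field):
    """Map each value of `field` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[field]].append(pos)
    return index


def full_scan_join(left, right, key):
    """Hash join that reads every right-side row, matching or not."""
    wanted = defaultdict(list)
    for l in left:
        wanted[l[key]].append(l)
    joined, rows_read = [], 0
    for r in right:
        rows_read += 1
        for l in wanted.get(r[key], ()):
            joined.append({**l, **r})
    return joined, rows_read


def pushdown_join(left, right, right_index, key):
    """Send the distinct left-side keys to the right side as an IN-style lookup."""
    by_key = defaultdict(list)
    for l in left:
        by_key[l[key]].append(l)
    joined, rows_read = [], 0
    for k, left_rows in by_key.items():
        # Only rows the index says match are ever touched.
        for pos in right_index.get(k, ()):
            rows_read += 1
            r = right[pos]
            for l in left_rows:
                joined.append({**l, **r})
    return joined, rows_read


# Toy data: a large "metadata" side and a small filtered left side.
right = [{"id": i, "label": f"meta-{i}"} for i in range(100_000)]
left = [{"id": i * 1000, "event": f"e{i}"} for i in range(50)]
right_index = build_index(right, "id")

scan_rows, scan_read = full_scan_join(left, right, "id")
push_rows, push_read = pushdown_join(left, right, right_index, "id")
print(f"full scan read {scan_read} rows, pushdown read {push_read} rows")
```

Both strategies produce the same joined rows; the difference is only in how much of the right side is read, which is the cost the thread is asking to avoid.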