Hi Jason,

I will share the code on GitHub tomorrow, so you can review it there if that helps.
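In the meantime, here is a simplified, Drill-free sketch of the range check the UDF is meant to perform. The class name, the use of LocalDate instead of Drill's timestamp holders, and the plain 0/1 return codes are all illustrative only, not the actual code I'll push:

```java
import java.time.LocalDate;

// Illustrative sketch of the dirInRange logic: given a [from, to] range and
// the year/month/day directory names, decide whether the directory can
// contain relevant data (1) or can be skipped entirely (0).
public class DirRangeCheck {

    /** Returns 1 when the year[/month[/day]] directory overlaps [from, to], else 0. */
    public static int dirInRange(LocalDate from, LocalDate to,
                                 String yearDir, String monthDir, String dayDir) {
        if (yearDir == null || yearDir.isEmpty()) {
            return 1; // no partition info at all: the directory must be scanned
        }
        int year = Integer.parseInt(yearDir);
        if (year < from.getYear() || year > to.getYear()) {
            return 0; // the whole year falls outside the range: skip it
        }
        if (monthDir == null || monthDir.isEmpty()) {
            return 1; // partitioned by year only, and the year overlaps
        }
        // Narrow down to the first/last date covered by this year/month
        LocalDate first = LocalDate.of(year, Integer.parseInt(monthDir), 1);
        LocalDate last = first.withDayOfMonth(first.lengthOfMonth());
        if (dayDir != null && !dayDir.isEmpty()) {
            first = first.withDayOfMonth(Integer.parseInt(dayDir));
            last = first; // day-partitioned: the directory covers a single date
        }
        // Standard interval-overlap test against [from, to]
        return (last.isBefore(from) || first.isAfter(to)) ? 0 : 1;
    }
}
```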
When I was testing this earlier today I saw, to my surprise, that the query
sometimes returned results. This was not constant: I could run exactly the
same statement with two different results (no data, or part of the test
data). I will let you know if I find a reproducible pattern, or at least a
way to produce this at all.

I'm not sure I understand what "filter above the scan is not being removed
as it should be" means, but, without knowing anything about it, I assumed
that an expression with a dirN on the left side would at least "fail the
whole directory" (hope that makes sense).

I'll let you know when the code is in a good-enough state and I have pushed
it to GitHub.

Thanks for all the help guys, it's appreciated.

Regards,
-Stefan

On Fri, Jul 24, 2015 at 8:46 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:

> I'm not sure; it is possible that it is being evaluated during planning to
> prune the scan, but the filter above the scan is not being removed as it
> should be. I'll try to re-create the case and take a look.
>
> Stefan,
> Earlier you had mentioned that it was not only inefficient, but that it was
> also giving you back data when none matched the filter condition. Is this
> still an issue, or did the comments from Hakim and Jacques allow you to fix
> an issue with the UDF that was causing this?
>
> On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> > - This is being called for *every record for every file in every
> > directory*
> >
> > Are you sure? Constant reduction should take care of this. @Jason, any
> > ideas why it might be failing?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> >
> > > Hi,
> > >
> > > thanks for the tips.
> > >
> > > Observation:
> > >
> > > - This is being called for *every record for every file in every
> > >   directory*
> > >
> > > Can you please tell me what needs to be done to make sure this is only
> > > called once for each directory, preferably before the files in that
> > > directory are opened/scanned.
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> > >
> > > > Two quick notes:
> > > >
> > > > - If you switch to internal null handling, you have to define
> > > >   separate UDFs for each possible combination of nullable and
> > > >   non-nullable values.
> > > > - isSet is an integer, so your if clause would actually be:
> > > >
> > > > if (!(yearDir.isSet == 1)) {
> > > >     // yearDir is NULL, handle this here
> > > > }
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > >
> > > > > Hi Stefán,
> > > > >
> > > > > I think when you specify your UDF as NULL_IF_NULL it means Drill
> > > > > will handle null values automatically: if any argument passed to
> > > > > your UDF is NULL, the UDF won't be evaluated and Drill will return
> > > > > NULL instead.
> > > > >
> > > > > In your case your UDF needs to handle NULL values itself by setting:
> > > > >
> > > > > nulls = NullHandling.INTERNAL
> > > > >
> > > > > Then inside your UDF you can check whether a passed argument is
> > > > > null like this:
> > > > >
> > > > > if (!yearDir.isSet) {
> > > > >     // yearDir is NULL, handle this here
> > > > > }
> > > > >
> > > > > Let me know if this works for you; I haven't worked on UDFs for
> > > > > quite some time and they may have changed slightly since then.
> > > > > Thanks
> > > > >
> > > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have this running now:
> > > > > >
> > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > > dirInRange(cast('2015-04-10' as timestamp), cast('2015-07-11' as
> > > > > > timestamp), COALESCE(dir0,'-'), COALESCE(dir1,'-'), COALESCE(dir2,'-'))
> > > > > > order by occurred_at;"
> > > > > >
> > > > > > Observations:
> > > > > >
> > > > > > - Being able to do this without using COALESCE would be more
> > > > > >   readable
> > > > > > - This is being called for every record for every file in every
> > > > > >   directory
> > > > > >   - I would love for this expression to be evaluated only at
> > > > > >     "the start of a new directory"
> > > > > > - This function always returns 1 and should never match dir0
> > > > > >   - For some strange reason this query sometimes returns results
> > > > > >     even though it should never do that (dir0 != 1)
> > > > > >
> > > > > > Regards,
> > > > > > -Stefan
> > > > > >
> > > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to share our intentions for organizing our data
> > > > > > > and how we plan to construct queries for it.
> > > > > > >
> > > > > > > There are four main reasons for sharing this:
> > > > > > >
> > > > > > > a) I would like to sanity-check the approach
> > > > > > > b) I'm having a hard time writing a UDF to optimize this and
> > > > > > > need a bit of help.
> > > > > > > c) This can perhaps benefit someone else
> > > > > > > d) Check whether a UDF like the one I'm writing already exists
> > > > > > >
> > > > > > > Relevant facts:
> > > > > > >
> > > > > > > - We have multiple tenants and they collect variable amounts
> > > > > > >   of data (from almost none to quite a lot)
> > > > > > > - Each tenant has multiple data-sources (aka time-series / tables)
> > > > > > > - Data is processed to Parquet at variable intervals depending
> > > > > > >   on the tenant and his volume
> > > > > > > - Parquet files cannot be updated/appended to in Drill
> > > > > > >   ("create from" creates a new directory (aka table))
> > > > > > >
> > > > > > > The structure we intend to use:
> > > > > > >
> > > > > > > - / <root-directory> / [processed|streaming] / <tenant> /
> > > > > > >   <data-source> / <year> / <month> / <date>
> > > > > > >   example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > > >
> > > > > > > - where *processed* is the sub-directory used for Parquet files
> > > > > > >   and the *streaming* sub-directory is used for "raw materials"
> > > > > > >   (json/csv/logs etc.)
> > > > > > >   This structure also allows us to have dev-nodes with a local
> > > > > > >   streaming directory and a remote (hdfs/s3) processed directory
> > > > > > >
> > > > > > > How often data is processed, and how it is stored, will vary
> > > > > > > between tenants:
> > > > > > >
> > > > > > > - small tenant:
> > > > > > >   /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > > > - medium tenant:
> > > > > > >   /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > > > - ...
> > > > > > > (you get the picture)
> > > > > > >
> > > > > > > *Important:* this means that in some cases we will have empty
> > > > > > > or missing sub-directories, and in some cases we will have all
> > > > > > > the data in the root directory, but the queries we run will
> > > > > > > always be the same ones (regardless of tenant-specific storage
> > > > > > > "layout").
> > > > > > >
> > > > > > > When we do selects we use *union* to query both the *processed*
> > > > > > > and *streaming* directory structures and then merge the data
> > > > > > > (if needed (for grouping etc.)).
> > > > > > >
> > > > > > > This is all working just fine (so far, so good).
> > > > > > >
> > > > > > > Now I'm creating a small UDF that is supposed to help with the
> > > > > > > scanning of these directories.
> > > > > > >
> > > > > > > Requirements:
> > > > > > >
> > > > > > > - Avoid scanning/opening files in directories that are irrelevant
> > > > > > > - Include files in every directory of full/partial relevance
> > > > > > > - Minimize the additional overhead of including this UDF in
> > > > > > >   the SQL command.
> > > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > > >
> > > > > > > - @Param TimeStampHolder from;
> > > > > > > - @Param NullableTimeStampHolder to;
> > > > > > > - @Param NullableVarCharHolder yearDir;
> > > > > > > - @Param NullableVarCharHolder monthDir;
> > > > > > > - @Param NullableVarCharHolder dayDir;
> > > > > > >
> > > > > > > It currently returns an int (1 if in range, and other values
> > > > > > > depending on what failed).
> > > > > > >
> > > > > > > The function is defined like this:
> > > > > > >
> > > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > > >
> > > > > > > ...and it's called like this:
> > > > > > >
> > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > > *dirInRange*(cast('2015-04-10' as timestamp), cast('2015-07-11'
> > > > > > > as timestamp), dir0, dir1, dir2) = 1 order by occurred_at;"
> > > > > > >
> > > > > > > This is not working and I have several questions:
> > > > > > >
> > > > > > > - Is it possible that this function is only called if/when
> > > > > > >   dir0, dir1, and dir2 all exist?
> > > > > > >   - If that is the case, would it not be better for these
> > > > > > >     utility-variables to exist with a null value until needed/used?
> > > > > > >   - Having a Nullable*Holder seems to play no role in this at all
> > > > > > >
> > > > > > > - Is there anything I can do to have this expression evaluated
> > > > > > >   only once per directory?
> > > > > > > - Is there anything I can do to have this evaluated before
> > > > > > >   the files in the directory are processed?
> > > > > > >
> > > > > > > - How do I pass a null value into this?
> > > > > > >   - this fails: *dirInRange*(cast('2015-04-10' as
> > > > > > >     timestamp), null, dir0, dir1, dir2)
> > > > > > >     Error: PARSE ERROR: From line 1, column 82 to line 1,
> > > > > > >     column 85: Illegal use of 'NULL'
> > > > > > >
> > > > > > > This will be a part of my mini UDF library
> > > > > > > (https://github.com/acmeguy/asdrill), feel free to use/abuse
> > > > > > > it if it helps.
> > > > > > >
> > > > > > > Regards,
> > > > > > > -Stefán
> > > > >
> > > > > --
> > > > >
> > > > > Abdelhakim Deneche
> > > > >
> > > > > Software Engineer
> > > > >
> > > > > <http://www.mapr.com/>
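The isSet convention Jacques corrects in the thread can be illustrated with a small stand-in for Drill's holder class. The mock below is a simplification for illustration only (the real NullableVarCharHolder stores bytes in a buffer with start/end offsets, not a String), but the isSet-as-int check is the point: since isSet is an int, `if (!h.isSet)` does not compile in Java, and the null check must compare against 1.

```java
// Simplified mock of Drill's NullableVarCharHolder (assumption: real holder
// uses a DrillBuf plus offsets; only the isSet convention is modeled here).
public class IsSetDemo {

    static class NullableVarCharHolder {
        int isSet;    // 1 if a value is present, 0 if the argument is SQL NULL
        String value; // simplified stand-in for the buffer-backed value
    }

    /** Returns the held value, or a fallback when the holder carries SQL NULL. */
    static String orDefault(NullableVarCharHolder h, String fallback) {
        if (!(h.isSet == 1)) {  // the corrected check: compare the int against 1
            return fallback;
        }
        return h.value;
    }
}
```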