Hi Jason,

I will share the code on GitHub tomorrow, so you can review it there if that helps.
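In the meantime, here is a simplified, Drill-free sketch of the range check the UDF is meant to perform. The class name, the use of LocalDate instead of Drill's timestamp holders, and the plain 0/1 return codes are all illustrative only, not the actual code I'll push:

```java
import java.time.LocalDate;

// Illustrative sketch of the dirInRange logic: given a [from, to] range and
// the year/month/day directory names, decide whether the directory can
// contain relevant data (1) or can be skipped entirely (0).
public class DirRangeCheck {

    /** Returns 1 when the year[/month[/day]] directory overlaps [from, to], else 0. */
    public static int dirInRange(LocalDate from, LocalDate to,
                                 String yearDir, String monthDir, String dayDir) {
        if (yearDir == null || yearDir.isEmpty()) {
            return 1; // no partition info at all: the directory must be scanned
        }
        int year = Integer.parseInt(yearDir);
        if (year < from.getYear() || year > to.getYear()) {
            return 0; // the whole year falls outside the range: skip it
        }
        if (monthDir == null || monthDir.isEmpty()) {
            return 1; // partitioned by year only, and the year overlaps
        }
        // Narrow down to the first/last date covered by this year/month
        LocalDate first = LocalDate.of(year, Integer.parseInt(monthDir), 1);
        LocalDate last = first.withDayOfMonth(first.lengthOfMonth());
        if (dayDir != null && !dayDir.isEmpty()) {
            first = first.withDayOfMonth(Integer.parseInt(dayDir));
            last = first; // day-partitioned: the directory covers a single date
        }
        // Standard interval-overlap test against [from, to]
        return (last.isBefore(from) || first.isAfter(to)) ? 0 : 1;
    }
}
```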
When I was testing this earlier today I saw, to my surprise, that the query
sometimes returned results. This was not constant: I could run exactly the
same statement with two different results (no data, or part of the test
data). I will let you know if I find a reproducible pattern, or at least a
way to produce this at all.

I'm not sure I understand what "filter above the scan is not being removed
as it should be" means, but, without knowing anything about it, I assumed
that an expression with a dirN on the left side would at least "fail the
whole directory" (hope that makes sense).

I'll let you know when the code is in a good-enough state and I have pushed
it to GitHub.

Thanks for all the help guys, it's appreciated.

Regards,
-Stefan

On Fri, Jul 24, 2015 at 8:46 PM, Jason Altekruse <altekruseja...@gmail.com> wrote:

> I'm not sure; it is possible that it is being evaluated during planning to
> prune the scan, but the filter above the scan is not being removed as it
> should be. I'll try to re-create the case and take a look.
>
> Stefan,
> Earlier you had mentioned that it was not only inefficient, but that it was
> also giving you back data when none matched the filter condition. Is this
> still an issue, or did the comments from Hakim and Jacques allow you to fix
> an issue with the UDF that was causing this?
>
> On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> > - This is being called for *every record for every file in every
> > directory*
> >
> > Are you sure? Constant reduction should take care of this. @Jason, any
> > ideas why it might be failing?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> >
> > > Hi,
> > >
> > > thanks for the tips.
> > >
> > > Observation:
> > >
> > > - This is being called for *every record for every file in every
> > >   directory*
> > >
> > > Can you please tell me what needs to be done to make sure this is only
> > > called once for each directory, preferably before the files in that
> > > directory are opened/scanned.
> > >
> > > Regards,
> > > -Stefán
> > >
> > > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> > >
> > > > Two quick notes:
> > > >
> > > > - If you switch to internal null handling, you have to define
> > > >   separate UDFs for each possible combination of nullable and
> > > >   non-nullable values.
> > > > - isSet is an integer, so your if clause would actually be:
> > > >
> > > > if (!(yearDir.isSet == 1)) {
> > > >     // yearDir is NULL, handle this here
> > > > }
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <adene...@maprtech.com> wrote:
> > > >
> > > > > Hi Stefán,
> > > > >
> > > > > I think when you specify your UDF as NULL_IF_NULL it means Drill
> > > > > will handle null values automatically: if any argument passed to
> > > > > your UDF is NULL, the UDF won't be evaluated and Drill will return
> > > > > NULL instead.
> > > > >
> > > > > In your case your UDF needs to handle NULL values itself by setting:
> > > > >
> > > > > nulls = NullHandling.INTERNAL
> > > > >
> > > > > Then inside your UDF you can check whether a passed argument is
> > > > > null like this:
> > > > >
> > > > > if (!yearDir.isSet) {
> > > > >     // yearDir is NULL, handle this here
> > > > > }
> > > > >
> > > > > Let me know if this works for you; I haven't worked on UDFs for
> > > > > quite some time and they may have changed slightly since then.
> > > > > Thanks
> > > > >
> > > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have this running now:
> > > > > >
> > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > > dirInRange(cast('2015-04-10' as timestamp), cast('2015-07-11' as
> > > > > > timestamp), COALESCE(dir0,'-'), COALESCE(dir1,'-'), COALESCE(dir2,'-'))
> > > > > > order by occurred_at;"
> > > > > >
> > > > > > Observations:
> > > > > >
> > > > > > - Being able to do this without using COALESCE would be more
> > > > > >   readable
> > > > > > - This is being called for every record for every file in every
> > > > > >   directory
> > > > > >   - I would love for this expression to be evaluated only at
> > > > > >     "the start of a new directory"
> > > > > > - This function always returns 1 and should never match dir0
> > > > > >   - For some strange reason this query sometimes returns results
> > > > > >     even though it should never do that (dir0 != 1)
> > > > > >
> > > > > > Regards,
> > > > > > -Stefan
> > > > > >
> > > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <ste...@activitystream.com> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I would like to share our intentions for organizing our data
> > > > > > > and how we plan to construct queries for it.
> > > > > > >
> > > > > > > There are four main reasons for sharing this:
> > > > > > >
> > > > > > > a) I would like to sanity-check the approach
> > > > > > > b) I'm having a hard time writing a UDF to optimize this and
> > > > > > > need a bit of help.
> > > > > > > c) This can perhaps benefit someone else
> > > > > > > d) Check whether a UDF like the one I'm writing already exists
> > > > > > >
> > > > > > > Relevant facts:
> > > > > > >
> > > > > > > - We have multiple tenants and they collect variable amounts
> > > > > > >   of data (from almost none to quite a lot)
> > > > > > > - Each tenant has multiple data-sources (aka time-series / tables)
> > > > > > > - Data is processed to Parquet at variable intervals depending
> > > > > > >   on the tenant and his volume
> > > > > > > - Parquet files cannot be updated/appended to in Drill
> > > > > > >   ("create from" creates a new directory (aka table))
> > > > > > >
> > > > > > > The structure we intend to use:
> > > > > > >
> > > > > > > - / <root-directory> / [processed|streaming] / <tenant> /
> > > > > > >   <data-source> / <year> / <month> / <date>
> > > > > > >   example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > > >
> > > > > > > - where *processed* is the sub-directory used for Parquet files
> > > > > > >   and the *streaming* sub-directory is used for "raw materials"
> > > > > > >   (json/csv/logs etc.)
> > > > > > >   This structure also allows us to have dev-nodes with a local
> > > > > > >   streaming directory and a remote (hdfs/s3) processed directory
> > > > > > >
> > > > > > > How often data is processed, and how it is stored, will vary
> > > > > > > between tenants:
> > > > > > >
> > > > > > > - small tenant:
> > > > > > >   /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > > > - medium tenant:
> > > > > > >   /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > > > - ...
> > > > > > > (you get the picture)
> > > > > > >
> > > > > > > *Important:* this means that in some cases we will have empty
> > > > > > > or missing sub-directories, and in some cases we will have all
> > > > > > > the data in the root directory, but the queries we run will
> > > > > > > always be the same ones (regardless of tenant-specific storage
> > > > > > > "layout").
> > > > > > >
> > > > > > > When we do selects we use *union* to query both the *processed*
> > > > > > > and *streaming* directory structures and then merge the data
> > > > > > > (if needed (for grouping etc.)).
> > > > > > >
> > > > > > > This is all working just fine (so far, so good).
> > > > > > >
> > > > > > > Now I'm creating a small UDF that is supposed to help with the
> > > > > > > scanning of these directories.
> > > > > > >
> > > > > > > Requirements:
> > > > > > >
> > > > > > > - Avoid scanning/opening files in directories that are irrelevant
> > > > > > > - Include files in every directory of full/partial relevance
> > > > > > > - Minimize the additional overhead of including this UDF in
> > > > > > >   the SQL command.
> > > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > > >
> > > > > > > - @Param TimeStampHolder from;
> > > > > > > - @Param NullableTimeStampHolder to;
> > > > > > > - @Param NullableVarCharHolder yearDir;
> > > > > > > - @Param NullableVarCharHolder monthDir;
> > > > > > > - @Param NullableVarCharHolder dayDir;
> > > > > > >
> > > > > > > It currently returns an int (1 if in range, and other values
> > > > > > > depending on what failed).
> > > > > > >
> > > > > > > The function is defined like this:
> > > > > > >
> > > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > > >
> > > > > > > ...and it's called like this:
> > > > > > >
> > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > > *dirInRange*(cast('2015-04-10' as timestamp), cast('2015-07-11'
> > > > > > > as timestamp), dir0, dir1, dir2) = 1 order by occurred_at;"
> > > > > > >
> > > > > > > This is not working and I have several questions:
> > > > > > >
> > > > > > > - Is it possible that this function is only called if/when
> > > > > > >   dir0, dir1, and dir2 all exist?
> > > > > > >   - If that is the case, would it not be better for these
> > > > > > >     utility-variables to exist with a null value until needed/used?
> > > > > > >   - Having a Nullable*Holder seems to play no role in this at all
> > > > > > >
> > > > > > > - Is there anything I can do to have this expression evaluated
> > > > > > >   only once per directory?
> > > > > > > - Is there anything I can do to have this evaluated before
> > > > > > >   the files in the directory are processed?
> > > > > > >
> > > > > > > - How do I pass a null value into this?
> > > > > > >   - this fails: *dirInRange*(cast('2015-04-10' as
> > > > > > >     timestamp), null, dir0, dir1, dir2)
> > > > > > >     Error: PARSE ERROR: From line 1, column 82 to line 1,
> > > > > > >     column 85: Illegal use of 'NULL'
> > > > > > >
> > > > > > > This will be a part of my mini UDF library
> > > > > > > (https://github.com/acmeguy/asdrill), feel free to use/abuse
> > > > > > > it if it helps.
> > > > > > >
> > > > > > > Regards,
> > > > > > > -Stefán
> > > > >
> > > > > --
> > > > >
> > > > > Abdelhakim Deneche
> > > > >
> > > > > Software Engineer
> > > > >
> > > > > <http://www.mapr.com/>
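The isSet convention Jacques corrects in the thread can be illustrated with a small stand-in for Drill's holder class. The mock below is a simplification for illustration only (the real NullableVarCharHolder stores bytes in a buffer with start/end offsets, not a String), but the isSet-as-int check is the point: since isSet is an int, `if (!h.isSet)` does not compile in Java, and the null check must compare against 1.

```java
// Simplified mock of Drill's NullableVarCharHolder (assumption: real holder
// uses a DrillBuf plus offsets; only the isSet convention is modeled here).
public class IsSetDemo {

    static class NullableVarCharHolder {
        int isSet;    // 1 if a value is present, 0 if the argument is SQL NULL
        String value; // simplified stand-in for the buffer-backed value
    }

    /** Returns the held value, or a fallback when the holder carries SQL NULL. */
    static String orDefault(NullableVarCharHolder h, String fallback) {
        if (!(h.isSet == 1)) {  // the corrected check: compare the int against 1
            return fallback;
        }
        return h.value;
    }
}
```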