Hi,

I understand how this can be useful when both row/record and directory
conditions contribute to a result, but then there is huge optimization
potential left unexploited.  (I'm not fully understanding whether this
"directory failing" actually happens or not.)
- If this does not eventually fail whole directories then there is little
or no gain to be had when dealing with data partitioned this way.

I do know that when the right-hand side of a dirN = fn('X') condition
contains no field references, only "static" query information (like the
date range for the query), it could be evaluated without touching any
data other than the directory information.
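
To illustrate what I mean, a hypothetical sketch against our test layout
(assuming dir0/dir1 hold year/month): this filter references only
directory columns and query constants, so it could in principle be
answered from directory names alone:

select occurred_at
from dfs.tmp.`/analytics/processed/test/events`
where dir0 = '2015' and dir1 >= '04' and dir1 <= '07';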

Thanks for taking the time to educate/indulge me.

Regards,
 -Stefan




On Fri, Jul 24, 2015 at 10:10 PM, Jason Altekruse <altekruseja...@gmail.com>
wrote:

> A little clarification on that point. The directory filters are not
> syntactically separated from filters on regular columns that we read out
> of the files themselves. Without optimization, the easiest way to think
> about the directory columns is as extra data added to each record coming
> out of the scan operation. Obviously this creates a lot of redundant
> copies of the same directory names, but in its simplest form this is what
> the directory column feature is meant to provide at execution time.
>
> What this allows users to do is to consider directory names as a part of
> their data. While the directory names are often representative of some
> property that will be filtered frequently in isolation, it is possible that
> a filter might involve both a directory and a regular column from the data
> files.
>
> For example, maybe you have date ranges, and typical filters are on the
> start date. It would make sense to partition on start date. In this case,
> filters on start dates can easily be evaluated at planning and provide scan
> pruning (find events starting in the last week, last 30 days, etc). If you
> store end dates in the files themselves, a query that needs to filter on an
> interval length could not be run as a pure directory filter. In this case
> we would need to do a scan of all of the files and evaluate the filter on
> each record.
>
> directory/
>      06-27-2015
>      06-29-2015
>
> data
> { "action" : "shipment", "end_date" : 07-01-2015" }
>
> We need to identify these cases as ones that involve directory columns
> but cannot be evaluated when planning the scan. A filter on just the
> directory column should be evaluated at planning time, lead to a scan
> over fewer files, and in the best case remove the filter condition on the
> directory column entirely. This is possible because limiting the scan to
> a specific set of files already guarantees that only data matching that
> filter is read. It is not functionally necessary to remove the filter
> operation, however; it will just re-evaluate the filter expression on all
> of the redundant copies of the directory names, and all of the records
> should match the filter.
>
> It seems from your previous comments that you understand this process,
> but the important part I wanted to emphasize is that there are cases
> where we cannot evaluate the directory filter before reading the data,
> because the filter can also reference other columns. It is possible that
> the process that determines when the filter should or should not be
> removed is not working properly.
>
> If you can check whether the plan shows that only a subset of the files
> is being scanned, we can at least be sure that pruning is happening
> appropriately. We should also look at the removal of this filter
> condition from the execution plan, but the directory pruning is more
> important to verify first, as it will most likely have a larger impact on
> query performance.
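>
> For example (a sketch, prepending EXPLAIN to a query like yours from
> further down in this thread):
>
> explain plan for
> select occurred_at, dir0, dir1, dir2
> from dfs.tmp.`/analytics/processed/test/events`
> where dir0 = '2015';
>
> The plan output lists the files included in the scan; if pruning worked
> you should see only files under the matching directories.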
>
> Obviously the indeterminate results are very concerning. Please share any
> additional details you can about reproducing this if you figure out
> anything more specific.
>
> On Fri, Jul 24, 2015 at 2:32 PM, Stefán Baxter <ste...@activitystream.com>
> wrote:
>
> > Hi Jason,
> >
> > I will share this code tomorrow on GitHub so you can review it there,
> > if that helps.
> >
> > When I was testing this, earlier today, I saw, to my surprise, that the
> > query sometimes returned results. This was not consistent and I could
> > run exactly the same statement with two different results (no data or
> > part of the test data).
> > I will let you know if I find a reproducible pattern or at least a way
> > to reproduce this at all.
> >
> > I'm not sure I understand what "the filter above the scan is not being
> > removed as it should be" means, but, without knowing anything about the
> > internals, I assumed that an expression with dirN on the left side
> > would at least "fail the whole directory" (hope that makes sense).
> >
> > I'll let you know when the code is in a good-enough state and I have
> > pushed it to GitHub.
> >
> > Thanks for all the help guys, it's appreciated.
> >
> > Regards,
> >  -Stefan
> >
> >
> >
> > On Fri, Jul 24, 2015 at 8:46 PM, Jason Altekruse <altekruseja...@gmail.com>
> > wrote:
> >
> > > I'm not sure; it is possible that it is being evaluated during
> > > planning to prune the scan, but that the filter above the scan is not
> > > being removed as it should be. I'll try to re-create the case to take
> > > a look.
> > >
> > > Stefan,
> > > Earlier you had mentioned that it was not only inefficient, but was
> > > also giving you back data when none matched the filter condition. Is
> > > this still an issue, or did the comments from Hakim and Jacques allow
> > > you to fix an issue with the UDF that was causing this?
> > >
> > > On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <jacq...@dremio.com>
> > > wrote:
> > >
> > > >   - This is being called for *every record for every file in every
> > > >    directory*
> > > >
> > > > Are you sure?  Constant reduction should take care of this.
> > > > @Jason, any ideas why it might be failing?
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <ste...@activitystream.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > thanks for the tips.
> > > > >
> > > > > Observation:
> > > > >
> > > > >    - This is being called for *every record for every file in every
> > > > >    directory*
> > > > >
> > > > > Can you please tell me what needs to be done to make sure this is
> > > > > only called once for each directory, preferably before the files
> > > > > in that directory are opened/scanned?
> > > > >
> > > > > Regards,
> > > > >   -Stefán
> > > > >
> > > > > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <jacq...@dremio.com>
> > > > > wrote:
> > > > >
> > > > > > Two quick notes:
> > > > > >
> > > > > > - If you switch to internal null handling, you have to define
> > > > > >   separate UDFs for each possible combination of nullable and
> > > > > >   non-nullable values.
> > > > > > - isSet is an integer, so your if clause would actually be:
> > > > > >
> > > > > > if (! (yearDir.isSet == 1) ) {
> > > > > >   // yearDir is NULL, handle this here
> > > > > > }
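> > > > > >
> > > > > > For example, a minimal sketch of one nullable variant (names and
> > > > > > body assumed, imports omitted; simplified from your dirInRange):
> > > > > >
> > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > >     nulls = FunctionTemplate.NullHandling.INTERNAL)
> > > > > > public static class DirInRangeNullableYear implements DrillSimpleFunc {
> > > > > >
> > > > > >   @Param TimeStampHolder from;
> > > > > >   @Param NullableVarCharHolder yearDir;
> > > > > >   @Output IntHolder out;
> > > > > >
> > > > > >   public void setup() {}
> > > > > >
> > > > > >   public void eval() {
> > > > > >     if (yearDir.isSet == 0) {
> > > > > >       // yearDir is NULL: decide here what a missing directory
> > > > > >       // level means (this sketch treats it as "in range")
> > > > > >       out.value = 1;
> > > > > >       return;
> > > > > >     }
> > > > > >     // ... real range check on the directory name goes here ...
> > > > > >     out.value = 1;
> > > > > >   }
> > > > > > }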
> > > > > >
> > > > > > --
> > > > > > Jacques Nadeau
> > > > > > CTO and Co-Founder, Dremio
> > > > > >
> > > > > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <adene...@maprtech.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Stefán,
> > > > > > >
> > > > > > > I think when you specify your UDF as NULL_IF_NULL it means
> > > > > > > Drill will handle null values automatically: if any argument
> > > > > > > passed to your UDF is NULL, the UDF won't be evaluated and
> > > > > > > Drill will return NULL instead.
> > > > > > >
> > > > > > > In your case your UDF needs to handle NULL values by setting:
> > > > > > >
> > > > > > > nulls = NullHandling.INTERNAL
> > > > > > >
> > > > > > > Then inside your UDF you can check if a passed argument is
> > > > > > > null like this:
> > > > > > >
> > > > > > > if (!yearDir.isSet) {
> > > > > > >   // yearDir is NULL, handle this here
> > > > > > > }
> > > > > > >
> > > > > > > Let me know if this works for you; I haven't worked on UDFs
> > > > > > > for quite some time now and they may have changed slightly
> > > > > > > since then.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <ste...@activitystream.com>
> > > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I have this running now:
> > > > > > > >
> > > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > > > > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11'
> as
> > > > > > > >
> > > > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> > > > > > > order
> > > > > > > > by occurred_at;"
> > > > > > > >
> > > > > > > >
> > > > > > > > Observations:
> > > > > > > >
> > > > > > > >    - Being able to do this without using COALESCE would be
> > > > > > > >      more readable
> > > > > > > >
> > > > > > > >    - This is being called for every record for every file in
> > > > > > > >      every directory
> > > > > > > >    - I would love for this expression to be evaluated only at
> > > > > > > >      "the start of a new directory"
> > > > > > > >
> > > > > > > >    - This function always returns 1 and so should never
> > > > > > > >      match dir0
> > > > > > > >    - For some strange reason this query sometimes returns
> > > > > > > >      results even though it should never do that (dir0 != 1)
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >  -Stefan
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <ste...@activitystream.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I would like to share our intentions for organizing our
> > > > > > > > > data and how we plan to construct queries for it.
> > > > > > > > >
> > > > > > > > > There are four main reasons for sharing this:
> > > > > > > > >
> > > > > > > > > a) I would like to sanity check the approach
> > > > > > > > > b) I'm having a hard time writing a UDF to optimize this
> > > > > > > > >    and need a bit of help.
> > > > > > > > >
> > > > > > > > > c) This can perhaps benefit someone else
> > > > > > > > >
> > > > > > > > > d) Check if a UDF like the one I'm writing already exists
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Relevant facts:
> > > > > > > > >
> > > > > > > > >    - We have multiple tenants and they collect variable
> > > > > > > > >      amounts of data (from almost none to quite a lot)
> > > > > > > > >    - Each tenant has multiple data-sources (aka time-series
> > > > > > > > >      / tables)
> > > > > > > > >    - Data is processed to Parquet at variable intervals
> > > > > > > > >      depending on the tenant and their volume
> > > > > > > > >    - Parquet files can not be updated/appended to in Drill
> > > > > > > > >      (CREATE TABLE AS creates a new directory (aka table))
> > > > > > > > >
> > > > > > > > > The structure we intend to use:
> > > > > > > > >
> > > > > > > > >    - / <root-directory> / [processed|streaming] / <tenant> /
> > > > > > > > >      <data-source> / <year> / <month> / <date>
> > > > > > > > >      example: "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > > > > >
> > > > > > > > >    - where *processed* is the sub-directory used for parquet
> > > > > > > > >      files and the *streaming* sub-directory is used for
> > > > > > > > >      "raw materials" (json/csv/logs etc.);
> > > > > > > > >      this structure also allows us to have dev-nodes with a
> > > > > > > > >      local streaming directory and a remote (hdfs/s3)
> > > > > > > > >      processed directory
> > > > > > > > >
> > > > > > > > > How often data is processed, and how it is stored, will
> > > > > > > > > vary between tenants:
> > > > > > > > >
> > > > > > > > >    - small tenant:
> > > > > > > > >      /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > > > > >    - medium tenant:
> > > > > > > > >      /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > > > > >    - ... (you get the picture)
> > > > > > > > >
> > > > > > > > >    *Important:* this means that in some cases we will have
> > > > > > > > >    empty or missing sub-directories, and in some cases we
> > > > > > > > >    will have all the data in the root directory, but the
> > > > > > > > >    queries we run will always be the same ones (regardless
> > > > > > > > >    of tenant-specific storage "layout").
> > > > > > > > >
> > > > > > > > > When we do selects we use *union* to query both the
> > > > > > > > > *processed* and *streaming* directory structures, and then
> > > > > > > > > merge the data (if needed, e.g. for grouping). Roughly like
> > > > > > > > > the sketch below.
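> > > > > > > > >
> > > > > > > > > A simplified sketch (paths and columns assumed; the real
> > > > > > > > > queries have more going on):
> > > > > > > > >
> > > > > > > > > select occurred_at
> > > > > > > > > from dfs.tmp.`/analytics/processed/tenantA/pageviews`
> > > > > > > > > union all
> > > > > > > > > select occurred_at
> > > > > > > > > from dfs.tmp.`/analytics/streaming/tenantA/pageviews`;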
> > > > > > > > >
> > > > > > > > > This is all working just fine (so far, so good).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Now I'm creating a small UDF that is supposed to help with
> > > > > > > > > the scanning of these directories.
> > > > > > > > >
> > > > > > > > > Requirements:
> > > > > > > > >
> > > > > > > > >    - Avoid scanning/opening files in directories that are
> > > > > > > > >      irrelevant
> > > > > > > > >    - Include files in every directory of full/partial
> > > > > > > > >      relevance
> > > > > > > > >    - Minimize the additional overhead of including this UDF
> > > > > > > > >      in the SQL command
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > > > > >
> > > > > > > > >    - @Param TimeStampHolder from;
> > > > > > > > >    - @Param NullableTimeStampHolder to;
> > > > > > > > >    - @Param NullableVarCharHolder yearDir;
> > > > > > > > >    - @Param NullableVarCharHolder monthDir;
> > > > > > > > >    - @Param NullableVarCharHolder dayDir;
> > > > > > > > >
> > > > > > > > > It currently returns an int (1 if in range, other values
> > > > > > > > > depending on what failed).
> > > > > > > > >
> > > > > > > > > The function is defined like this:
> > > > > > > > >
> > > > > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > > > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > > > > >
> > > > > > > > > ... and it's called like this:
> > > > > > > > >
> > > > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > > > *dirInRange*(cast('2015-04-10'
> > > > > > > > > as timestamp),cast('2015-07-11' as
> timestamp),dir0,dir1,dir2)
> > > = 1
> > > > > > order
> > > > > > > > by
> > > > > > > > > occurred_at;"
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
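> > > > > > > > > For reference, the class body currently looks roughly like
> > > > > > > > > this (a trimmed sketch; the real range logic is longer):
> > > > > > > > >
> > > > > > > > > public static class dirInRange implements DrillSimpleFunc {
> > > > > > > > >
> > > > > > > > >   @Param TimeStampHolder from;
> > > > > > > > >   @Param NullableTimeStampHolder to;
> > > > > > > > >   @Param NullableVarCharHolder yearDir;
> > > > > > > > >   @Param NullableVarCharHolder monthDir;
> > > > > > > > >   @Param NullableVarCharHolder dayDir;
> > > > > > > > >   @Output IntHolder out;
> > > > > > > > >
> > > > > > > > >   public void setup() {}
> > > > > > > > >
> > > > > > > > >   public void eval() {
> > > > > > > > >     // trimmed: parse yearDir/monthDir/dayDir and compare
> > > > > > > > >     // against from/to; 1 means "in range", other values
> > > > > > > > >     // encode which level failed
> > > > > > > > >     out.value = 1;
> > > > > > > > >   }
> > > > > > > > > }
> > > > > > > > >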
> > > > > > > > > This is not working and I have several questions:
> > > > > > > > >
> > > > > > > > >    - Is it possible that this function is only called
> > > > > > > > >      if/when dir0, dir1, and dir2 all exist?
> > > > > > > > >    - If that is the case, would it not be better if these
> > > > > > > > >      utility-variables existed with a null value until
> > > > > > > > >      needed/used?
> > > > > > > > >    - Having a Nullable*Holder seems to play no role in this
> > > > > > > > >      at all
> > > > > > > > >
> > > > > > > > >    - Is there anything I can do to have this expression
> > > > > > > > >      evaluated only once per directory?
> > > > > > > > >
> > > > > > > > >    - Is there anything I can do to have this evaluated
> > > > > > > > >      before the files in the directory are processed?
> > > > > > > > >
> > > > > > > > >    - How do I pass a null value into this?
> > > > > > > > >      this fails:  *dirInRange*(cast('2015-04-10' as
> > > > > > > > >      timestamp), null, dir0, dir1, dir2)
> > > > > > > > >      Error: PARSE ERROR: From line 1, column 82 to line 1,
> > > > > > > > >      column 85: Illegal use of 'NULL'
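> > > > > > > > >      (I suspect a typed null might parse where a bare NULL
> > > > > > > > >      does not -- an untested sketch:
> > > > > > > > >      *dirInRange*(cast('2015-04-10' as timestamp),
> > > > > > > > >        cast(null as timestamp), dir0, dir1, dir2)
> > > > > > > > >      but I have not verified this.)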
> > > > > > > > >
> > > > > > > > > This will be a part of my mini UDF library
> > > > > > > > > (https://github.com/acmeguy/asdrill), feel free to
> > > > > > > > > use/abuse it if it helps.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >  -Stefán
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Abdelhakim Deneche
> > > > > > >
> > > > > > > Software Engineer
> > > > > > >
> > > > > > >   <http://www.mapr.com/>
> > > > > > >
> > > > > > >
> > > > > > > Now Available - Free Hadoop On-Demand Training
> > > > > > > <http://www.mapr.com/training?utm_source=Email&utm_medium=Signature&utm_campaign=Free%20available>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
