A little clarification on that point. The directory filters are not
syntactically separated from filters on regular columns that we read out of
the files themselves. Without optimization, the easiest way to think about
the directory columns is as extra data added to each record coming out of
the scan operation. Obviously this creates a lot of redundant copies of the
same directory names, but in its simplest form this is what the directory
column feature is meant to provide at execution time.
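
To make this concrete, here is a hedged sketch (the table path and the
action column are illustrative, matching the example further down) of what
the scan conceptually produces before any optimization:

select dir0, t.`action`
from dfs.`/directory` as t;

-- hypothetical output: the sub-directory name is repeated on every
-- record read from files beneath it
-- 06-27-2015 | shipment
-- 06-27-2015 | return
-- 06-29-2015 | shipment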

What this allows users to do is treat directory names as part of their
data. While the directory names often represent some property that will
frequently be filtered on in isolation, it is possible that a filter might
involve both a directory column and a regular column from the data files.

For example, maybe you have date ranges, and typical filters are on the
start date. It would make sense to partition on start date. In this case,
filters on start dates can easily be evaluated at planning time and provide
scan pruning (find events starting in the last week, last 30 days, etc.).
If you store end dates in the files themselves, a query that needs to
filter on an interval length could not be run as a pure directory filter.
In that case we would need to scan all of the files and evaluate the filter
on each record.

directory/
     06-27-2015
     06-29-2015

data
{ "action" : "shipment", "end_date" : 07-01-2015" }

We need to identify these cases as ones that involve directory columns but
cannot be evaluated when planning the scan. A filter on just the directory
column should be evaluated at planning time, lead to a scan over fewer
files, and in the best case remove the filter condition on the directory
column entirely. This is possible because limiting the scan to a specific
set of files already guarantees that only data matching that filter is
read. Removing the filter operation is not functionally necessary, however;
it would just re-evaluate the filter expression on all of the redundant
copies of the directory names, and all of the records would match the
filter.

It seems from your previous comments that you understand this process, but
the important part I wanted to emphasize is that there are cases where we
cannot evaluate the directory filter before reading the data, because the
filter expression may also involve regular columns. It is possible that the
process that determines when the filter should or should not be removed is
not working properly.

If you can check whether the plan shows that a subset of the files is being
scanned, we can at least be sure that pruning is happening appropriately.
We should also look at the removal of this filter condition from the
execution plan, but the directory pruning is more important to verify
first, as it will most likely have a larger impact on query performance.
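
For reference, a hedged sketch of that check (the query is illustrative,
but explain plan for is standard Drill syntax):

explain plan for
select t.`action`
from dfs.`/directory` as t
where dir0 = '06-27-2015';

-- In the physical plan, the Scan operator lists the files selected for
-- scanning: if pruning worked, only files under 06-27-2015 should appear,
-- and in the best case no Filter on dir0 remains above the Scan.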

Obviously the indeterminate results are very concerning. Please share any
additional details you can about reproducing this if you figure out
anything more specific.

On Fri, Jul 24, 2015 at 2:32 PM, Stefán Baxter <ste...@activitystream.com>
wrote:

> Hi Jason,
>
> I will share this code tomorrow on github so you can review it there if
> it helps.
>
> When I was testing this, earlier today, I saw, to my surprise, that the
> query sometimes returned results. This was not constant and I could run
> exactly the same statement with two different results (no data or part of
> the test data).
> I will let you know if I find a reproducible pattern or at least a way to
> produce this at all.
>
> I'm not sure I understand what "filter above the scan is not being removed
> as it should be" means, but I, without knowing anything about it, assumed
> that an expression with a dirN on its left side would at least "fail the
> whole directory" (hope that makes sense).
>
> I'll let you know when the code is in a good-enough state and I have pushed
> it to github.
>
> Thanks for all the help guys, it's appreciated.
>
> Regards,
>  -Stefan
>
>
>
> On Fri, Jul 24, 2015 at 8:46 PM, Jason Altekruse <altekruseja...@gmail.com
> >
> wrote:
>
> > I'm not sure, it is possible that it is being evaluated during planning
> > to prune the scan, but the filter above the scan is not being removed as
> > it should be. I'll try to re-create the case to take a look.
> >
> > Stefan,
> > Earlier you had mentioned that it was not only inefficient, but it was
> > also giving you back data when none matched the filter condition. Is this
> > still an issue, or did the comments from Hakim and Jacques allow you to
> > fix an issue with the UDF that was causing this?
> >
> > On Fri, Jul 24, 2015 at 11:03 AM, Jacques Nadeau <jacq...@dremio.com>
> > wrote:
> >
> > >   - This is being called for *every record for every file in every
> > >    directory*
> > >
> > > Are you sure?  Constant reduction should take care of this.  @Jason,
> > > any ideas why it might be failing?
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Fri, Jul 24, 2015 at 10:45 AM, Stefán Baxter <
> > ste...@activitystream.com
> > > >
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > thanks for the tips.
> > > >
> > > > Observation:
> > > >
> > > >    - This is being called for *every record for every file in every
> > > >    directory*
> > > >
> > > > Can you please tell me what needs to be done to make sure this is
> > > > only called once for each directory, preferably before files in that
> > > > directory are opened/scanned.
> > > >
> > > > Regards,
> > > >   -Stefán
> > > >
> > > > On Fri, Jul 24, 2015 at 3:28 PM, Jacques Nadeau <jacq...@dremio.com>
> > > > wrote:
> > > >
> > > > > Two quick notes:
> > > > >
> > > > > - If you switch to internal null handling, you have to define
> > > > > separate UDFs for each possible combination of nullable and
> > > > > non-nullable values.
> > > > > - isSet is an integer, so your if clause would actually be:
> > > > >
> > > > > if (! (yearDir.isSet == 1) ) {
> > > > >   // yearDir is NULL, handle this here
> > > > > }
> > > > >
> > > > > --
> > > > > Jacques Nadeau
> > > > > CTO and Co-Founder, Dremio
> > > > >
> > > > > On Fri, Jul 24, 2015 at 8:10 AM, Abdel Hakim Deneche <
> > > > > adene...@maprtech.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Stefán,
> > > > > >
> > > > > > I think when you specify your UDF as NULL_IF_NULL it means Drill
> > > > > > will handle null values automatically: if any passed argument to
> > > > > > your UDF is NULL, the UDF won't be evaluated and Drill will return
> > > > > > NULL instead.
> > > > > >
> > > > > > In your case your UDF needs to handle NULL values by setting:
> > > > > >
> > > > > > nulls = NullHandling.INTERNAL
> > > > > >
> > > > > > Then inside your UDF you can check if a passed argument is null
> > > > > > like this:
> > > > > >
> > > > > > if (!yearDir.isSet) {
> > > > > >   // yearDir is NULL, handle this here
> > > > > > }
> > > > > >
> > > > > > Let me know if this works for you; I haven't worked on UDFs for
> > > > > > quite some time now and they may have changed slightly since then.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 24, 2015 at 7:37 AM, Stefán Baxter <
> > > > > ste...@activitystream.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have this running now:
> > > > > > >
> > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where dir0 =
> > > > > > > dirInRange(cast('2015-04-10' as timestamp),cast('2015-07-11' as
> > > > > > >
> > > timestamp),COALESCE(dir0,'-'),COALESCE(dir1,'-'),COALESCE(dir2,'-'))
> > > > > > order
> > > > > > > by occurred_at;"
> > > > > > >
> > > > > > >
> > > > > > > Observations:
> > > > > > >
> > > > > > >    - Being able to do this without using COALESCE would be more
> > > > > > >    readable
> > > > > > >
> > > > > > >    - This is being called for every record for every file in
> > > > > > >    every directory
> > > > > > >    - I would love for this expression to only be evaluated at
> > > > > > >    "the start of a new directory"
> > > > > > >
> > > > > > >    - This function always returns 1 and should never match dir0
> > > > > > >    - For some strange reason this query sometimes returns results
> > > > > > >    even though it should never do that (dir0 != 1)
> > > > > > >
> > > > > > > Regards,
> > > > > > >  -Stefan
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jul 24, 2015 at 12:12 PM, Stefán Baxter <
> > > > > > ste...@activitystream.com
> > > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > I would like to share our intentions for organizing our data
> > > > > > > > and how we plan to construct queries for it.
> > > > > > > >
> > > > > > > > There are four main reasons for sharing this:
> > > > > > > >
> > > > > > > > a) I would like to sanity check the approach
> > > > > > > > b) I'm having a hard time writing a UDF to optimize this and
> > > > > > > > need a bit of help.
> > > > > > > >
> > > > > > > > c) This can perhaps benefit someone else
> > > > > > > >
> > > > > > > > d) Check whether a UDF like the one I'm writing already exists
> > > > > > > >
> > > > > > > >
> > > > > > > > Relevant facts:
> > > > > > > >
> > > > > > > >    - We have multiple tenants and they collect variable
> > > > > > > >    amounts of data (from almost none to quite a lot)
> > > > > > > >    - Each tenant has multiple data-sources (aka time-series /
> > > > > > > >    tables)
> > > > > > > >    - Data is processed to Parquet at variable intervals
> > > > > > > >    depending on the tenant and their volume
> > > > > > > >    - Parquet files cannot be updated/appended to in Drill
> > > > > > > >    (Create from creates a new directory (aka table))
> > > > > > > >
> > > > > > > > The structure we intend to use:
> > > > > > > >
> > > > > > > >    - / <root-directory> / [processed|streaming] / <tenant> /
> > > > > > > >    <data-source> / <year> / <month> / <date>
> > > > > > > >    example:
> > > > > > > >    "/analytics/processed/some-tenant-name/events/2015/07/24"
> > > > > > > >
> > > > > > > >    - where *processed* is the sub-directory used for Parquet
> > > > > > > >    files and the *streaming* sub-directory is used for "raw
> > > > > > > >    materials" (json/csv/logs etc.)
> > > > > > > >    this structure also allows us to have dev-nodes with a local
> > > > > > > >    streaming directory and a remote (hdfs/s3) processed
> > > > > > > >    directory
> > > > > > > >
> > > > > > > > It will vary between tenants how often data is processed and
> > > > > > > > how data is stored:
> > > > > > > >
> > > > > > > >    - small tenant:
> > > > > > > >    /analytics/processed/tenantA/pageviews/2014/on-year-partitioned.parquet
> > > > > > > >    - medium tenant:
> > > > > > > >    /analytics/processed/tenantA/pageviews/2015/06/on-month-partitioned.parquet
> > > > > > > >    - ... (you get the picture)
> > > > > > > >
> > > > > > > >    *Important:* this means that in some cases we will have
> > > > > > > >    empty or missing sub-directories and in some cases we will
> > > > > > > >    have all the data in the root directory, but the queries we
> > > > > > > >    run will always be the same ones (regardless of
> > > > > > > >    tenant-specific storage "layout").
> > > > > > > >
> > > > > > > > When we do selects we use *union* to query both the
> > > > > > > > *processed* and *streaming* directory structures and then
> > > > > > > > merge the data (if needed (for grouping etc.)).
> > > > > > > >
> > > > > > > > This is all working just fine (so far, so good).
> > > > > > > >
> > > > > > > >
> > > > > > > > Now I'm creating a small UDF that is supposed to help with the
> > > > > > > > scanning of these directories.
> > > > > > > >
> > > > > > > > Requirements:
> > > > > > > >
> > > > > > > >    - Avoid scanning/opening files in directories that are
> > > > > > > >    irrelevant
> > > > > > > >    - Include files in every directory of full/partial relevance
> > > > > > > >    - Minimize the additional overhead of including this UDF in
> > > > > > > >    the SQL command.
> > > > > > > >
> > > > > > > >
> > > > > > > > The UDF is called dirInRange and it takes these parameters:
> > > > > > > >
> > > > > > > >    - @Param TimeStampHolder from;
> > > > > > > >    - @Param NullableTimeStampHolder to;
> > > > > > > >    - @Param NullableVarCharHolder yearDir;
> > > > > > > >    - @Param NullableVarCharHolder monthDir;
> > > > > > > >    - @Param NullableVarCharHolder dayDir;
> > > > > > > >
> > > > > > > > It currently returns an int (1 if in range and other values
> > > > > > > > depending on what failed).
> > > > > > > >
> > > > > > > > The function is defined like this:
> > > > > > > >
> > > > > > > > @FunctionTemplate(name = "dirInRange",
> > > > > > > >     scope = FunctionTemplate.FunctionScope.SIMPLE,
> > > > > > > >     nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
> > > > > > > > public static class dirInRange implements DrillSimpleFunc
> > > > > > > >
> > > > > > > > ... and it's called like this:
> > > > > > > >
> > > > > > > > "select occurred_at, dir0, dir1, dir2 from
> > > > > > > > dfs.tmp.`/analytics/processed/test/events` as t where
> > > > > > > > *dirInRange*(cast('2015-04-10' as timestamp),
> > > > > > > > cast('2015-07-11' as timestamp), dir0, dir1, dir2) = 1
> > > > > > > > order by occurred_at;"
> > > > > > > >
> > > > > > > > This is not working and I have several questions:
> > > > > > > >
> > > > > > > >    - Is it possible that this function is only called if/when
> > > > > > > >    dir0, dir1, and dir2 all exist?
> > > > > > > >    - If that is the case, would it not be better if these
> > > > > > > >    utility-variables existed with a null value until
> > > > > > > >    needed/used?
> > > > > > > >    - Having a Nullable*Holder seems to play no role in this at
> > > > > > > >    all.
> > > > > > > >
> > > > > > > >    - Is there anything I can do to have this expression
> > > > > > > >    evaluated only once per directory?
> > > > > > > >
> > > > > > > >    - Is there anything I can do to have this evaluated before
> > > > > > > >    the files in the directory are processed?
> > > > > > > >
> > > > > > > >    - How do I pass a null value into this?
> > > > > > > >    - this fails:  *dirInRange*(cast('2015-04-10' as
> > > > > > > >    timestamp), null, dir0, dir1, dir2)
> > > > > > > >    Error: PARSE ERROR: From line 1, column 82 to line 1,
> > > > > > > >    column 85: Illegal use of 'NULL'
> > > > > > > >
> > > > > > > > This will be a part of my mini UDF library
> > > > > > > > (https://github.com/acmeguy/asdrill), feel free to use/abuse it
> > > > > > > > if it helps.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > >  -Stefán
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > >
> > > > > > Abdelhakim Deneche
> > > > > >
> > > > > > Software Engineer
> > > > > >
