[jira] [Created] (DRILL-6314) Add complex types to row set abstraction

2018-04-08 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-6314:
--

 Summary: Add complex types to row set abstraction
 Key: DRILL-6314
 URL: https://issues.apache.org/jira/browse/DRILL-6314
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.14.0
Reporter: Paul Rogers
Assignee: Paul Rogers


Next step in the [Record Batch 
Sizing|https://github.com/paul-rogers/drill/wiki] project: extend the row set 
abstractions with Union, List and Repeated List types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: "Death of Schema-on-Read"

2018-04-08 Thread Ted Dunning
Inline

On Sun, Apr 8, 2018 at 12:36 PM, Aman Sinha  wrote:

> On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning 
> wrote:
>
> > [cast pushdown is cheaper than you might think]
>
> It is true that the amount of work done by the planner would be about the
> same as when
> determining projection pushdowns into the scan.  In my mind I was
> contrasting with the
> pure DDL based approach with an explicitly specified schema (such as with
> a  'CREATE EXTERNAL TABLE ...' or with per query hints as Paul mentioned).
>

Actually, I think that these are equivalent in many ways. The only
important difference is when the constraints on type are expressed. With
DDL, it is expressed ahead of time.  With cast-pushdown, it is done at the
last moment in the query.  With DDL, we traditionally assume that the
reader has already taken the information into account before the query
starts, but nothing really says that it has to.

The information expressed in either case is essentially identical, however,
and the reader should take heed. We are just allowing late typing if we add
cast pushdown.


... keeping in mind that the same column may be
> referenced in multiple ways:  e.g   CAST(a as varchar(10)),  CAST(a as
> varchar(20))
> in the same query/view.  In such cases, we would want to either not do the
> pushdown or determine the highest common datatype and push that down.
>

I disagree. It is the job of the reader to return both variants if it
accepts the pushdown, or to return the raw data and not do any casting. This
is really quite similar to cases where the column is retrieved twice.

If the reader can do something clever by reading the data once with one
cast and then modifying the column after reading, that's great, but we
can't really assume that it will.
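
For example (the path and column are made up), a reader that accepts the
pushdown for a query like this should produce both typed variants; otherwise
it should decline and hand back the raw column:

  SELECT CAST(a AS VARCHAR(10)) AS a_short,
         CAST(a AS VARCHAR(20)) AS a_long
  FROM dfs.`/raw/t`;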



>
> All of this, though, does not preclude the real need for the 'source of
> truth' of the schema for the cases where data has been already explored and
> curated. We do want to have a solution for that core issue.
>

Why is the answer different from a view with casts (that might be pushed
down)?
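
That is, the curated "source of truth" could simply be a view such as this
(names are illustrative), and the casts inside it are exactly what a cast
pushdown would hand to the reader:

  CREATE VIEW dfs.tmp.curated_logs AS
  SELECT CAST(a AS VARCHAR(20)) AS a,
         CAST(ts AS TIMESTAMP) AS ts
  FROM dfs.`/raw/logs`;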


Re: "Death of Schema-on-Read"

2018-04-08 Thread Aman Sinha
On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning  wrote:

> I have been thinking about this email and I still don't understand some of
> the comments.
>
> On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha  wrote:
>
> > On the subject of CAST pushdown to Scans, there are potential drawbacks
> > ...
> >
> >    - In general, the planner will see a Scan-Project where the Project
> >    has CAST functions.  But the Project can have arbitrary expressions,
> >    e.g. CAST(a as INT) * 5, or a combination of 2 CAST functions or
> >    non-CAST functions, etc.  It would be quite expensive to examine each
> >    expression (there could be hundreds) to determine whether it is
> >    eligible to be pushed to the Scan.
> >
>
> How is this different than filter and project pushdown? There could be
> hundreds of those and it could be difficult for Calcite to find appropriate
> pushdowns. But I have never heard of any problem.
>
> - the traversal of all expressions is already required and already done in
> order to find the set of columns that are being extracted. As such, cast
> pushdown can be done in the same motions as project pushdown.
>

It is true that the amount of work done by the planner would be about the
same as when
determining projection pushdowns into the scan.  In my mind I was
contrasting with the
pure DDL based approach with an explicitly specified schema (such as with
a  'CREATE EXTERNAL TABLE ...' or with per query hints as Paul mentioned).
However, in the absence of those, I agree that it would be a win to do the
'simple' CAST pushdowns, keeping in mind that the same column may be
referenced
in multiple ways:  e.g   CAST(a as varchar(10)),  CAST(a as varchar(20))
in the same query/view.  In such cases, we would want to either not do the
pushdown
or determine the highest common datatype and push that down.

All of this, though, does not preclude the real need for the 'source of
truth' of the schema for the cases where data has been already explored and
curated.
We do want to have a solution for that core issue.

-Aman


Re: "Death of Schema-on-Read"

2018-04-08 Thread Paul Rogers
Hi Ted,

All good points. In the Jira ticket, I mentioned that using casts to infer type
is a special case of a general problem: inferring column type information in a
downward traversal of the parse tree, then refining that type information with
a bottom-up traversal of the DAG at run time. It turns out this problem is
similar to one the dynamic-language community is exploring: automatically
converting dynamic types to static types. There is a paper reference in the ticket.

Our problem occurs, in general, with that bottom-up type inference. Because 
Drill is distributed, different branches of the tree may make different type 
inferences. (We discussed several examples.) We hope that higher nodes can 
reconcile conflicting decisions made by lower nodes, but we've noted cases 
where this is not possible. And, we've noted cases where the scanner itself 
ends up making conflicting decisions (e.g. a long run of nulls followed by 
non-null data that reveals the type.)

Further, schema is a property of the DATA not the QUERY. But, casts are a 
property of the query. So, relying exclusively on a query to provide type 
information must be an incomplete and redundant solution. That said, if type 
information is available only in the query, we should certainly use it to 
resolve column type ambiguity. Cast is simply the most obvious source of 
inference. So, better type inference is helpful, but neither necessary nor 
sufficient.

There is also the mismatch between SQL syntax and the desired simplicity of a 
hint. We want to provide a hint only for a troublesome column or two, but to 
mention one column in a SELECT clause, we must mention all of them; we can no 
longer use the wildcard (which, of course, shouldn't be used, but sure is handy 
during data exploration: the very situation where the type problem is most 
likely to occur.)
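
For example (column names are illustrative), hinting a single troublesome
column today means giving up the wildcard and spelling out every column:

  -- Exploration style: handy, but no place to hang a type hint
  SELECT * FROM dfs.`/raw/t`;

  -- To hint just column a, every other column must now be listed
  SELECT CAST(a AS INT) AS a, b, c, d FROM dfs.`/raw/t`;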

Here it might be useful to recall the subtitle of the article: "Data governance
and the death of schema on read." The author's point is that, in a production
system, schemas cannot be ad-hoc, loosey-goosey (my terms), but rather must be
well understood and agreed upon so that all parties agree on data structure and
meaning. In Drill's world, this means that schemas (or hints) are discovered
during data exploration, the results are captured in some form of metadata, and
that metadata is then used by a wide range of users sitting in Tableau getting
their day-to-day work done. In short, schema (and metadata in general) are
needed to move from exploration into production. I personally am receptive to
this idea because of the many years I spent building BI tools, in which schema
information provided a much more robust and simple user experience.

Thanks,

- Paul

 

On Sunday, April 8, 2018, 10:58:02 AM PDT, Ted Dunning 
 wrote:  
 
 I have been thinking about this email and I still don't understand some of
the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha  wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
>    - In general, the planner will see a Scan-Project where the Project has
>    CAST functions.  But the Project can have arbitrary expressions,  e.g
>    CAST(a as INT) * 5  or a combination of 2 CAST functions or non-CAST
>    functions etc.  It would be quite expensive to examine each expression
>    (there could be hundreds) to determine whether it is eligible to be
>    pushed to the Scan.
>

How is this different than filter and project pushdown? There could be
hundreds of those and it could be difficult for Calcite to find appropriate
pushdowns. But I have never heard of any problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.

- because of the first point, only the roots of arithmetic expressions
would need to be examined. If they have casts, then pushdown should be
tried. If not, don't do it.

- cast pushdown is always a win if supported so there is no large increase
in the complexity of the cost-based optimization search space.

- the traversal of all expressions is already required and already done in
order to find the set of columns that are being extracted. As such, cast
pushdown can be done in the same motions as project pushdown.


>    - Expressing Nullability is not possible with CAST.  If a column should
>    be tagged as  (not)nullable, CAST syntax does not allow that.
>

This may be true. But nullability crosses the cast cleanly. Thus, filter
expressions like [x is not NULL] can be used to constrain nullability, and
there is no requirement that the two constraints (the cast and the
nullability) be near each other syntactically. Furthermore, if the
query does not specify nullability, then the scanner is free to do so.


>    - Drill currently supports CASTing to a SQL data type, but not to the
>    complex types such as arrays and maps.  We would have to add support for
>    

Re: "Death of Schema-on-Read"

2018-04-08 Thread Ted Dunning
I have been thinking about this email and I still don't understand some of
the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha  wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks
> ...
>
>    - In general, the planner will see a Scan-Project where the Project has
>      CAST functions.  But the Project can have arbitrary expressions, e.g.
>      CAST(a as INT) * 5, or a combination of 2 CAST functions or non-CAST
>      functions, etc.  It would be quite expensive to examine each expression
>      (there could be hundreds) to determine whether it is eligible to be
>      pushed to the Scan.
>

How is this different than filter and project pushdown? There could be
hundreds of those and it could be difficult for Calcite to find appropriate
pushdowns. But I have never heard of any problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.

- because of the first point, only the roots of arithmetic expressions
would need to be examined. If they have casts, then pushdown should be
tried. If not, don't do it.

- cast pushdown is always a win if supported so there is no large increase
in the complexity of the cost-based optimization search space.

- the traversal of all expressions is already required and already done in
order to find the set of columns that are being extracted. As such, cast
pushdown can be done in the same motions as project pushdown.


>    - Expressing Nullability is not possible with CAST.  If a column should
>      be tagged as (not)nullable, CAST syntax does not allow that.
>

This may be true. But nullability crosses the cast cleanly. Thus, filter
expressions like [x is not NULL] can be used to constrain nullability, and
there is no requirement that the two constraints (the cast and the
nullability) be near each other syntactically. Furthermore, if the
query does not specify nullability, then the scanner is free to do so.
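
For example (names are illustrative), the cast carries the type while a filter
carries the nullability constraint, and the two need not sit next to each
other:

  SELECT CAST(a AS INT) AS a
  FROM dfs.`/raw/t`
  WHERE a IS NOT NULL;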


>    - Drill currently supports CASTing to a SQL data type, but not to the
>      complex types such as arrays and maps.  We would have to add support for
>      that from a language perspective as well as the run-time.  This would be
>      a non-trivial effort.
>

Well, there is a trivial subset of this effort in that casting a.b.c is
easy to express. Anything more complex is hard for current scanners to use
anyway.
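
For example, a dotted-path cast like this is easy to express (illustrative;
assumes a map column a with a nested field b.c):

  SELECT CAST(t.a.b.c AS INT) AS c
  FROM dfs.`/raw/t` t;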

So deferring most of the work on complex types is a fine solution. It isn't
like SQL has nice syntax for casting of anything.


Re: [DISCUSS] DrillBuf

2018-04-08 Thread Paul Rogers
Hi Vlad,

Thanks for the clarifications. My general comment is that it is always good to 
refactor things if that is the fastest way to achieve some goal. It is not 
clear, however, what the goal is here other than code improvement. Since Drill 
still has plenty of opportunities for improvement that will help users, I guess 
it's a bit unclear why we'd clean up code just for its own sake if doing so 
will entail a large amount of work and provide no user benefit.

I say this because I've done quite a bit of work in surrounding code and 
learned how much work it takes to make these kinds of changes and then 
stabilize the result.

On the other hand, if you are working on a memory-related project that is 
making major changes, and these issues are getting in your way, then 
refactoring could well be the fastest way to achieve your project goals. Are 
you working on such a project?

> why was it necessary to introduce DrillBuf and/or UnsafeDirectLittleEndian? 
> What functionality do those classes provide that existing Netty classes do 
> not?
> IMO, it will be good to make DrillBuf code simpler and consistent.

Good questions!  So, it seems that DrillBuf turns out to be a bit of a muddle.
But, unless it is preventing us from making some desired change, or changing
things will provide a significant performance boost, I guess I'd just hold my
nose and leave it as is for now.

The same argument applies, by the way, to the value vectors. The value vector 
classes become almost entirely redundant once the row set mechanisms are 
adopted. But, I suspect that value vectors will live on anyway until there is a 
reason to do something with them.

> Question is whether or not to check that reference count is not zero every 
> time DrillBuf is used (see ensureAccessible()).

IMHO, there is no reason to check on every access, except in a "paranoid" debug 
mode. Removing the check might provide a nice little performance bump. Avoiding 
bounds and ref checks was one of the goals of the "unchecked" methods that I 
had in DrillBuf but which we decided to remove... If it turns out that the 
methods used by the row set abstractions do, in fact, do bounds checks, then 
this is a strong case to put the "unsafe" (unchecked) methods back.

> 5. Moving DrillBuf to a different package


I agree with your explanation. However, I assume the original authors were 
forced to put DrillBuf in the Netty package for some reason or other. If that 
reason is no longer valid, then it is usually pretty simple to have your IDE 
move the class to a different package and adjust all its references. This 
improvement, if possible, seems low cost and so might be worth doing.

On the other hand, if the move causes things to break, which causes effort to 
go into changing things, I guess I'd wonder why we can't just leave well enough 
alone and focus on things which are actually broken or could use a performance 
boost.

In short, I'm all for refactoring when it helps us deliver fixes or new 
features to customers. But, I'm struggling to see the user benefit in this 
case. Can you help me to understand the user benefit of these changes?

Thanks,

- Paul

 

On Saturday, April 7, 2018, 9:35:06 PM PDT, Vlad Rozov  
wrote:  
 
 Hi Paul,

My comments in-line.

Thank you,

Vlad

On 4/5/18 20:50, Paul Rogers wrote:
> Hi Vlad,
>>  I'd suggest to keep focus on DrillBuf design and implementation as the only 
>>gate for accessing raw (direct) memory.
> I was doing that. By explaining where DrillBuf fits in the overall design, we 
> see that DrillBuf should be the only access point for direct memory. The 
> context explains why this is the right decision. Changes to DrillBuf should 
> support our design as DrillBuf only exists for that purpose.
My concern is not why Drill adopted Netty as Netty does provide a good 
amount of functionality on top of Java networking and NIO. I also do not 
propose to replace Netty with something else. My primary focus for this 
thread is the design and implementation of the DrillBuf Java class 
itself. Namely, why was it necessary to introduce DrillBuf and/or 
UnsafeDirectLittleEndian? What functionality do those classes provide 
that existing Netty classes do not? Netty already provides memory 
pooling, reference counting, slicing, composite buffers, working with 
direct and heap memory. Looking at the DrillBuf.java git history, DrillBuf
was introduced in 2014; prior to that, Netty classes were used directly.
Unfortunately, the commit that introduced DrillBuf does
not provide any info why it was introduced and does not have a reference 
to a JIRA.

One may argue that DrillBuf is a way for Drill to encapsulate Netty
ByteBuf and guard other modules that use DrillBuf from the Netty ByteBuf
API, so that if Netty decides to change the ByteBuf API in the next major
release, the amount of changes will be limited to DrillBuf only. The problem
is that DrillBuf inherits from Netty's AbstractByteBuf, so the above goal

Re: Non-column filters in Drill

2018-04-08 Thread Paul Rogers
Hi Ryan,

There is an obscure, but very handy feature of Drill called table functions. 
[1] These allow you to set parameters of your format plugin as part of a query.

You mentioned a storage plugin. I've not tried a table function with a storage 
plugin. I have tested table functions with a format plugin.

Your format or storage plugin has a Jackson-serializable Java class. Normally 
you set the properties for your plugin in the Drill web console. But, these can 
also be set in the table function.

I had a use case something like yours. I defined an example "regex" plugin
where the user can specify a regular expression to apply to a text file to
parse columns. The user can then provide a list of column names. Using the
table function, I could specify the regex and column names per query.

This exercise did, however, point out two current limitations of table 
functions. First, they work only with simple data types (strings, ints). 
(DRILL-6169) So, my list of columns has to be a single string with a comma 
delimited list of columns. I could not use the more natural list of strings. 
Second, table functions do not retain the configured value of parameters: you 
have to include all parameters in the function, not just the ones you want to 
change. (DRILL-6168)
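
As a rough sketch of what such a query looks like (the plugin type and
parameter names follow my example regex plugin, so treat them as illustrative),
note that 'fields' is the single comma-delimited string mentioned above, and
that every parameter must be supplied, not just the one being changed:

  SELECT * FROM table(dfs.`/logs/app.log`(
    type => 'regex',
    regex => '(\d\d\d\d-\d\d-\d\d) (\w+) (.*)',
    fields => 'eventDate,level,message'));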

Yet another option is to set a session option. However, unless you do a bit of 
clever coding, format plugins don't have visibility to session options 
(DRILL-5181).

Perhaps your use case provides a compelling reason to fix some of these 
limitations...

Thanks,

- Paul

[1] 
https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters,
 see the section "Using the Formats Attributes as Table Function Parameters".


On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha 
 wrote:  
 
 A better option would be to have a user-defined function that takes 2
parameters and evaluates to a boolean value.
 e.g  select * from myTable where MyUDF(notColumn, 'value')  IS TRUE;

The Storage Plugin that you are developing would need to implement a
pushdown rule that looks at the filter condition and, if it contains
'MyUDF()', pushes it down to the scan/reader corresponding to your plugin.


On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri 
wrote:

> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to pushdown the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easier for us to guess what the issue could be if you can post
> the logical and physical query plans for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks 
> wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible
> in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table,
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell drill/calcite that the filter name is
> > valid, without actually registering it as a column. The following error
> > occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION
> ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
> >
>
  

Re: "Death of Schema-on-Read"

2018-04-08 Thread Paul Rogers
Hi Hanu,
Thanks! After sleeping on the idea, I realized that it can be generalized for 
any kind of expression. But, I also realized that the cast mechanism, by 
itself, cannot be a complete solution. Details posted in the JIRA for anyone 
who is interested.
Thanks,
- Paul

 

On Saturday, April 7, 2018, 8:05:34 AM PDT, Hanumath Rao Maduri 
 wrote:  
 
 Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu