[jira] [Created] (DRILL-6314) Add complex types to row set abstraction
Paul Rogers created DRILL-6314:
----------------------------------

             Summary: Add complex types to row set abstraction
                 Key: DRILL-6314
                 URL: https://issues.apache.org/jira/browse/DRILL-6314
             Project: Apache Drill
          Issue Type: Improvement
    Affects Versions: 1.14.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers

Next step in the [Record Batch Sizing|https://github.com/paul-rogers/drill/wiki] project: extend the row set abstractions with Union, List and Repeated List types.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Re: "Death of Schema-on-Read"
Inline

On Sun, Apr 8, 2018 at 12:36 PM, Aman Sinha wrote:

> On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning wrote:
> > [cast pushdown is cheaper than you might think]
>
> It is true that the amount of work done by the planner would be about the
> same as when determining projection pushdowns into the scan. In my mind I
> was contrasting with the pure DDL-based approach with an explicitly
> specified schema (such as with a 'CREATE EXTERNAL TABLE ...' or with
> per-query hints as Paul mentioned).

Actually, I think that these are equivalent in many ways. The only important difference is when the constraints on type are expressed. With DDL, they are expressed ahead of time. With cast pushdown, they are expressed at the last moment, in the query. With DDL, we traditionally assume that the reader has already taken the information into account before the query starts, but nothing really says that it has to. The information expressed in either case is essentially identical, however, and the reader should take heed. We are just allowing late typing if we add cast pushdown.

> ... keeping in mind that the same column may be referenced in multiple
> ways: e.g. CAST(a as varchar(10)), CAST(a as varchar(20)) in the same
> query/view. In such cases, we would want to either not do the pushdown or
> determine the highest common datatype and push that down.

I disagree. It is the job of the reader to return both variants if it accepts the pushdown, or to return the raw data and not do any casting. This is really quite similar to cases where the column is retrieved twice. If the reader can do something clever by reading the data once with one cast and then modifying the column after reading, that's great, but we can't really assume that it will.

> All of this, though, does not preclude the real need for the 'source of
> truth' of the schema for the cases where data has been already explored and
> curated. We do want to have a solution for that core issue.
> Why is the answer different from a view with casts (that might be pushed down)?
Re: "Death of Schema-on-Read"
On Sun, Apr 8, 2018 at 10:57 AM, Ted Dunning wrote:

> I have been thinking about this email and I still don't understand some of
> the comments.
>
> On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha wrote:
>
> > On the subject of CAST pushdown to Scans, there are potential drawbacks ...
> >
> > - In general, the planner will see a Scan-Project where the Project has
> >   CAST functions. But the Project can have arbitrary expressions, e.g.
> >   CAST(a as INT) * 5 or a combination of 2 CAST functions or non-CAST
> >   functions etc. It would be quite expensive to examine each expression
> >   (there could be hundreds) to determine whether it is eligible to be
> >   pushed to the Scan.
>
> How is this different than filter and project pushdown? There could be
> hundreds of those and it could be difficult for Calcite to find appropriate
> pushdowns. But I have never heard of any problem.
>
> - the traversal of all expressions is already required and already done in
>   order to find the set of columns that are being extracted. As such, cast
>   pushdown can be done in the same motions as project pushdown.

It is true that the amount of work done by the planner would be about the same as when determining projection pushdowns into the scan. In my mind I was contrasting with the pure DDL-based approach with an explicitly specified schema (such as with a 'CREATE EXTERNAL TABLE ...' or with per-query hints as Paul mentioned). However, in the absence of those, I agree that it would be a win to do the 'simple' CAST pushdowns, keeping in mind that the same column may be referenced in multiple ways: e.g. CAST(a as varchar(10)), CAST(a as varchar(20)) in the same query/view. In such cases, we would want to either not do the pushdown or determine the highest common datatype and push that down.

All of this, though, does not preclude the real need for the 'source of truth' of the schema for the cases where data has been already explored and curated.
We do want to have a solution for that core issue. -Aman
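Aman's "determine the highest common datatype and push that down" suggestion can be sketched as follows. This is an illustrative example only, not Drill planner code; the `SqlType` descriptor and the widening rules are invented for the sketch:

```java
// Illustrative sketch only -- not Drill planner code. Models Aman's idea of
// reconciling duplicate casts on one column, e.g. CAST(a AS VARCHAR(10)) and
// CAST(a AS VARCHAR(20)), into a single type to push down to the scan.
import java.util.List;

public class CastReconciler {

    // Minimal invented type descriptor: base type name plus precision
    // (e.g. VARCHAR length).
    record SqlType(String name, int precision) {}

    // Returns the single type to push down, or null when the casts are
    // incompatible and the pushdown should be skipped (Aman's fallback).
    static SqlType highestCommonType(List<SqlType> requested) {
        SqlType result = null;
        for (SqlType t : requested) {
            if (result == null) {
                result = t;
            } else if (result.name().equals(t.name())) {
                // Same base type: widen to the larger precision.
                result = new SqlType(result.name(),
                    Math.max(result.precision(), t.precision()));
            } else {
                return null; // e.g. VARCHAR vs. INT: do not push down.
            }
        }
        return result;
    }
}
```

Under Ted's alternative, the reader would instead accept both casts and return both variants (or return raw data), so no common type would need to be computed at plan time.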
Re: "Death of Schema-on-Read"
Hi Ted,

All good points. In the JIRA ticket, I mentioned that using casts to infer type is a special case of a general problem: inferring column type information in a downward traversal of the parse tree, then refining that type information with a bottom-up traversal of the DAG at run time. It turns out this problem is similar to one that the dynamic-language people are exploring to convert dynamic types to static types automatically. There is a paper reference in the ticket.

Our problem occurs, in general, with that bottom-up type inference. Because Drill is distributed, different branches of the tree may make different type inferences. (We discussed several examples.) We hope that higher nodes can reconcile conflicting decisions made by lower nodes, but we've noted cases where this is not possible. And we've noted cases where the scanner itself ends up making conflicting decisions (e.g. a long run of nulls followed by non-null data that reveals the type).

Further, schema is a property of the DATA, not the QUERY. But casts are a property of the query. So relying exclusively on a query to provide type information must be an incomplete and redundant solution. That said, if type information is available only in the query, we should certainly use it to resolve column type ambiguity. Cast is simply the most obvious source of inference. So better type inference is helpful, but neither necessary nor sufficient.

There is also the mismatch between SQL syntax and the desired simplicity of a hint. We want to provide a hint only for a troublesome column or two, but to mention one column in a SELECT clause, we must mention all of them; we can no longer use the wildcard (which, of course, shouldn't be used, but sure is handy during data exploration: the very situation where the type problem is most likely to occur).

Here it might be useful to recall the subtitle of the article: "Data governance and the death of schema on read."
The author's point is that, in a production system, schemas cannot be ad hoc and loosey-goosey (my terms), but rather must be well understood and agreed upon so that all parties agree on data structure and meaning. In Drill's world, this says that schemas (or hints) are discovered during data exploration, the results captured in some form of metadata, and then used by a wide number of users sitting in Tableau getting their day-to-day work done. In short, schema (and metadata in general) are needed to move from exploration into production.

I personally am receptive to this idea because of the many years spent building BI tools in which schema information provided a much more robust and simple user experience.

Thanks,
- Paul
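Paul's point about conflicting bottom-up type decisions can be made concrete with a toy sketch. The types and widening rules below are invented for illustration and are not Drill's actual type system: two scan branches guess different types for the same column, and a higher node must reconcile them, which is not always possible.

```java
// Toy illustration (not Drill code) of the bottom-up inference problem Paul
// describes: different branches of a distributed scan may infer different
// types, and the receiving node must reconcile them -- or fail when it cannot.
public class TypeReconcile {

    enum MinorType { INT, BIGINT, VARCHAR, NULLABLE_INT }

    // Invented rule: numeric types widen to the larger numeric type;
    // anything else is an irreconcilable conflict (a runtime schema change).
    static MinorType reconcile(MinorType left, MinorType right) {
        if (left == right) {
            return left;
        }
        if (isNumeric(left) && isNumeric(right)) {
            return MinorType.BIGINT;
        }
        // E.g. a long run of nulls guessed as NULLABLE_INT by one branch,
        // later meeting VARCHAR data from another: no common type exists.
        return null;
    }

    static boolean isNumeric(MinorType t) {
        return t == MinorType.INT || t == MinorType.BIGINT
            || t == MinorType.NULLABLE_INT;
    }
}
```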
Re: "Death of Schema-on-Read"
I have been thinking about this email and I still don't understand some of the comments.

On Fri, Apr 6, 2018 at 5:13 PM, Aman Sinha wrote:

> On the subject of CAST pushdown to Scans, there are potential drawbacks ...
>
> - In general, the planner will see a Scan-Project where the Project has
>   CAST functions. But the Project can have arbitrary expressions, e.g.
>   CAST(a as INT) * 5 or a combination of 2 CAST functions or non-CAST
>   functions etc. It would be quite expensive to examine each expression
>   (there could be hundreds) to determine whether it is eligible to be
>   pushed to the Scan.

How is this different than filter and project pushdown? There could be hundreds of those and it could be difficult for Calcite to find appropriate pushdowns. But I have never heard of any problem.

The reasons that I think that cast pushdown would be much easier include:

- for a first approximation, no type inference would be needed.
- because of the first point, only the roots of arithmetic expressions would need to be examined. If they have casts, then pushdown should be tried. If not, don't do it.
- cast pushdown is always a win if supported, so there is no large increase in the complexity of the cost-based optimization search space.
- the traversal of all expressions is already required and already done in order to find the set of columns that are being extracted. As such, cast pushdown can be done in the same motions as project pushdown.

> - Expressing Nullability is not possible with CAST. If a column should
>   be tagged as (not)nullable, CAST syntax does not allow that.

This may be true. But nullability crosses the cast cleanly. Thus, filter expressions like [x is not NULL] can be used to constrain nullability, and there is no requirement that the two constraints (the cast and the nullability) be near each other syntactically. Furthermore, if the query does not specify nullability, then the scanner is free to do so.
> - Drill currently supports CASTing to a SQL data type, but not to the
>   complex types such as arrays and maps. We would have to add support for
>   that from a language perspective as well as the run-time. This would be
>   non-trivial effort.

Well, there is a trivial subset of this effort in that casting a.b.c is easy to express. Anything more complex is hard for current scanners to use anyway. So deferring most of the work on complex types is a fine solution. It isn't like SQL has nice syntax for casting of anything.
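Ted's observation that "nullability crosses the cast cleanly" can be sketched with a small, illustrative Java example (not Drill's execution path): the nullability constraint and the cast live in different places in the query, yet compose into a non-nullable, typed result.

```java
// Illustrative only: models the point that a nullability constraint
// (WHERE x IS NOT NULL) and a cast (CAST(x AS INT)) need not be adjacent
// syntactically; applying the filter and then the cast yields a typed,
// guaranteed-non-null column.
import java.util.List;
import java.util.Objects;

public class CastAndNullability {

    // Filter out nulls, then cast each remaining value to an int. The cast
    // itself never expresses nullability; the filter handles that separately.
    static List<Integer> filterThenCast(List<String> rawColumn) {
        return rawColumn.stream()
            .filter(Objects::nonNull)   // WHERE x IS NOT NULL
            .map(Integer::parseInt)     // CAST(x AS INT)
            .toList();
    }
}
```

If the query specifies neither constraint, the scanner remains free to pick nullability itself, as Ted notes.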
Re: [DISCUSS] DrillBuf
Hi Vlad,

Thanks for the clarifications. My general comment is that it is always good to refactor things if that is the fastest way to achieve some goal. It is not clear, however, what the goal is here other than code improvement. Since Drill still has plenty of opportunities for improvement that will help users, it's a bit unclear why we'd clean up code just for its own sake if doing so will entail a large amount of work and provide no user benefit. I say this because I've done quite a bit of work in surrounding code and learned how much work it takes to make these kinds of changes and then stabilize the result.

On the other hand, if you are working on a memory-related project that is making major changes, and these issues are getting in your way, then refactoring could well be the fastest way to achieve your project goals. Are you working on such a project?

> why was it necessary to introduce DrillBuf and/or UnsafeDirectLittleEndian?
> What functionality do those classes provide that existing Netty classes do
> not?
> IMO, it will be good to make DrillBuf code simpler and consistent.

Good questions! So, it seems that DrillBuf turns out to be a bit of a muddle. But, unless it is preventing us from making some desired change, or changing things will provide a significant performance boost, I guess I'd just hold my nose and leave it as is for now.

The same argument applies, by the way, to the value vectors. The value vector classes become almost entirely redundant once the row set mechanisms are adopted. But I suspect that value vectors will live on anyway until there is a reason to do something with them.

> Question is whether or not to check that reference count is not zero every
> time DrillBuf is used (see ensureAccessible()).

IMHO, there is no reason to check on every access, except in a "paranoid" debug mode. Removing the check might provide a nice little performance bump.
Avoiding bounds and ref checks was one of the goals of the "unchecked" methods that I had in DrillBuf but which we decided to remove... If it turns out that the methods used by the row set abstractions do, in fact, do bounds checks, then this is a strong case to put the "unsafe" (unchecked) methods back.

> 5. Moving DrillBuf to a different package

I agree with your explanation. However, I assume the original authors were forced to put DrillBuf in the Netty package for some reason or other. If that reason is no longer valid, then it is usually pretty simple to have your IDE move the class to a different package and adjust all its references. This improvement, if possible, seems low cost and so might be worth doing. On the other hand, if the move causes things to break, which causes effort to go into changing things, I'd wonder why we can't just leave well enough alone and focus on things which are actually broken or could use a performance boost.

In short, I'm all for refactoring when it helps us deliver fixes or new features to customers. But I'm struggling to see the user benefit in this case. Can you help me to understand the user benefit of these changes?

Thanks,
- Paul

On Saturday, April 7, 2018, 9:35:06 PM PDT, Vlad Rozov wrote:

Hi Paul,

My comments in-line.

Thank you,
Vlad

On 4/5/18 20:50, Paul Rogers wrote:
> Hi Vlad,
>> I'd suggest to keep focus on DrillBuf design and implementation as the only
>> gate for accessing raw (direct) memory.
> I was doing that. By explaining where DrillBuf fits in the overall design, we
> see that DrillBuf should be the only access point for direct memory. The
> context explains why this is the right decision. Changes to DrillBuf should
> support our design as DrillBuf only exists for that purpose.

My concern is not why Drill adopted Netty, as Netty does provide a good amount of functionality on top of Java networking and NIO. I also do not propose to replace Netty with something else.
My primary focus for this thread is the design and implementation of the DrillBuf Java class itself. Namely, why was it necessary to introduce DrillBuf and/or UnsafeDirectLittleEndian? What functionality do those classes provide that existing Netty classes do not? Netty already provides memory pooling, reference counting, slicing, composite buffers, and working with direct and heap memory.

By looking at the DrillBuf.java git history, DrillBuf was introduced in 2014, and prior to that Netty classes were used directly. Unfortunately, the commit that introduced DrillBuf does not provide any info on why it was introduced and does not have a reference to a JIRA. One may argue that DrillBuf is a way for Drill to encapsulate Netty ByteBuf and guard other modules that use DrillBuf from the Netty ByteBuf API, so if Netty decides to change the ByteBuf API in the next major release, the amount of change will be limited to DrillBuf only. The problem is that DrillBuf inherits from Netty AbstractByteBuf, so the above goal
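The "paranoid debug mode" idea Paul raises for ensureAccessible() can be sketched as follows. This is a hypothetical illustration with invented names, not Drill's actual DrillBuf: the per-access reference-count check runs only when a debug flag is set, and is skipped in production for speed.

```java
// Hypothetical sketch (names invented; not Drill's actual DrillBuf) of the
// pattern under discussion: perform the reference-count check on every access
// only in a "paranoid" debug mode, and skip it otherwise.
public class CheckedBuf {

    // A real system might derive this from a system property or assertions;
    // a static final flag lets the JIT eliminate the dead branch entirely.
    static final boolean PARANOID = Boolean.getBoolean("buf.paranoid");

    private int refCount = 1;
    private final byte[] data;

    CheckedBuf(int size) { data = new byte[size]; }

    void retain()  { refCount++; }
    void release() { refCount--; }

    private void ensureAccessible() {
        if (refCount <= 0) {
            throw new IllegalStateException("buffer accessed after release");
        }
    }

    byte getByte(int index) {
        if (PARANOID) {
            ensureAccessible();  // the per-access check being debated
        }
        return data[index];      // JVM array bounds check still applies
    }
}
```

Because PARANOID is a compile-time constant when the property is unset, the check costs nothing in production, which is the performance bump Paul anticipates.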
Re: Non-column filters in Drill
Hi Ryan,

There is an obscure but very handy feature of Drill called table functions. [1] These allow you to set parameters of your format plugin as part of a query. You mentioned a storage plugin. I've not tried a table function with a storage plugin; I have tested table functions with a format plugin.

Your format or storage plugin has a Jackson-serializable Java class. Normally you set the properties for your plugin in the Drill web console. But these can also be set in the table function.

I had a use case something like yours. I defined an example "regex" plugin where the user can specify a regular expression to apply to a text file to parse columns. The user can then provide a list of column names. Using the table function, I could specify the regex and column names per query.

This exercise did, however, point out two current limitations of table functions. First, they work only with simple data types (strings, ints). (DRILL-6169) So my list of columns has to be a single string with a comma-delimited list of columns; I could not use the more natural list of strings. Second, table functions do not retain the configured value of parameters: you have to include all parameters in the function, not just the ones you want to change. (DRILL-6168)

Yet another option is to set a session option. However, unless you do a bit of clever coding, format plugins don't have visibility into session options (DRILL-5181).

Perhaps your use case provides a compelling reason to fix some of these limitations...

Thanks,
- Paul

[1] https://drill.apache.org/docs/plugin-configuration-basics/#using-the-formats-attributes-as-table-function-parameters, see the section "Using the Formats Attributes as Table Function Parameters".

On Saturday, April 7, 2018, 10:37:05 PM PDT, Aman Sinha wrote:

A better option would be to have a user-defined function that takes 2 parameters and evaluates to a boolean value.
e.g. select * from myTable where MyUDF(notColumn, 'value') IS TRUE;

The storage plugin that you are developing would need to implement a pushdown rule that looks at the filter condition and, if it contains 'MyUDF()', pushes it down to the scan/reader corresponding to your plugin.

On Sat, Apr 7, 2018 at 6:58 PM, Hanumath Rao Maduri wrote:

> Hello Ryan,
>
> Thank you for trying out Drill. Drill/Calcite expects "notColumn" to be
> supplied by the underlying scan.
> However, I expect that this column will be present in the scan but not past
> the filter (notColumn = 'value') in the plan.
> In that case you may need to push down the filter to the groupScan and then
> remove the column projections from your custom groupscan.
>
> It would be easy for us to guess what could be the issue if you can post
> the logical and physical query plans for this query.
>
> Hope this helps. Please do let us know if you have any further issues.
>
> Thanks,
>
> On Sat, Apr 7, 2018 at 2:08 PM, Ryan Shanks wrote:
>
> > Hi Drill Dev Team!
> >
> > I am writing a custom storage plugin and I am curious if it is possible in
> > Drill to pass a filter value, in the form of a where clause, that is not
> > related to a column. What I would like to accomplish is something like:
> >
> > select * from myTable where notColumn = 'value';
> >
> > In the example, notColumn is not a column in myTable, or any other table;
> > it is just a specific parameter that the storage plugin will use in the
> > filtering process. Additionally, notColumn would not be returned as a
> > column, so Drill needs to not expect it as a part of the 'select *'. I
> > created a rule that will push down and remove these non-column filter
> > calls, but I need to somehow tell Drill/Calcite that the filter name is
> > valid, without actually registering it as a column.
> > The following error occurs prior to submitting any rules:
> >
> > org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR:
> > From line 1, column 35 to line 1, column 39: Column 'notColumn' not found
> > in any table
> >
> > Alternatively, can I manipulate star queries to only return a subset of
> > all the columns for a table?
> >
> > Any insight would be greatly appreciated!
> >
> > Thanks,
> > Ryan
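The workaround Paul describes for DRILL-6169 can be sketched as a plugin-config fragment. The class and field names below are invented for illustration, not Drill's actual regex plugin: because table functions accept only simple types, the column list arrives as one comma-delimited string, and the config class splits it back into a list.

```java
// Hypothetical sketch of the DRILL-6169 workaround Paul mentions: a
// Jackson-serializable format-plugin config takes its column list as a
// single comma-delimited string (table functions cannot pass a list of
// strings) and reconstructs the list itself. Names here are invented.
import java.util.Arrays;
import java.util.List;

public class RegexFormatConfigSketch {

    public String regex;   // e.g. "(\\d+),(\\w+)" -- settable per query
    public String fields;  // e.g. "a,b,c" -- one string, not a list

    // Turn the single-string parameter back into the column list the
    // reader actually wants.
    public List<String> fieldList() {
        return Arrays.stream(fields.split(","))
            .map(String::trim)
            .toList();
    }
}
```

A table function could then set both properties per query, with the caveat Paul notes (DRILL-6168) that all parameters must be supplied, not just the changed ones.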
Re: "Death of Schema-on-Read"
Hi Hanu,

Thanks! After sleeping on the idea, I realized that it can be generalized for any kind of expression. But I also realized that the cast mechanism, by itself, cannot be a complete solution. Details posted in the JIRA for anyone who is interested.

Thanks,
- Paul

On Saturday, April 7, 2018, 8:05:34 AM PDT, Hanumath Rao Maduri wrote:

Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu