[jira] [Updated] (DRILL-6853) Parquet Complex Reader for nested schema should have configurable memory or max records to fetch
[ https://issues.apache.org/jira/browse/DRILL-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitin Sharma updated DRILL-6853:
--------------------------------
    Affects Version/s: 1.14.0

> Parquet Complex Reader for nested schema should have configurable memory or max records to fetch
> ------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-6853
>                 URL: https://issues.apache.org/jira/browse/DRILL-6853
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.14.0
>            Reporter: Nitin Sharma
>            Priority: Major
>
> The Parquet complex reader, while fetching nested schemas, should have configurable memory or a configurable maximum number of records to fetch, rather than defaulting to 4000 records.
> While scanning terabytes of data with wide columns, this default can easily cause OOM issues.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (DRILL-6854) Profiles page should provide more insights on parquet statistics for complex reader
Nitin Sharma created DRILL-6854:
-----------------------------------
             Summary: Profiles page should provide more insights on parquet statistics for complex reader
                 Key: DRILL-6854
                 URL: https://issues.apache.org/jira/browse/DRILL-6854
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Nitin Sharma

The Profiles page should provide more insights into Parquet statistics for the complex reader. For example, for the plain reader the operator metrics are good, but for the complex reader the operator metrics are always empty.
[jira] [Commented] (DRILL-6853) Parquet Complex Reader for nested schema should have configurable memory or max records to fetch
[ https://issues.apache.org/jira/browse/DRILL-6853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687560#comment-16687560 ]

Nitin Sharma commented on DRILL-6853:
-------------------------------------
[~vitalii] [~sachouche] Filing as per our discussion earlier today.
[jira] [Created] (DRILL-6853) Parquet Complex Reader for nested schema should have configurable memory or max records to fetch
Nitin Sharma created DRILL-6853:
-----------------------------------
             Summary: Parquet Complex Reader for nested schema should have configurable memory or max records to fetch
                 Key: DRILL-6853
                 URL: https://issues.apache.org/jira/browse/DRILL-6853
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Nitin Sharma

The Parquet complex reader, while fetching nested schemas, should have configurable memory or a configurable maximum number of records to fetch, rather than defaulting to 4000 records. While scanning terabytes of data with wide columns, this default can easily cause OOM issues.
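The request above amounts to replacing a hard-coded 4000-record batch limit with a lookup against a system option. A minimal sketch of that lookup, in Python for illustration; the option name `store.parquet.complex.batch_size` is hypothetical, and only the 4000-record default comes from the report:

```python
# Sketch of a configurable batch-size lookup, as this issue requests.
# The option name is hypothetical; the 4000 default is from the report.
DEFAULT_COMPLEX_READER_BATCH = 4000

def resolve_batch_size(options: dict) -> int:
    """Return the configured max records per batch, falling back to 4000."""
    value = options.get("store.parquet.complex.batch_size")
    if value is None:
        return DEFAULT_COMPLEX_READER_BATCH
    size = int(value)
    if size <= 0:
        raise ValueError("batch size must be positive")
    return size
```

With such a hook in place, an operator scanning wide columns could lower the batch size instead of hitting OOM at the fixed default.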
[jira] [Commented] (DRILL-6668) In Web Console, highlight options that are different from default values
[ https://issues.apache.org/jira/browse/DRILL-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687412#comment-16687412 ]

Kunal Khatua commented on DRILL-6668:
-------------------------------------
[~paul-rogers] Prototype: mousing over the {{Default}} button prompts with the message !screenshot-1.png! If the value is already the default, the {{Default}} button is disabled. Let me know if this is sufficient and I'll open a PR.

> In Web Console, highlight options that are different from default values
> ------------------------------------------------------------------------
>
>                 Key: DRILL-6668
>                 URL: https://issues.apache.org/jira/browse/DRILL-6668
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Web Server
>    Affects Versions: 1.14.0
>            Reporter: Paul Rogers
>            Assignee: Kunal Khatua
>            Priority: Minor
>             Fix For: 1.16.0
>
>         Attachments: screenshot-1.png
>
> Suppose you inherit a Drill setup created by someone else (or by you, some time in the past). Or, suppose you are a support person. You want to know which Drill options have been changed from the defaults.
> The Web UI conveniently displays all options. But there is no indication of which might have non-default values.
> After the improvements of the last year, the information needed to detect non-default values is now available. It would be great to mark these values, perhaps using colors, perhaps with words. For example:
> *planner.width.max_per_node* 200 \[Update]
> Or
> planner.width.max_per_node (system) 200 \[Update]
> (The Web UI does not, I believe, show session settings, since the Web UI has no sessions. I believe the custom values are all set by {{ALTER SYSTEM}}. Otherwise, we could also have a "(session)" suffix above.)
> Then, in addition to the {{[Update]}} button, for non-default values, also provide a {{[Reset]}} button that does the same as {{ALTER SESSION RESET}}:
> planner.width.max_per_node (session) 200 \[Update] \[Reset]
[jira] [Updated] (DRILL-6668) In Web Console, highlight options that are different from default values
[ https://issues.apache.org/jira/browse/DRILL-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khatua updated DRILL-6668:
--------------------------------
    Attachment: screenshot-1.png
[jira] [Commented] (DRILL-6668) In Web Console, highlight options that are different from default values
[ https://issues.apache.org/jira/browse/DRILL-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687290#comment-16687290 ]

Kunal Khatua commented on DRILL-6668:
-------------------------------------
I think a *DEFAULT* button with the value itself displayed (instead of the button label {{[DEFAULT]}} or {{[RESET]}}) might be an option to avoid the confusion. [~paul-rogers], which do you prefer?
[jira] [Commented] (DRILL-6847) Add Query Metadata to RESTful Interface
[ https://issues.apache.org/jira/browse/DRILL-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687287#comment-16687287 ]

ASF GitHub Bot commented on DRILL-6847:
---------------------------------------
kkhatua commented on issue #1539: DRILL-6847: Add Query Metadata to RESTful Interface
URL: https://github.com/apache/drill/pull/1539#issuecomment-438860831

@cgivre I didn't build and try this out, but I'm curious how we manage `select * from` queries, especially with schema changes, like fields being added or dropped between rows. Also, I agree with @arina-ielchiieva's point about not repeating the field names (unless a schema change requires it). But I'm not sure how badly we break backward compatibility (perhaps carry a `version` in the REST response?).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Add Query Metadata to RESTful Interface
> ---------------------------------------
>
>                 Key: DRILL-6847
>                 URL: https://issues.apache.org/jira/browse/DRILL-6847
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Metadata
>            Reporter: Charles Givre
>            Assignee: Charles Givre
>            Priority: Minor
>
> The Drill RESTful interface does not return the structure of the query results. This makes integrating Drill with other BI tools difficult because they do not know what kind of data to expect.
> This PR adds a new section to the results called Metadata which contains a list of the minor types of all the columns returned.
> The query below will now return the following in the RESTful interface:
> {code:sql}
> SELECT CAST(employee_id AS INT) AS employee_id,
>        full_name,
>        first_name,
>        last_name,
>        CAST(position_id AS BIGINT) AS position_id,
>        position_title
> FROM cp.`employee.json` LIMIT 2
> {code}
> {code}
> {
>   "queryId": "2414bf3f-b4f4-d4df-825f-73dfb3a56681",
>   "columns": [
>     "employee_id", "full_name", "first_name",
>     "last_name", "position_id", "position_title"
>   ],
>   "metadata": [
>     "INT", "VARCHAR", "VARCHAR", "VARCHAR", "BIGINT", "VARCHAR"
>   ],
>   "rows": [
>     {
>       "full_name": "Sheri Nowmer",
>       "employee_id": "1",
>       "last_name": "Nowmer",
>       "position_title": "President",
>       "first_name": "Sheri",
>       "position_id": "1"
>     },
>     {
>       "full_name": "Derrick Whelply",
>       "employee_id": "2",
>       "last_name": "Whelply",
>       "position_title": "VP Country Manager",
>       "first_name": "Derrick",
>       "position_id": "2"
>     }
>   ]
> }
> {code}
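The point of the new `metadata` section is that a client can now cast the string row values back to their declared types. An illustrative consumer of the response shape shown above (a sketch, not Drill's client API; the helper name is hypothetical):

```python
# Pair each column with its reported minor type and cast the string row
# values accordingly. Only INT/BIGINT/VARCHAR are mapped here, matching
# the types in the example response above.
CASTS = {"INT": int, "BIGINT": int, "VARCHAR": str}

def typed_rows(response: dict) -> list:
    """Cast each row's string values using the response's metadata section."""
    casts = {col: CASTS.get(t, str)
             for col, t in zip(response["columns"], response["metadata"])}
    return [{col: casts[col](val) for col, val in row.items()}
            for row in response["rows"]]
```

Applied to the example response, `employee_id` and `position_id` come back as integers rather than the strings `"1"` and `"2"`, which is exactly what a BI tool needs.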
[jira] [Updated] (DRILL-6668) In Web Console, highlight options that are different from default values
[ https://issues.apache.org/jira/browse/DRILL-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khatua updated DRILL-6668:
--------------------------------
    Fix Version/s: 1.16.0
[jira] [Assigned] (DRILL-6668) In Web Console, highlight options that are different from default values
[ https://issues.apache.org/jira/browse/DRILL-6668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kunal Khatua reassigned DRILL-6668:
-----------------------------------
    Assignee: Kunal Khatua
[jira] [Commented] (DRILL-6851) Create Drill Operator for Kubernetes
[ https://issues.apache.org/jira/browse/DRILL-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687269#comment-16687269 ]

Paul Rogers commented on DRILL-6851:
------------------------------------
[~agirish], my comment refers to using K8s to run a Drill cluster. (The work you've released thus far is for an embedded Drill.) For a Drill cluster, the operator should:
* Provide an API to add/remove Drillbits.
* Provide an API to monitor the cluster.
* Monitor Drillbits, restarting any Drillbit that fails.

K8s provides stateful sets for simple clusters, but once you create an operator, you need to manage things like pod count, pod restart, etc. The point here is simply that DoY already uses ZK to monitor Drill cluster health, and it already provides an API and UI for controlling the cluster. Of course, it does so in a YARN-like manner; some revision is needed to be K8s-like.

> Create Drill Operator for Kubernetes
> ------------------------------------
>
>                 Key: DRILL-6851
>                 URL: https://issues.apache.org/jira/browse/DRILL-6851
>             Project: Apache Drill
>          Issue Type: Task
>            Reporter: Abhishek Girish
>            Assignee: Abhishek Girish
>            Priority: Major
>
> This task is to track creating an initial version of the Drill Operator for Kubernetes. I'll shortly update the JIRA on background, details on Operator, and what's planned for the first version.
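The three responsibilities listed above are typically met by an operator's reconcile loop: compare the desired cluster state with the observed state and emit corrective actions. A hedged sketch of that idea, in Python for illustration; all names and states are illustrative, not taken from the actual drill-operator code:

```python
# Minimal reconcile-loop sketch for the responsibilities listed above:
# restart failed Drillbits, then add or remove bits to match the desired
# count. Pod names generated here are illustrative and may not be how a
# real operator names pods.
def reconcile(desired_drillbits: int, running: dict) -> list:
    """Return (action, pod-name) pairs to converge toward the desired state."""
    actions = []
    # Restart any Drillbit that has failed.
    for name, state in running.items():
        if state != "RUNNING":
            actions.append(("restart", name))
    # Scale up or down to match the requested Drillbit count.
    healthy = sum(1 for s in running.values() if s == "RUNNING")
    if healthy < desired_drillbits:
        actions.extend(("add", f"drillbit-{i}")
                       for i in range(healthy, desired_drillbits))
    elif healthy > desired_drillbits:
        extra = [n for n, s in sorted(running.items())
                 if s == "RUNNING"][desired_drillbits:]
        actions.extend(("remove", n) for n in extra)
    return actions
```

A real operator (whether Go with the Operator SDK, or Java reusing DoY's state machine) would run this loop continuously against the K8s API and ZK health data.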
[jira] [Commented] (DRILL-6851) Create Drill Operator for Kubernetes
[ https://issues.apache.org/jira/browse/DRILL-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687253#comment-16687253 ]

Abhishek Girish commented on DRILL-6851:
----------------------------------------
[~Paul.Rogers], I've already begun work in Go, so it's a challenge to borrow stuff from DoY. You can take a look at the early implementation here: https://github.com/Agirish/drill-operator

I already have Drill running in distributed mode on K8S using YAML definitions and also with Helm. This supports Drill in both embedded and distributed modes. That work can be accessed here: https://github.com/Agirish/drill-containers

I'm working on translating this into Go for the Drill Operator, the reason being that the operator would provide much more flexibility and Drill-specific customization. Also, the current plan is to use mostly K8S for cluster management. I'm using the Operator SDK, which is pretty good.
[jira] [Commented] (DRILL-6851) Create Drill Operator for Kubernetes
[ https://issues.apache.org/jira/browse/DRILL-6851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687098#comment-16687098 ]

Paul Rogers commented on DRILL-6851:
------------------------------------
Needless to say, we should borrow liberally from Drill-on-YARN. This code includes YARN integration (to be replaced by K8s integration here). DoY also includes a complete Drill cluster management state machine and UI, features that we'd want to preserve in Drill-on-K8s (DoK). The challenge is that K8s operators are often implemented in Go. It might make sense to do DoK in Java so we can leverage our existing code.
[jira] [Updated] (DRILL-6833) MapRDB queries with Run Time Filters with row_key/Secondary Index Should Support Pushdown
[ https://issues.apache.org/jira/browse/DRILL-6833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-6833:
------------------------------------
    Reviewer: Aman Sinha

> MapRDB queries with Run Time Filters with row_key/Secondary Index Should Support Pushdown
> -----------------------------------------------------------------------------------------
>
>                 Key: DRILL-6833
>                 URL: https://issues.apache.org/jira/browse/DRILL-6833
>             Project: Apache Drill
>          Issue Type: New Feature
>    Affects Versions: 1.15.0
>            Reporter: Gautam Parai
>            Assignee: Gautam Parai
>            Priority: Major
>              Labels: ready-to-commit
>             Fix For: 1.15.0
>
> Drill should push down all row key filters to MapR-DB for queries that only have WHERE conditions on row_keys. In the following example, the query only has a WHERE clause on row_keys:
> select t.mscIdentities from dfs.root.`/user/mapr/MixTable` t where t.row_key= (select max(convert_fromutf8(i.KeyA.ENTRY_KEY)) from dfs.root.`/user/mapr/TableIMSI` i where i.row_key='460021050005636')
> A row_key lookup can return at most one row, so the physical plan must leverage the MapR-DB row_key pushdown to execute the subquery, then use its result to execute the outer query. Currently only the inner query is pushed down; the outer query requires a table scan.
[jira] [Created] (DRILL-6852) Adapt current Parquet Metadata cache implementation to use Drill Metastore API
Volodymyr Vysotskyi created DRILL-6852:
------------------------------------------
             Summary: Adapt current Parquet Metadata cache implementation to use Drill Metastore API
                 Key: DRILL-6852
                 URL: https://issues.apache.org/jira/browse/DRILL-6852
             Project: Apache Drill
          Issue Type: Sub-task
            Reporter: Volodymyr Vysotskyi
            Assignee: Volodymyr Vysotskyi

According to the design document for DRILL-6552, the existing metadata cache API should be adapted to use the generalized Metastore API, and the Parquet metadata cache will be presented as an implementation of the Metastore API. The aim of this JIRA is to refactor the Parquet Metadata cache implementation and adapt it to use the Drill Metastore API.
[jira] [Updated] (DRILL-6852) Adapt current Parquet Metadata cache implementation to use Drill Metastore API
[ https://issues.apache.org/jira/browse/DRILL-6852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arina Ielchiieva updated DRILL-6852:
------------------------------------
    Fix Version/s: 1.16.0
[jira] [Created] (DRILL-6851) Create Drill Operator for Kubernetes
Abhishek Girish created DRILL-6851:
--------------------------------------
             Summary: Create Drill Operator for Kubernetes
                 Key: DRILL-6851
                 URL: https://issues.apache.org/jira/browse/DRILL-6851
             Project: Apache Drill
          Issue Type: Task
            Reporter: Abhishek Girish
            Assignee: Abhishek Girish

This task is to track creating an initial version of the Drill Operator for Kubernetes. I'll shortly update the JIRA with background, details on the Operator, and what's planned for the first version.
[jira] [Commented] (DRILL-6830) Hook.REL_BUILDER_SIMPLIFY handler didn't removed cause performance degression
[ https://issues.apache.org/jira/browse/DRILL-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686818#comment-16686818 ]

ASF GitHub Bot commented on DRILL-6830:
---------------------------------------
vvysotskyi commented on issue #1524: DRILL-6830: Remove Hook.REL_BUILDER_SIMPLIFY handler after use
URL: https://github.com/apache/drill/pull/1524#issuecomment-438724649

@lushuifeng, you are right that the problem connected with `TestCaseNullableTypes#testCaseNullableTypesVarchar` is in Calcite. The initial goal of adding `Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false));` was to avoid failures connected with the treat-empty-strings-as-null feature, but it looks like that was fixed in another place. The interesting thing is that the `TestCaseNullableTypes#testCaseNullableTypesVarchar` failure was fixed after Calcite 1.17, so when Drill is rebased onto Calcite 1.18, `Hook.REL_BUILDER_SIMPLIFY.add(Hook.propertyJ(false));` may be removed. @ihuzenko, since you are working on the Calcite rebase, could you please also take care of it?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Hook.REL_BUILDER_SIMPLIFY handler didn't removed cause performance degression
> -----------------------------------------------------------------------------
>
>                 Key: DRILL-6830
>                 URL: https://issues.apache.org/jira/browse/DRILL-6830
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.14.0
>            Reporter: shuifeng lu
>            Assignee: shuifeng lu
>            Priority: Major
>             Fix For: 1.15.0
>
>         Attachments: Screen Shot 2018-11-06 at 16.14.16.png
>
> Planning performance degradation has been observed: the duration of planning increased from 30 ms to 160 ms after running Drill for a long period of time (say, a month).
> RelBuilder.simplify never becomes true if Hook.REL_BUILDER_SIMPLIFY handlers are not removed.
> Here is some clue (after running 40 days). Hook.get takes 8 ms per invocation, and it may be called several times per query:
> {noformat}
> `---[8.816063ms] org.apache.calcite.tools.RelBuilder:<init>()
>     +---[0.020218ms] java.util.ArrayDeque:<init>()
>     +---[0.018493ms] java.lang.Boolean:valueOf()
>     +---[8.341566ms] org.apache.calcite.runtime.Hook:get()
>     +---[0.008489ms] java.lang.Boolean:booleanValue()
>     +---[min=5.21E-4ms,max=0.015832ms,total=0.025233ms,count=12] org.apache.calcite.plan.Context:unwrap()
>     +---[min=3.83E-4ms,max=0.009494ms,total=0.014516ms,count=13] org.apache.calcite.util.Util:first()
>     +---[0.006892ms] org.apache.calcite.plan.RelOptCluster:getPlanner()
>     +---[0.009104ms] org.apache.calcite.plan.RelOptPlanner:getExecutor()
>     +---[min=4.8E-4ms,max=0.002277ms,total=0.002757ms,count=2] org.apache.calcite.plan.RelOptCluster:getRexBuilder()
>     `---[min=4.91E-4ms,max=0.004586ms,total=0.005077ms,count=2] org.apache.calcite.rex.RexSimplify:<init>()
> {noformat}
> The top instances in the JVM:
> {noformat}
>  num     #instances         #bytes  class name
> ----------------------------------------------
>    1:        116333      116250440  [B
>    2:        890126      105084536  [C
>    3:        338062       37415944  [Ljava.lang.Object;
>    4:       1715004       27440064  org.apache.calcite.runtime.Hook$4
>    5:        803909       19293816  java.lang.String
> {noformat}
> !Screen Shot 2018-11-06 at 16.14.16.png!
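The 1.7 million `Hook$4` instances in the histogram are the accumulated handlers: each planning pass registers one and never deregisters it. The fix named in the PR title ("remove handler after use") relies on the handle returned by `Hook.add`, whose `close()` deregisters the callback. A Python sketch of that pattern; class and attribute names are illustrative, mirroring the Calcite idea rather than its actual Java API:

```python
# Sketch of a closeable hook registration: add() returns a handle whose
# close() deregisters the callback, so per-query handlers do not pile up
# in a global list (the leak this issue describes).
class Hook:
    def __init__(self):
        self._handlers = []

    def add(self, fn):
        self._handlers.append(fn)
        handlers = self._handlers

        class Closeable:
            def close(self):
                if fn in handlers:
                    handlers.remove(fn)

        return Closeable()

REL_BUILDER_SIMPLIFY = Hook()

# Per-query usage: register, plan, then close in a finally block so the
# handler list stays flat no matter how planning ends.
handle = REL_BUILDER_SIMPLIFY.add(lambda prop: False)
try:
    pass  # ... planning would run here ...
finally:
    handle.close()
```

Without the `close()` call, every query would leave one more entry behind, matching the steady growth seen after 40 days of uptime.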
[jira] [Updated] (DRILL-6850) JDBC integration tests failures
[ https://issues.apache.org/jira/browse/DRILL-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitalii Diravka updated DRILL-6850:
-----------------------------------
    Description:

The following command will run Drill integration tests for RDBMS (Derby and MySQL):
_mvn integration-test failsafe:integration-test -pl contrib/storage-jdbc_
Currently some drill/exec/store/jdbc TestJdbcPluginWithDerbyIT and TestJdbcPluginWithMySQLIT tests fail:
{code}
Results :

Failed tests:
  TestJdbcPluginWithDerbyIT.showTablesDefaultSchema:117 expected:<1> but was:<0>

Tests in error:
  TestJdbcPluginWithDerbyIT.describe » UserRemote VALIDATION ERROR: Unknown tabl...
  TestJdbcPluginWithDerbyIT.pushdownDoubleJoinAndFilter:111->PlanTestBase.testPlanMatchingPatterns:84->PlanTestBase.testPlanMatchingPatterns:89->PlanTestBase.getPlanInString:369->BaseTestQuery.testSqlWithResults:322->BaseTestQuery.testRunAndReturn:341 » Rpc
  TestJdbcPluginWithDerbyIT.testCrossSourceMultiFragmentJoin » UserRemote VALIDA...
  TestJdbcPluginWithDerbyIT.validateResult:71 » at position 0 column '`NUMERIC_...
  TestJdbcPluginWithMySQLIT.validateResult:108 » at position 0 column '`numeric...

Tests run: 14, Failures: 1, Errors: 5, Skipped: 0
{code}
Most likely these are old regressions.
Additionally, the NPE for an empty result is resolved: http://drill.apache.org/blog/2018/08/05/drill-1.14-released/#comment-4082559169

    was: (the same text, without the final sentence about the NPE fix)

> JDBC integration tests failures
> -------------------------------
>
>                 Key: DRILL-6850
>                 URL: https://issues.apache.org/jira/browse/DRILL-6850
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JDBC
>    Affects Versions: 1.14.0
>            Reporter: Vitalii Diravka
>            Priority: Major
>             Fix For: 1.15.0
[jira] [Updated] (DRILL-540) Allow querying hive views in drill
[ https://issues.apache.org/jira/browse/DRILL-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-540: --- Description: Currently Hive views cannot be queried from Drill. This Jira aims to add support for Hive views in Drill. *Implementation details:*
1. Drill persists its view metadata in files with the suffix .view.drill, using JSON format. For example:
{noformat}
{
  "name" : "view_from_calcite_1_4",
  "sql" : "SELECT * FROM `cp`.`store.json` WHERE `store_id` = 0",
  "fields" : [ {
    "name" : "*",
    "type" : "ANY",
    "isNullable" : true
  } ],
  "workspaceSchemaPath" : [ "dfs", "tmp" ]
}
{noformat}
Later Drill parses this metadata and uses it to treat view names in SQL as subqueries.
2. In Apache Hive, metadata about views is stored in a similar way to tables. Below is an example from metastore.TBLS:
{noformat}
TBL_ID |CREATE_TIME |DB_ID |LAST_ACCESS_TIME |OWNER |RETENTION |SD_ID |TBL_NAME |TBL_TYPE     |VIEW_EXPANDED_TEXT                          |
-------|------------|------|-----------------|------|----------|------|---------|-------------|--------------------------------------------|
2      |1542111078  |1     |0                |mapr  |0         |2     |cview    |VIRTUAL_VIEW |SELECT COUNT(*) FROM `default`.`customers`  |
{noformat}
3. So in the Hive metastore, views are considered tables of a special type. The main benefit is that the expanded SQL definition of each view is also available (just like in .view.drill files). Reading this metadata is already implemented in Drill via the thrift Metastore API.
4. To enable querying of Hive views we'll reuse the existing code for Drill views as much as possible. First, in *_HiveSchemaFactory.getDrillTable_*, for _*HiveReadEntry*_ we'll convert the metadata to an instance of _*View*_ (_the model for data persisted in .view.drill files_) and then, based on this instance, return a new _*DrillViewTable*_. With this approach Drill will handle Hive views the same way as if they were initially defined in Drill and persisted in a .view.drill file.
5. For conversion of Hive types from _*FieldSchema*_ to _*RelDataType*_ we'll reuse the existing code from _*DrillHiveTable*_; the conversion functionality will be extracted and used for both table and view field type conversions. > Allow querying hive views in drill > -- > > Key: DRILL-540 > URL: https://issues.apache.org/jira/browse/DRILL-540 > Project: Apache Drill > Issue Type: New Feature > Components: Storage - Hive >Reporter: Ramana Inukonda Nagaraj >Assignee: Igor Guzenko >Priority: Major > Labels:
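The view-as-subquery rewrite described in point 1 can be illustrated with a minimal sketch (Python purely for illustration; Drill itself is Java, and `expand_view` is a hypothetical helper, not a Drill API): given .view.drill-style JSON metadata, a query referencing the view name is rewritten to inline the view's SQL as a subquery.

```python
import json
import re

# Example .view.drill metadata, taken from the description above.
view_json = """
{
  "name" : "view_from_calcite_1_4",
  "sql" : "SELECT * FROM `cp`.`store.json` WHERE `store_id` = 0",
  "fields" : [ { "name" : "*", "type" : "ANY", "isNullable" : true } ],
  "workspaceSchemaPath" : [ "dfs", "tmp" ]
}
"""

def expand_view(query, view):
    """Replace each reference to the view name with its SQL inlined as a subquery."""
    pattern = r"\b" + re.escape(view["name"]) + r"\b"
    replacement = "({sql}) AS `{name}`".format(sql=view["sql"], name=view["name"])
    return re.sub(pattern, replacement, query)

view = json.loads(view_json)
query = "SELECT COUNT(*) FROM view_from_calcite_1_4"
expanded = expand_view(query, view)
print(expanded)
# SELECT COUNT(*) FROM (SELECT * FROM `cp`.`store.json` WHERE `store_id` = 0) AS `view_from_calcite_1_4`
```

The same mechanism works whether the metadata came from a .view.drill file or, per this Jira, from the Hive metastore.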
[jira] [Updated] (DRILL-540) Allow querying hive views in Drill
[ https://issues.apache.org/jira/browse/DRILL-540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arina Ielchiieva updated DRILL-540: --- Summary: Allow querying hive views in Drill (was: Allow querying hive views in drill) > Allow querying hive views in Drill > -- > > Key: DRILL-540 > URL: https://issues.apache.org/jira/browse/DRILL-540 > Project: Apache Drill > Issue Type: New Feature > Components: Storage - Hive >Reporter: Ramana Inukonda Nagaraj >Assignee: Igor Guzenko >Priority: Major > Labels: doc-impacting > Fix For: 1.16.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
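Points 2–3 of the DRILL-540 description can be sketched as a lookup against rows shaped like the metastore.TBLS sample (Python for illustration only; `view_sql` is a hypothetical helper, and real Drill code reads this metadata through the thrift Metastore API):

```python
# Rows shaped like the metastore.TBLS example: views are ordinary table rows
# whose TBL_TYPE is VIRTUAL_VIEW, carrying the expanded SQL definition.
TBLS = [
    # (TBL_ID, DB_ID, TBL_NAME, TBL_TYPE, VIEW_EXPANDED_TEXT)
    (1, 1, "customers", "MANAGED_TABLE", None),
    (2, 1, "cview", "VIRTUAL_VIEW", "SELECT COUNT(*) FROM `default`.`customers`"),
]

def view_sql(name):
    """Return the expanded SQL for a Hive view, or None for plain tables."""
    for _, _, tbl_name, tbl_type, expanded in TBLS:
        if tbl_name == name and tbl_type == "VIRTUAL_VIEW":
            return expanded
    return None

print(view_sql("cview"))
```

Because the expanded SQL definition is available exactly as in a .view.drill file, the conversion to a Drill _View_ instance needs no extra parsing of Hive-specific syntax.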
[jira] [Updated] (DRILL-6850) JDBC integration tests failures
[ https://issues.apache.org/jira/browse/DRILL-6850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalii Diravka updated DRILL-6850: --- Summary: JDBC integration tests failures (was: JDBC integration tests) > JDBC integration tests failures > --- > > Key: DRILL-6850 > URL: https://issues.apache.org/jira/browse/DRILL-6850 > Project: Apache Drill > Issue Type: Bug > Components: Storage - JDBC >Affects Versions: 1.14.0 >Reporter: Vitalii Diravka >Priority: Major > Fix For: 1.15.0 > > > The following command will run Drill integration tests for RDBMS (Derby and > MySQL): > _mvn integration-test failsafe:integration-test -pl contrib/storage-jdbc_ > Currently some drill/exec/store/jdbc TestJdbcPluginWithDerbyIT and > TestJdbcPluginWithMySQLIT tests fail: > {code} > Results : > Failed tests: > TestJdbcPluginWithDerbyIT.showTablesDefaultSchema:117 expected:<1> but > was:<0> > Tests in error: > TestJdbcPluginWithDerbyIT.describe » UserRemote VALIDATION ERROR: Unknown > tabl... > > TestJdbcPluginWithDerbyIT.pushdownDoubleJoinAndFilter:111->PlanTestBase.testPlanMatchingPatterns:84->PlanTestBase.testPlanMatchingPatterns:89->PlanTestBase.getPlanInString:369->BaseTestQuery.testSqlWithResults:322->BaseTestQuery.testRunAndReturn:341 > » Rpc > TestJdbcPluginWithDerbyIT.testCrossSourceMultiFragmentJoin » UserRemote > VALIDA... > TestJdbcPluginWithDerbyIT.validateResult:71 » at position 0 column > '`NUMERIC_... > TestJdbcPluginWithMySQLIT.validateResult:108 » at position 0 column > '`numeric... > Tests run: 14, Failures: 1, Errors: 5, Skipped: 0 > {code} > Most likely these are old regressions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
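For debugging, the failing classes above can be re-run individually from a Drill source checkout. A hedged sketch (the module path comes from the report; `-Dit.test` is the standard Maven Failsafe property for selecting integration tests, assumed to apply to this module):

```shell
# Run all JDBC storage plugin integration tests, as in the report:
mvn integration-test failsafe:integration-test -pl contrib/storage-jdbc

# Narrow to one failing test class:
mvn failsafe:integration-test -pl contrib/storage-jdbc \
    -Dit.test=TestJdbcPluginWithDerbyIT

# Or a single failing method:
mvn failsafe:integration-test -pl contrib/storage-jdbc \
    -Dit.test='TestJdbcPluginWithDerbyIT#showTablesDefaultSchema'
```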
[jira] [Created] (DRILL-6850) JDBC integration tests
Vitalii Diravka created DRILL-6850: -- Summary: JDBC integration tests Key: DRILL-6850 URL: https://issues.apache.org/jira/browse/DRILL-6850 Project: Apache Drill Issue Type: Bug Components: Storage - JDBC Affects Versions: 1.14.0 Reporter: Vitalii Diravka Fix For: 1.15.0 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr Vysotskyi updated DRILL-6744: --- Labels: doc-impacting ready-to-commit (was: doc-impacting) > Support filter push down for varchar / decimal data types > - > > Key: DRILL-6744 > URL: https://issues.apache.org/jira/browse/DRILL-6744 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.14.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.15.0 > > > Now that Drill uses Apache Parquet 1.10.0, where the issue with incorrectly > stored varchar / decimal min / max statistics is resolved, we should add > support for varchar / decimal filter push down. Only files created with > parquet lib 1.9.1 (1.10.0) and later will be subject to push down. If the > user knows that previously created files have correct min / max statistics > (i.e. knows that the data in binary columns is ASCII, not UTF-8), then > parquet.strings.signed-min-max.enabled can be set to true to enable filter > push down. > *Description* > _Note: Drill has been using the Parquet 1.10.0 library since version 1.13.0._ > *Varchar Partition Pruning* > Varchar pruning will work for files generated both before and after Parquet > 1.10.0, since partition pruning requires min and max values to be equal, and > there are no issues with incorrectly stored statistics for binary data when > min and max values are the same. Partition pruning using Drill metadata files > will also work, no matter when the metadata file was created (before or after > Drill 1.15.0). > Partition pruning won't work for files where the partition is null due to > PARQUET-1341; the issue will be fixed in Parquet 1.11.0. > *Varchar Filter Push Down* > Varchar filter push down will work for parquet files created with Parquet > 1.10.0 and later. 
> There are two options for enabling push down for files generated with prior > Parquet versions, when the user knows for certain that the binary data is ASCII (not > UTF-8): > 1. Set the configuration {{enableStringsSignedMinMax}} to true (false by default) > for the parquet format plugin: > {noformat} > "parquet" : { > type: "parquet", > enableStringsSignedMinMax: true > } > {noformat} > This applies to all parquet files of the given file plugin, including all > workspaces. > 2. To enable / disable reading binary statistics for > old parquet files per session, the session option > {{store.parquet.reader.strings_signed_min_max}} can be used. By default, it > has an empty string value. Setting this option takes priority over the config in > the parquet format plugin. The option allows three values: 'true', 'false', '' (empty > string). > _Note: store.parquet.reader.strings_signed_min_max can also be set at the system > level, in which case it applies to all parquet files in the system._ > The same config / session option applies to reading binary > statistics from Drill metadata files generated prior to Drill 1.15.0. If > a Drill metadata file was created prior to Drill 1.15.0 but for parquet files > created with Parquet library 1.10.0 and later, the user has to enable the > config / session option or regenerate the Drill metadata file with Drill 1.15.0 > or later, because from the metadata file we don't know whether statistics are > stored correctly (prior Drill versions were reading and writing binary > statistics by default, though did not use them). > When creating a Drill metadata file with Drill 1.15.0 and later for old parquet > files, the user should mind the config / session option. If strings_signed_min_max is > enabled, Drill will store binary statistics in the Drill metadata file, and > since the metadata file was created with Drill 1.15.0 and later, Drill will read > it back disregarding the option (assuming that if statistics are present in > the Drill metadata file, they are correct). 
If the user mistakenly enabled > strings_signed_min_max, it must be disabled and the Drill metadata > file regenerated. The same applies in the opposite direction: if the user created the metadata file while > strings_signed_min_max was disabled, no min / max values for binary > statistics will be written, and thus none read back, even if > strings_signed_min_max is enabled while reading the metadata. > *Decimal Partition Pruning* > Decimal values can be represented in four logical types: int_32, int_64, > fixed_len_byte_array and binary. > Partition pruning will work for all logical types for old and new decimal > files, i.e. those created both before and after Parquet 1.10.0. Partition pruning > won't work for files with a null partition due to PARQUET-1341, which will be > fixed in Parquet 1.11.0. >
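For per-session control, the option discussed above can be toggled with Drill's standard session commands. A brief sketch (the option name and allowed values are taken from the description above; verify against your Drill version before relying on it):

```sql
-- Allow reading binary min / max statistics for old parquet files in this session only
ALTER SESSION SET `store.parquet.reader.strings_signed_min_max` = 'true';

-- Return to the default (empty string), deferring to the parquet format plugin config
ALTER SESSION RESET `store.parquet.reader.strings_signed_min_max`;
```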
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686540#comment-16686540 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233458212 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/project/TestSchemaSmoothing.java ## @@ -0,0 +1,681 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan.project; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotEquals; +import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; + +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.impl.protocol.SchemaTracker; +import org.apache.drill.exec.physical.impl.scan.ScanTestUtils; +import org.apache.drill.exec.physical.impl.scan.project.NullColumnBuilder; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedColumn; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedTuple.ResolvedRow; +import org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection; +import org.apache.drill.exec.physical.impl.scan.project.ScanSchemaOrchestrator; +import org.apache.drill.exec.physical.impl.scan.project.ScanSchemaOrchestrator.ReaderSchemaOrchestrator; +import org.apache.drill.exec.physical.impl.scan.project.SchemaSmoother; +import org.apache.drill.exec.physical.impl.scan.project.SchemaSmoother.IncompatibleSchemaException; +import org.apache.drill.exec.physical.impl.scan.project.SmoothingProjection; +import org.apache.drill.exec.physical.impl.scan.project.WildcardSchemaProjection; +import org.apache.drill.exec.physical.rowSet.ResultSetLoader; +import org.apache.drill.exec.physical.rowSet.impl.RowSetTestUtils; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.test.SubOperatorTest; +import org.apache.drill.test.rowSet.RowSet.SingleRowSet; +import org.apache.drill.test.rowSet.RowSetComparison; +import org.apache.drill.test.rowSet.schema.SchemaBuilder; +import org.junit.Test; + +/** + * Tests schema smoothing at the schema projection level. + * This level handles reusing prior types when filling null + * values. 
But, because no actual vectors are involved, it + * does not handle the schema chosen for a table ahead of + * time, only the schema as it is merged with prior schema to + * detect missing columns. + * + * Focuses on the SmoothingProjection class itself. + * + * Note that, at present, schema smoothing does not work for entire + * maps. That is, if file 1 has, say {a: {b: 10, c: "foo"}} + * and file 2 has, say, {a: null}, then schema smoothing does + * not currently know how to recreate the map. The same is true of + * lists and unions. Handling such cases is complex and is probably + * better handled via a system that allows the user to specify their + * intent by providing a schema to apply to the two files. + */ + +public class TestSchemaSmoothing extends SubOperatorTest { + + /** + * Low-level test of the smoothing projection, including the exceptions + * it throws when things are not going its way. + */ + + @Test + public void testSmoothingProjection() { +final ScanLevelProjection scanProj = new ScanLevelProjection( +RowSetTestUtils.projectAll(), +ScanTestUtils.parsers()); + +// Table 1: (a: nullable bigint, b) + +final TupleMetadata schema1 = new SchemaBuilder() +.addNullable("a", MinorType.BIGINT) +.addNullable("b", MinorType.VARCHAR) +.add("c", MinorType.FLOAT8) +.buildSchema(); +ResolvedRow priorSchema; +{ + final NullColumnBuilder builder = new NullColumnBuilder(null, false); + final ResolvedRow rootTuple = new ResolvedRow(builder); + new WildcardSchemaProjection( + scanProj, schema1, rootTuple, + ScanTestUtils.resolvers()); + priorSchema = rootTuple; +} + +// Table 2: (a:
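The smoothing idea this test exercises — reuse the prior schema's types when a later file omits a column, so downstream operators see no schema change — can be sketched independently of Drill's classes. The class and method names below are illustrative only, not Drill's actual API; types are modeled as plain strings for brevity:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of schema smoothing: when a new table schema omits a
// column seen earlier, keep the prior column's type so the missing column
// can be filled with typed nulls instead of signaling a schema change.
public class SmoothingSketch {

    public static LinkedHashMap<String, String> smooth(
            LinkedHashMap<String, String> priorSchema,   // column name -> type
            LinkedHashMap<String, String> tableSchema) { // column name -> type
        LinkedHashMap<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> prior : priorSchema.entrySet()) {
            String newType = tableSchema.get(prior.getKey());
            if (newType == null) {
                // Missing column: reuse the prior type; values become nulls.
                out.put(prior.getKey(), prior.getValue());
            } else if (newType.equals(prior.getValue())) {
                out.put(prior.getKey(), newType);
            } else {
                // Type conflict: smoothing is impossible; a real implementation
                // would signal a hard schema change here.
                throw new IllegalStateException(
                    "incompatible schema for column " + prior.getKey());
            }
        }
        // Columns new to this table would also force a schema change in a
        // real implementation; omitted here for brevity.
        return out;
    }
}
```

As the test's javadoc notes, this per-column approach breaks down for entire maps, lists, and unions, where the prior structure cannot be recreated from a null.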
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686559#comment-16686559 ] ASF GitHub Bot commented on DRILL-6744: --- arina-ielchiieva commented on issue #1537: DRILL-6744: Support varchar and decimal push down URL: https://github.com/apache/drill/pull/1537#issuecomment-438675122 @vvysotskyi addressed code review comments: 1. Used VersionUtil from Hadoop lib. 2. Made ParquetReaderConfig immutable. 3. Added trace logging instead of removing the default in the switch. 4. Made other changes as requested. Thanks for the code review. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support filter push down for varchar / decimal data types > - > > Key: DRILL-6744 > URL: https://issues.apache.org/jira/browse/DRILL-6744 > Project: Apache Drill > Issue Type: Improvement > Affects Versions: 1.14.0 > Reporter: Arina Ielchiieva > Assignee: Arina Ielchiieva > Priority: Major > Labels: doc-impacting > Fix For: 1.15.0 > > > Now that Drill uses Apache Parquet 1.10.0, where the issue with incorrectly > stored varchar / decimal min / max statistics is resolved, we should add > support for varchar / decimal filter push down. Only files created with > parquet lib 1.9.1 (1.10.0) and later will be subject to push down. In > cases where the user knows that previously created files have correct min / max > statistics (i.e. the user knows for certain that data in binary columns is ASCII (not > UTF-8)), parquet.strings.signed-min-max.enabled can be set to true to > enable filter push down. 
> *Description* > _Note: Drill has been using the Parquet 1.10.0 library since version 1.13.0._ > *Varchar Partition Pruning* > Varchar pruning will work for files generated both before and after Parquet 1.10.0, > since to enable partition pruning both min and max values must be > the same, and there are no issues with incorrectly stored statistics for > binary data when the min and max values are the same. Partition pruning using Drill > metadata files will also work, no matter when the metadata file was created > (before or after Drill 1.15.0). > Partition pruning won't work for files where the partition is null due to > PARQUET-1341; the issue will be fixed in Parquet 1.11.0. > *Varchar Filter Push Down* > Varchar filter push down will work for parquet files created with Parquet > 1.10.0 and later. > There are two options for enabling push down for files generated with prior > Parquet versions, when the user knows for certain that the binary data is ASCII (not > UTF-8): > 1. Set the configuration {{enableStringsSignedMinMax}} to true (false by default) > for the parquet format plugin: > {noformat} > "parquet" : { > type: "parquet", > enableStringsSignedMinMax: true > } > {noformat} > This applies to all parquet files of the given file plugin, including all > workspaces. > 2. To enable / disable reading binary statistics for > old parquet files per session, the session option > {{store.parquet.reader.strings_signed_min_max}} can be used. By default, it > has an empty string value. Setting this option takes priority over the config in > the parquet format plugin. The option allows three values: 'true', 'false', '' (empty > string). > _Note: store.parquet.reader.strings_signed_min_max can also be set at the system > level, in which case it applies to all parquet files in the system._ > The same config / session option applies to reading binary > statistics from Drill metadata files generated prior to Drill 1.15.0. 
If > a Drill metadata file was created prior to Drill 1.15.0 but for parquet files > created with Parquet library 1.10.0 and later, the user has to enable the > config / session option or regenerate the Drill metadata file with Drill 1.15.0 > or later, because from the metadata file we don't know whether statistics are > stored correctly (prior Drill versions were reading and writing binary > statistics by default, though did not use them). > When creating a Drill metadata file with Drill 1.15.0 and later for old parquet > files, the user should mind the config / session option. If strings_signed_min_max is > enabled, Drill will store binary statistics in the Drill metadata file, and > since the metadata file was created with Drill 1.15.0 and later, Drill will read > it back disregarding the option (assuming that if statistics are present in > the Drill metadata file, they are correct). If the user mistakenly enabled > strings_signed_min_max, they need to disable it and
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686537#comment-16686537 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233454765 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/ScanTestUtils.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan; + +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedColumn; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedTuple; +import org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection.ScanProjectionParser; +import org.apache.drill.exec.physical.impl.scan.project.SchemaLevelProjection.SchemaProjectionResolver; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.record.metadata.TupleSchema; +import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList; + +public class ScanTestUtils { + + /** + * Type-safe way to define a list of parsers. + * @param parsers + * @return + */ + + public static List parsers(ScanProjectionParser... parsers) { +return ImmutableList.copyOf(parsers); Review comment: Consider using java built-in utils rather than guava: here and below. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Merge scan projection framework into master > --- > > Key: DRILL-6791 > URL: https://issues.apache.org/jira/browse/DRILL-6791 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.15.0 > > > Merge the next set of "result set loader" code into master via a PR. This one > covers the "schema projection" mechanism which: > * Handles none (SELECT COUNT\(*)), some (SELECT a, b, x) and all (SELECT *) > projection. > * Handles null columns (for projection a column "x" that does not exist in > the base table.) 
> * Handles constant columns as used for file metadata (AKA "implicit" columns). > * Handles schema persistence: the need to reuse the same vectors across > different scanners > * Provides a framework for consuming externally-supplied metadata > * Since we don't yet have a way to provide "real" metadata, obtains metadata > hints from previous batches and from the projection list (a.b implies that > "a" is a map, c[0] implies that "c" is an array, etc.) > * Handles merging the set of data source columns and null columns to create > the final output batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
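Conceptually, the scan-level projection step described above classifies each requested column against the table schema: matched columns come from the data source, unmatched ones become null columns to be materialized. A minimal, self-contained sketch of that classification (plain Java; the class and method names are illustrative, not Drill's actual API):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: classifies a projection list against a known table schema.
// Columns present in the table are "TABLE" columns; the rest become "NULL"
// columns that the scan must fill with typed null vectors.
public class ProjectionSketch {

    public static Map<String, String> classify(List<String> projectList,
                                               Set<String> tableSchema) {
        Map<String, String> result = new LinkedHashMap<>();
        if (projectList.contains("*")) {
            // SELECT *: every table column is projected as-is.
            for (String col : tableSchema) {
                result.put(col, "TABLE");
            }
            return result;
        }
        for (String col : projectList) {
            result.put(col, tableSchema.contains(col) ? "TABLE" : "NULL");
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> schema = new LinkedHashSet<>(Arrays.asList("a", "b"));
        // SELECT a, b, x against a table with columns (a, b):
        System.out.println(classify(Arrays.asList("a", "b", "x"), schema));
        // {a=TABLE, b=TABLE, x=NULL} -- "x" must be materialized as a null column
    }
}
```

In the real framework this result also carries types and vector sources, and implicit (file metadata) columns form a third class resolved at the root tuple.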
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686536#comment-16686536 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233453077 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/ResolvedTuple.java ## @@ -0,0 +1,427 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan.project; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.VectorContainer; +import org.apache.drill.exec.memory.BufferAllocator; +import org.apache.drill.exec.physical.rowSet.ResultVectorCache; +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode; +import org.apache.drill.exec.vector.UInt4Vector; +import org.apache.drill.exec.vector.ValueVector; +import org.apache.drill.exec.vector.complex.AbstractMapVector; +import org.apache.drill.exec.vector.complex.MapVector; +import org.apache.drill.exec.vector.complex.RepeatedMapVector; + +import org.apache.drill.shaded.guava.com.google.common.annotations.VisibleForTesting; + +/** + * Drill rows are made up of a tree of tuples, with the row being the root + * tuple. Each tuple contains columns, some of which may be maps. This + * class represents each row or map in the output projection. + * + * Output columns within the tuple can be projected from the data source, + * might be null (requested columns that don't match a data source column) + * or might be a constant (such as an implicit column.) This class + * orchestrates assembling an output tuple from a collection of these + * three column types. (Though implicit columns appear only in the root + * tuple.) + * + * Null Handling + * + * The project list might reference a "missing" map if the project list + * includes, say, SELECT a.b.c but `a` does not exist + * in the data source. In this case, the column a is implied to be a map, + * so the projection mechanism will create a null map for `a` + * and `b`, and will create a null column for `c`. + * + * To accomplish this recursive null processing, each tuple is associated + * with a null builder. (The null builder can be null if projection is + * implicit with a wildcard; in such a case no null columns can occur. 
+ * But, even here, with schema persistence, a SELECT * query + * may need null columns if a second file does not contain a column + * that appeared in a first file.) + * + * The null builder is bound to each tuple to allow vector persistence + * via the result vector cache. If we must create a null column + * `x` in two different readers, then the rules of Drill + * require that the same vector be used for both (or else a schema + * change is signaled.) The vector cache works by name (and type). + * Since maps may contain columns with the same names as other maps, + * the vector cache must be associated with each tuple. And, by extension, + * the null builder must also be associated with each tuple. + * + * Lifecycle + * + * The lifecycle of a resolved tuple is: + * + * The projection mechanism creates the output tuple, and its columns, + * by comparing the project list against the table schema. The result is + * a set of table, null, or constant columns. + * Once per schema change, the resolved tuple creates the output + * tuple by linking to vectors in their original locations. As it turns out, + * we can simply share the vectors; we don't need to transfer the buffers. + * To prepare for the transfer, the tuple asks the null column builder + * (if present) to build the required null columns. + * Once the output tuple is built, it can be used for any number of + * batches without further work. (The same vectors appear in the various inputs + * and the output, eliminating the need for any transfers.) + * Once per batch, the client must set the row count. This is needed for the + * output container, and for any "null" maps
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686539#comment-16686539 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233455585 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/project/TestNullColumnLoader.java ## @@ -0,0 +1,329 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan.project; + +import static org.junit.Assert.assertSame; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.physical.impl.scan.project.NullColumnBuilder; +import org.apache.drill.exec.physical.impl.scan.project.NullColumnLoader; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedNullColumn; +import org.apache.drill.exec.physical.rowSet.ResultVectorCache; +import org.apache.drill.exec.physical.rowSet.impl.NullResultVectorCacheImpl; +import org.apache.drill.exec.physical.rowSet.impl.ResultVectorCacheImpl; +import org.apache.drill.exec.record.BatchSchema; +import org.apache.drill.exec.record.VectorContainer; +import org.apache.drill.exec.vector.ValueVector; +import org.apache.drill.test.SubOperatorTest; +import org.apache.drill.test.rowSet.RowSet.SingleRowSet; +import org.apache.drill.test.rowSet.schema.SchemaBuilder; +import org.apache.drill.test.rowSet.RowSetComparison; +import org.junit.Test; + +/** + * Test the mechanism that handles all-null columns during projection. + * An all-null column is one projected in the query, but which does + * not actually exist in the underlying data source (or input + * operator.) + * + * In anticipation of having type information, this mechanism + * can create the classic nullable Int null column, or one of + * any other type and mode. + */ + +public class TestNullColumnLoader extends SubOperatorTest { + + private ResolvedNullColumn makeNullCol(String name, MajorType nullType) { + +// For this test, we don't need the projection, so just +// set it to null. 
+ +return new ResolvedNullColumn(name, nullType, null, 0); + } + + private ResolvedNullColumn makeNullCol(String name) { +return makeNullCol(name, null); + } + + /** + * Test the simplest case: default null type, nothing in the vector + * cache. Specify no column type, the special NULL type, or a + * predefined type. Output types should be set accordingly. + */ + + @Test + public void testBasics() { + +final List defns = new ArrayList<>(); +defns.add(makeNullCol("unspecified", null)); +defns.add(makeNullCol("nullType", Types.optional(MinorType.NULL))); +defns.add(makeNullCol("specifiedOpt", Types.optional(MinorType.VARCHAR))); +defns.add(makeNullCol("specifiedReq", Types.required(MinorType.VARCHAR))); +defns.add(makeNullCol("specifiedArray", Types.repeated(MinorType.VARCHAR))); + +final ResultVectorCache cache = new NullResultVectorCacheImpl(fixture.allocator()); +final NullColumnLoader staticLoader = new NullColumnLoader(cache, defns, null, false); + +// Create a batch + +final VectorContainer output = staticLoader.load(2); + +// Verify values and types + +final BatchSchema expectedSchema = new SchemaBuilder() +.add("unspecified", NullColumnLoader.DEFAULT_NULL_TYPE) +.add("nullType", NullColumnLoader.DEFAULT_NULL_TYPE) +.addNullable("specifiedOpt", MinorType.VARCHAR) +.addNullable("specifiedReq", MinorType.VARCHAR) +.addArray("specifiedArray", MinorType.VARCHAR) +.build(); +final SingleRowSet expected = fixture.rowSetBuilder(expectedSchema) +.addRow(null, null, null, null, new String[] {}) +.addRow(null, null, null, null, new String[] {}) +.build(); + +new
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686543#comment-16686543 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233452637 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/ResolvedTuple.java ## @@ -0,0 +1,427 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan.project; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.VectorContainer; +import org.apache.drill.exec.memory.BufferAllocator; +import org.apache.drill.exec.physical.rowSet.ResultVectorCache; +import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode; +import org.apache.drill.exec.vector.UInt4Vector; +import org.apache.drill.exec.vector.ValueVector; +import org.apache.drill.exec.vector.complex.AbstractMapVector; +import org.apache.drill.exec.vector.complex.MapVector; +import org.apache.drill.exec.vector.complex.RepeatedMapVector; + +import org.apache.drill.shaded.guava.com.google.common.annotations.VisibleForTesting; + +/** + * Drill rows are made up of a tree of tuples, with the row being the root + * tuple. Each tuple contains columns, some of which may be maps. This + * class represents each row or map in the output projection. + * + * Output columns within the tuple can be projected from the data source, + * might be null (requested columns that don't match a data source column) + * or might be a constant (such as an implicit column.) This class + * orchestrates assembling an output tuple from a collection of these + * three column types. (Though implicit columns appear only in the root + * tuple.) + * + * Null Handling + * + * The project list might reference a "missing" map if the project list + * includes, say, SELECT a.b.c but `a` does not exist + * in the data source. In this case, the column a is implied to be a map, + * so the projection mechanism will create a null map for `a` + * and `b`, and will create a null column for `c`. + * + * To accomplish this recursive null processing, each tuple is associated + * with a null builder. (The null builder can be null if projection is + * implicit with a wildcard; in such a case no null columns can occur. 
+ * But, even here, with schema persistence, a SELECT * query + * may need null columns if a second file does not contain a column + * that appeared in a first file.) + * + * The null builder is bound to each tuple to allow vector persistence + * via the result vector cache. If we must create a null column + * `x` in two different readers, then the rules of Drill + * require that the same vector be used for both (or else a schema + * change is signaled.) The vector cache works by name (and type). + * Since maps may contain columns with the same names as other maps, + * the vector cache must be associated with each tuple. And, by extension, + * the null builder must also be associated with each tuple. + * + * Lifecycle + * + * The lifecycle of a resolved tuple is: + * + * The projection mechanism creates the output tuple, and its columns, + * by comparing the project list against the table schema. The result is + * a set of table, null, or constant columns. + * Once per schema change, the resolved tuple creates the output + * tuple by linking to vectors in their original locations. As it turns out, + * we can simply share the vectors; we don't need to transfer the buffers. + * To prepare for the transfer, the tuple asks the null column builder + * (if present) to build the required null columns. + * Once the output tuple is built, it can be used for any number of + * batches without further work. (The same vectors appear in the various inputs + * and the output, eliminating the need for any transfers.) + * Once per batch, the client must set the row count. This is needed for the + * output container, and for any "null" maps
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686542#comment-16686542 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233451408 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/ResolvedColumn.java ## @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.scan.project; + +import org.apache.drill.exec.record.MaterializedField; + +/** + * A resolved column has a name, and a specification for how to project + * data from a source vector to a vector in the final output container. + * Describes the projection of a single column from + * an input to an output batch. + * + * Although the table schema mechanism uses the newer "metadata" + * mechanism, resolved columns revert back to the original + * {@link MajorType} and {@link MaterializedField} mechanism used + * by the rest of Drill. 
Doing so loses a bit of additional + * information, but at present there is no way to export that information + * along with a serialized record batch; each operator must rediscover + * it after deserialization. + */ + +public abstract class ResolvedColumn implements ColumnProjection { + + public final VectorSource source; + public final int sourceIndex; + + public ResolvedColumn(VectorSource source, int sourceIndex) { +this.source = source; +this.sourceIndex = sourceIndex; + } + + public VectorSource source() { return source; } + + public int sourceIndex() { return sourceIndex; } + + /** + * Return the type of this column. Used primarily by the schema smoothing + * mechanism. + * + * @return Review comment: Move description here to avoid warning. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Merge scan projection framework into master > --- > > Key: DRILL-6791 > URL: https://issues.apache.org/jira/browse/DRILL-6791 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.15.0 > > > Merge the next set of "result set loader" code into master via a PR. This one > covers the "schema projection" mechanism which: > * Handles none (SELECT COUNT\(*)), some (SELECT a, b, x) and all (SELECT *) > projection. > * Handles null columns (for projection a column "x" that does not exist in > the base table.) > * Handles constant columns as used for file metadata (AKA "implicit" columns). 
> * Handle schema persistence: the need to reuse the same vectors across > different scanners > * Provides a framework for consuming externally-supplied metadata > * Since we don't yet have a way to provide "real" metadata, obtains metadata > hints from previous batches and from the projection list (a.b implies that > "a" is a map, c[0] implies that "c" is an array, etc.) > * Handles merging the set of data source columns and null columns to create > the final output batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
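The metadata-hint idea in the bullet list above (a.b implies that "a" is a map, c[0] implies that "c" is an array) can be illustrated with a toy parser. This is hypothetical code, not Drill's actual projection parser:

```java
// Toy inference of type hints from projection-list syntax.
public class ProjectionHintSketch {

  static String hintFor(String projectedItem) {
    int dot = projectedItem.indexOf('.');
    int bracket = projectedItem.indexOf('[');
    if (dot > 0) {
      // "a.b": a member reference implies the outer column is a map.
      return projectedItem.substring(0, dot) + " is a map";
    }
    if (bracket > 0) {
      // "c[0]": an index reference implies the column is an array.
      return projectedItem.substring(0, bracket) + " is an array";
    }
    return projectedItem + " has no hint";
  }

  public static void main(String[] args) {
    System.out.println(hintFor("a.b"));   // a is a map
    System.out.println(hintFor("c[0]"));  // c is an array
    System.out.println(hintFor("x"));     // x has no hint
  }
}
```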
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686535#comment-16686535 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233454585 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/ScanTestUtils.java ## @@ -0,0 +1,65 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan; + +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.common.types.Types; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedColumn; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedTuple; +import org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection.ScanProjectionParser; +import org.apache.drill.exec.physical.impl.scan.project.SchemaLevelProjection.SchemaProjectionResolver; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.record.metadata.TupleSchema; +import org.apache.drill.shaded.guava.com.google.common.collect.ImmutableList; + +public class ScanTestUtils { + + /** + * Type-safe way to define a list of parsers. + * @param parsers + * @return Review comment: add description This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Merge scan projection framework into master > --- > > Key: DRILL-6791 > URL: https://issues.apache.org/jira/browse/DRILL-6791 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.15.0 > > > Merge the next set of "result set loader" code into master via a PR. This one > covers the "schema projection" mechanism which: > * Handles none (SELECT COUNT\(*)), some (SELECT a, b, x) and all (SELECT *) > projection. > * Handles null columns (for projection a column "x" that does not exist in > the base table.) > * Handles constant columns as used for file metadata (AKA "implicit" columns). 
> * Handle schema persistence: the need to reuse the same vectors across > different scanners > * Provides a framework for consuming externally-supplied metadata > * Since we don't yet have a way to provide "real" metadata, obtains metadata > hints from previous batches and from the projection list (a.b implies that > "a" is a map, c[0] implies that "c" is an array, etc.) > * Handles merging the set of data source columns and null columns to create > the final output batch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686544#comment-16686544 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233454002 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/SmoothingProjection.java ## @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.scan.project; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.exec.physical.impl.scan.project.SchemaSmoother.IncompatibleSchemaException; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.metadata.TupleMetadata; + +/** + * Resolve a table schema against the prior schema. This works only if the + * types match and if all columns in the table schema already appear in the + * prior schema. + * + * Consider this an experimental mechanism. 
The hope was that, with clever + * techniques, we could "smooth over" some of the issues that cause schema + * change events in Drill. As it turned out, however, creating this mechanism + * revealed that it is not possible, even in theory, to handle most schema + * changes because of the time dimension: + * + * An event in a later batch may provide information that would have + * caused us to make a different decision in an earlier batch. For example, + * we are asked for column `foo`, did not see such a column in the first + * batch, block or file, guessed some type, and later saw that the column + * was of a different type. We can't "time travel" to tell our earlier + * selves, nor, when we make the initial type decision, can we jump to + * the future to see what type we'll discover. + * Readers in this fragment may see column `foo` but readers in + * another fragment read files/blocks that don't have that column. The + * two readers cannot communicate to agree on a type. + * + * + * What this mechanism can do is make decisions based on history: when a + * column appears, we can adjust its type a bit to try to avoid an + * unnecessary change. For example, if a prior file in this scan saw + * `foo` as nullable Varchar, but the present file has the column as + * required Varchar, we can use the more general nullable form. But, + * again, the "can't predict the future" bites us: we can handle a + * nullable-to-required column change, but not vice-versa. + * + * What this mechanism will tell the careful reader is that the only + * general solution to the schema-change problem is to know the full + * schema up front: for the planner to be told the schema and to + * communicate that schema to all readers so that all readers agree + * on the final schema. + * + * When that is done, the techniques shown here can be used to adjust + * any per-file variation of schema to match the up-front schema. 
+ */ + +public class SmoothingProjection extends SchemaLevelProjection { + + protected final List rewrittenFields = new ArrayList<>(); + + public SmoothingProjection(ScanLevelProjection scanProj, + TupleMetadata tableSchema, + ResolvedTuple priorSchema, + ResolvedTuple outputTuple, + List resolvers) throws IncompatibleSchemaException { + +super(resolvers); + +for (ResolvedColumn priorCol : priorSchema.columns()) { + switch (priorCol.nodeType()) { + case ResolvedTableColumn.ID: Review comment: indent This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Merge scan projection framework into master > --- > > Key: DRILL-6791 > URL:
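The one-way smoothing rule described in the Javadoc above (required can be widened to the prior nullable form, but nullable cannot be narrowed to required) reduces to a small decision table. The sketch below uses a stand-in `DataMode` enum and a hypothetical `smooth` helper; Drill's real `DataMode` lives in `TypeProtos`:

```java
// Sketch of the nullable-vs-required smoothing rule.
public class ModeSmoothingSketch {

  enum DataMode { REQUIRED, OPTIONAL }

  // Returns the smoothed mode, or null when the change is incompatible.
  static DataMode smooth(DataMode prior, DataMode current) {
    if (prior == current) {
      return prior;
    }
    if (prior == DataMode.OPTIONAL && current == DataMode.REQUIRED) {
      // Required data fits in a nullable vector: keep the prior, more
      // general form and avoid a schema change.
      return DataMode.OPTIONAL;
    }
    // Nullable-to-required cannot be smoothed: the nulls already seen
    // have nowhere to go in a required vector.
    return null;
  }

  public static void main(String[] args) {
    System.out.println(smooth(DataMode.OPTIONAL, DataMode.REQUIRED)); // OPTIONAL
    System.out.println(smooth(DataMode.REQUIRED, DataMode.OPTIONAL)); // null
  }
}
```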
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686541#comment-16686541 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233458346 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/scan/project/TestSchemaSmoothing.java ## @@ -0,0 +1,681 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.scan.project; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotEquals; +import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; + +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.impl.protocol.SchemaTracker; +import org.apache.drill.exec.physical.impl.scan.ScanTestUtils; +import org.apache.drill.exec.physical.impl.scan.project.NullColumnBuilder; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedColumn; +import org.apache.drill.exec.physical.impl.scan.project.ResolvedTuple.ResolvedRow; +import org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection; +import org.apache.drill.exec.physical.impl.scan.project.ScanSchemaOrchestrator; +import org.apache.drill.exec.physical.impl.scan.project.ScanSchemaOrchestrator.ReaderSchemaOrchestrator; +import org.apache.drill.exec.physical.impl.scan.project.SchemaSmoother; +import org.apache.drill.exec.physical.impl.scan.project.SchemaSmoother.IncompatibleSchemaException; +import org.apache.drill.exec.physical.impl.scan.project.SmoothingProjection; +import org.apache.drill.exec.physical.impl.scan.project.WildcardSchemaProjection; +import org.apache.drill.exec.physical.rowSet.ResultSetLoader; +import org.apache.drill.exec.physical.rowSet.impl.RowSetTestUtils; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.test.SubOperatorTest; +import org.apache.drill.test.rowSet.RowSet.SingleRowSet; +import org.apache.drill.test.rowSet.RowSetComparison; +import org.apache.drill.test.rowSet.schema.SchemaBuilder; +import org.junit.Test; + +/** + * Tests schema smoothing at the schema projection level. + * This level handles reusing prior types when filling null + * values. 
But, because no actual vectors are involved, it + * does not handle the schema chosen for a table ahead of + * time, only the schema as it is merged with prior schema to + * detect missing columns. + * + * Focuses on the SmoothingProjection class itself. + * + * Note that, at present, schema smoothing does not work for entire + * maps. That is, if file 1 has, say {a: {b: 10, c: "foo"}} + * and file 2 has, say, {a: null}, then schema smoothing does + * not currently know how to recreate the map. The same is true of + * lists and unions. Handling such cases is complex and is probably + * better handled via a system that allows the user to specify their + * intent by providing a schema to apply to the two files. + */ + +public class TestSchemaSmoothing extends SubOperatorTest { + + /** + * Low-level test of the smoothing projection, including the exceptions + * it throws when things are not going its way. + */ + + @Test + public void testSmoothingProjection() { +final ScanLevelProjection scanProj = new ScanLevelProjection( +RowSetTestUtils.projectAll(), +ScanTestUtils.parsers()); + +// Table 1: (a: nullable bigint, b) + +final TupleMetadata schema1 = new SchemaBuilder() +.addNullable("a", MinorType.BIGINT) +.addNullable("b", MinorType.VARCHAR) +.add("c", MinorType.FLOAT8) +.buildSchema(); +ResolvedRow priorSchema; +{ + final NullColumnBuilder builder = new NullColumnBuilder(null, false); + final ResolvedRow rootTuple = new ResolvedRow(builder); + new WildcardSchemaProjection( + scanProj, schema1, rootTuple, + ScanTestUtils.resolvers()); + priorSchema = rootTuple; +} + +// Table 2: (a:
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686534#comment-16686534 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233450250 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/MetadataManager.java ## @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.scan.project; + +import org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection.ScanProjectionParser; +import org.apache.drill.exec.physical.impl.scan.project.SchemaLevelProjection.SchemaProjectionResolver; +import org.apache.drill.exec.physical.rowSet.ResultVectorCache; + +/** + * Queries can contain a wildcard (*), table columns, or special + * system-defined columns (the file metadata columns AKA implicit + * columns, the `columns` column of CSV, etc.). + * + * This class provides a generalized way of handling such extended + * columns. 
That is, this handles metadata for columns defined by + * the scan or file; columns defined by the table (the actual + * data metadata) is handled elsewhere. + * + * Objects of this interface are driven by the projection processing + * framework which provides a vector cache from which to obtain + * materialized columns. The implementation must provide a projection + * parser to pick out the columns which this object handles. + * + * A better name might be ImplicitMetadataManager to signify that Review comment: Agree, let's rename to avoid the confusion in future. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Merge scan projection framework into master > --- > > Key: DRILL-6791 > URL: https://issues.apache.org/jira/browse/DRILL-6791 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.15.0 > > > Merge the next set of "result set loader" code into master via a PR. This one > covers the "schema projection" mechanism which: > * Handles none (SELECT COUNT\(*)), some (SELECT a, b, x) and all (SELECT *) > projection. > * Handles null columns (for projection a column "x" that does not exist in > the base table.) > * Handles constant columns as used for file metadata (AKA "implicit" columns). > * Handle schema persistence: the need to reuse the same vectors across > different scanners > * Provides a framework for consuming externally-supplied metadata > * Since we don't yet have a way to provide "real" metadata, obtains metadata > hints from previous batches and from the projection list (a.b implies that > "a" is a map, c[0] implies that "c" is an array, etc.) > * Handles merging the set of data source columns and null columns to create > the final output batch. 
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686538#comment-16686538 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233454141 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/SmoothingProjection.java ## @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.scan.project; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.exec.physical.impl.scan.project.SchemaSmoother.IncompatibleSchemaException; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.metadata.TupleMetadata; + +/** + * Resolve a table schema against the prior schema. This works only if the + * types match and if all columns in the table schema already appear in the + * prior schema. + * + * Consider this an experimental mechanism. 
The hope was that, with clever + * techniques, we could "smooth over" some of the issues that cause schema + * change events in Drill. As it turned out, however, creating this mechanism + * revealed that it is not possible, even in theory, to handle most schema + * changes because of the time dimension: + * + * An event in a later batch may provide information that would have + * caused us to make a different decision in an earlier batch. For example, + * we are asked for column `foo`, did not see such a column in the first + * batch, block or file, guessed some type, and later saw that the column + * was of a different type. We can't "time travel" to tell our earlier + * selves, nor, when we make the initial type decision, can we jump to + * the future to see what type we'll discover. + * Readers in this fragment may see column `foo` but readers in + * another fragment read files/blocks that don't have that column. The + * two readers cannot communicate to agree on a type. + * + * + * What this mechanism can do is make decisions based on history: when a + * column appears, we can adjust its type a bit to try to avoid an + * unnecessary change. For example, if a prior file in this scan saw + * `foo` as nullable Varchar, but the present file has the column as + * required Varchar, we can use the more general nullable form. But, + * again, the "can't predict the future" bites us: we can handle a + * nullable-to-required column change, but not vice-versa. + * + * What this mechanism will tell the careful reader is that the only + * general solution to the schema-change problem is to know the full + * schema up front: for the planner to be told the schema and to + * communicate that schema to all readers so that all readers agree + * on the final schema. + * + * When that is done, the techniques shown here can be used to adjust + * any per-file variation of schema to match the up-front schema. 
+ */ + +public class SmoothingProjection extends SchemaLevelProjection { + + protected final List rewrittenFields = new ArrayList<>(); + + public SmoothingProjection(ScanLevelProjection scanProj, + TupleMetadata tableSchema, + ResolvedTuple priorSchema, + ResolvedTuple outputTuple, + List resolvers) throws IncompatibleSchemaException { + +super(resolvers); + +for (ResolvedColumn priorCol : priorSchema.columns()) { + switch (priorCol.nodeType()) { + case ResolvedTableColumn.ID: + case ResolvedNullColumn.ID: +// TODO: To fix this, the null column loader must declare Review comment: Please explain this todo This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Merge scan projection framework into master >
[jira] [Commented] (DRILL-6791) Merge scan projection framework into master
[ https://issues.apache.org/jira/browse/DRILL-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686533#comment-16686533 ] ASF GitHub Bot commented on DRILL-6791: --- arina-ielchiieva commented on a change in pull request #1501: DRILL-6791: Scan projection framework URL: https://github.com/apache/drill/pull/1501#discussion_r233449683 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/scan/project/ExplicitSchemaProjection.java ## @@ -0,0 +1,253 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.scan.project; + +import java.util.List; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.physical.rowSet.project.RequestedTuple; +import org.apache.drill.exec.physical.rowSet.project.RequestedTuple.RequestedColumn; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.metadata.ColumnMetadata; +import org.apache.drill.exec.record.metadata.TupleMetadata; + +/** + * Perform a schema projection for the case of an explicit list of + * projected columns. Example: SELECT a, b, c. 
+ * + * An explicit projection starts with the requested set of columns, + * then looks in the table schema to find matches. That is, it is + * driven by the query itself. + * + * An explicit projection may include columns that do not exist in + * the source schema. In this case, we fill in null columns for + * unmatched projections. + */ + +public class ExplicitSchemaProjection extends SchemaLevelProjection { + + public ExplicitSchemaProjection(ScanLevelProjection scanProj, + TupleMetadata tableSchema, + ResolvedTuple rootTuple, + List resolvers) { +super(resolvers); +resolveRootTuple(scanProj, rootTuple, tableSchema); + } + + private void resolveRootTuple(ScanLevelProjection scanProj, + ResolvedTuple rootTuple, + TupleMetadata tableSchema) { +for (ColumnProjection col : scanProj.columns()) { + if (col.nodeType() == UnresolvedColumn.UNRESOLVED) { +resolveColumn(rootTuple, ((UnresolvedColumn) col).element(), tableSchema); + } else { +resolveSpecial(rootTuple, col, tableSchema); + } +} + } + + private void resolveColumn(ResolvedTuple outputTuple, + RequestedColumn inputCol, TupleMetadata tableSchema) { +int tableColIndex = tableSchema.index(inputCol.name()); +if (tableColIndex == -1) { + resolveNullColumn(outputTuple, inputCol); +} else { + resolveTableColumn(outputTuple, inputCol, + tableSchema.metadata(tableColIndex), + tableColIndex); +} + } + + private void resolveTableColumn(ResolvedTuple outputTuple, + RequestedColumn requestedCol, + ColumnMetadata column, int sourceIndex) { + +// Is the requested column implied to be a map? +// A requested column is a map if the user requests x.y and we +// are resolving column x. The presence of y as a member implies +// that x is a map. + +if (requestedCol.isTuple()) { + resolveMap(outputTuple, requestedCol, column, sourceIndex); +} + +// Is the requested column implied to be an array? +// This occurs when the projection list contains at least one +// array index reference such as x[10]. 
+ +else if (requestedCol.isArray()) { + resolveArray(outputTuple, requestedCol, column, sourceIndex); +} + +// A plain old column. Might be an array or a map, but if +// so, the request list just mentions it by name without implying +// the column type. That is, the project list just contains x +// by itself. + +else { + projectTableColumn(outputTuple, requestedCol, column, sourceIndex); +} + } + + private void resolveMap(ResolvedTuple outputTuple, + RequestedColumn requestedCol, ColumnMetadata column, + int sourceIndex) { + +// If the actual column isn't a map, then the request is invalid. + +if (! column.isMap()) { + throw UserException +.validationError() +.message("Project list implies a map
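The resolveColumn flow quoted above — look the requested name up in the table schema, project the table column on a match, otherwise fall back to a null column — can be paraphrased in a few lines. The helper below is illustrative only, using plain strings in place of Drill's ResolvedTuple machinery:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy version of explicit-projection resolution: each requested column
// either matches a table column by name or resolves to a null column.
public class ExplicitProjectionSketch {

  static List<String> resolve(List<String> projectList, List<String> tableSchema) {
    List<String> resolved = new ArrayList<>();
    for (String col : projectList) {
      if (tableSchema.contains(col)) {
        resolved.add(col + ":table");   // found in the source schema
      } else {
        resolved.add(col + ":null");    // fill with a null column
      }
    }
    return resolved;
  }

  public static void main(String[] args) {
    // SELECT a, b, x against a table with columns (a, b, c).
    List<String> result = resolve(
        Arrays.asList("a", "b", "x"),
        Arrays.asList("a", "b", "c"));
    System.out.println(result);  // [a:table, b:table, x:null]
  }
}
```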
[jira] [Commented] (DRILL-6847) Add Query Metadata to RESTful Interface
[ https://issues.apache.org/jira/browse/DRILL-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686531#comment-16686531 ] ASF GitHub Bot commented on DRILL-6847: --- arina-ielchiieva commented on a change in pull request #1539: DRILL-6847: Add Query Metadata to RESTful Interface URL: https://github.com/apache/drill/pull/1539#discussion_r233457684 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebUserConnection.java ## @@ -106,7 +110,10 @@ public void sendData(RpcOutcomeListener listener, QueryWritableBatch result // TODO: Clean: DRILL-2933: That load(...) no longer throws // SchemaChangeException, so check/clean catch clause below. for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) { - columns.add(loader.getSchema().getColumn(i).getName()); + + MaterializedField col = loader.getSchema().getColumn(i); + columns.add(col.getName()); + metadata.add(col.getType().getMinorType().name()); Review comment: 1. Duplicating column name does not make sense. 2. You may not output precision and scale if they are absent; it depends on which Object you plan to deserialize this information into. 3. Look at major type, for example. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Query Metadata to RESTful Interface > --- > > Key: DRILL-6847 > URL: https://issues.apache.org/jira/browse/DRILL-6847 > Project: Apache Drill > Issue Type: Improvement > Components: Metadata >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Minor > > The Drill RESTful interface does not return the structure of the query > results. This makes integrating Drill with other BI tools difficult because > they do not know what kind of data to expect. 
> This PR adds a new section to the results called Metadata which contains a > list of the minor types of all the columns returned. > The query below will now return the following in the RESTful interface: > {code:sql} > SELECT CAST( employee_id AS INT) AS employee_id, > full_name, > first_name, > last_name, > CAST( position_id AS BIGINT) AS position_id, > position_title > FROM cp.`employee.json` LIMIT 2 > {code} > {code} > { > "queryId": "2414bf3f-b4f4-d4df-825f-73dfb3a56681", > "columns": [ > "employee_id", > "full_name", > "first_name", > "last_name", > "position_id", > "position_title" > ], > "metadata": [ > "INT", > "VARCHAR", > "VARCHAR", > "VARCHAR", > "BIGINT", > "VARCHAR" > ], > "rows": [ > { > "full_name": "Sheri Nowmer", > "employee_id": "1", > "last_name": "Nowmer", > "position_title": "President", > "first_name": "Sheri", > "position_id": "1" > }, > { > "full_name": "Derrick Whelply", > "employee_id": "2", > "last_name": "Whelply", > "position_title": "VP Country Manager", > "first_name": "Derrick", > "position_id": "2" > } > ] > } > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
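Because "columns" and "metadata" are parallel arrays in the example payload above, a client pairs them by index. A minimal sketch of that pairing follows; this is hypothetical client-side code, not part of Drill, and uses a subset of the column names from the example:

```java
import java.util.ArrayList;
import java.util.List;

// Pair the parallel "columns" and "metadata" arrays from the REST
// response into per-column name:type entries.
public class RestMetadataSketch {
  public static void main(String[] args) {
    String[] columns  = { "employee_id", "full_name", "position_id" };
    String[] metadata = { "INT", "VARCHAR", "BIGINT" };

    List<String> paired = new ArrayList<>();
    for (int i = 0; i < columns.length; i++) {
      paired.add(columns[i] + ":" + metadata[i]);
    }
    System.out.println(paired);
  }
}
```

The index-based pairing is exactly why the review discussion below turns to a richer per-column object (name, type, precision, scale): parallel arrays force the client to keep the two lists in lock-step.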
[jira] [Commented] (DRILL-6847) Add Query Metadata to RESTful Interface
[ https://issues.apache.org/jira/browse/DRILL-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686523#comment-16686523 ] ASF GitHub Bot commented on DRILL-6847: --- cgivre commented on a change in pull request #1539: DRILL-6847: Add Query Metadata to RESTful Interface URL: https://github.com/apache/drill/pull/1539#discussion_r233456389 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebUserConnection.java ## @@ -106,7 +110,10 @@ public void sendData(RpcOutcomeListener listener, QueryWritableBatch result // TODO: Clean: DRILL-2933: That load(...) no longer throws // SchemaChangeException, so check/clean catch clause below. for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) { - columns.add(loader.getSchema().getColumn(i).getName()); + + MaterializedField col = loader.getSchema().getColumn(i); + columns.add(col.getName()); + metadata.add(col.getType().getMinorType().name()); Review comment: How would you recommend designing that? I was trying to keep this PR relatively simple, and backwards compatible, but one option might be to make the metadata a little duplicative so something like: ``` "metadata": [{ "name": "price", "type": "FLOAT4" "precision": "scale" },{ "name": "customer", "type": "VARCHAR" ... ] ``` Do you know off hand where the precision/scale or any other attributes of the columns can be accessed? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Query Metadata to RESTful Interface > --- > > Key: DRILL-6847 > URL: https://issues.apache.org/jira/browse/DRILL-6847 > Project: Apache Drill > Issue Type: Improvement > Components: Metadata >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Minor > > The Drill RESTful interface does not return the structure of the query > results. 
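The richer metadata entry discussed in the review thread above — carrying name, type, and optional precision/scale instead of a bare type string — can be sketched as a small value object. All names here (ColumnMetadataSketch, toMap) are illustrative, not Drill's actual API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ColumnMetadataSketch {

  private final String name;
  private final String type;       // minor type name, e.g. "VARCHAR"
  private final Integer precision; // null when not applicable
  private final Integer scale;     // null when not applicable

  public ColumnMetadataSketch(String name, String type, Integer precision, Integer scale) {
    this.name = name;
    this.type = type;
    this.precision = precision;
    this.scale = scale;
  }

  // Emit only the attributes that are present, matching the reviewer's point
  // that precision and scale should be omitted when absent.
  public Map<String, Object> toMap() {
    Map<String, Object> m = new LinkedHashMap<>();
    m.put("name", name);
    m.put("type", type);
    if (precision != null) {
      m.put("precision", precision);
    }
    if (scale != null) {
      m.put("scale", scale);
    }
    return m;
  }

  public static void main(String[] args) {
    System.out.println(new ColumnMetadataSketch("price", "VARDECIMAL", 10, 2).toMap());
    System.out.println(new ColumnMetadataSketch("customer", "VARCHAR", null, null).toMap());
  }
}
```

Serializing a list of such objects keeps the response self-describing while remaining close to the flat metadata array already in the PR.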
[jira] [Commented] (DRILL-6847) Add Query Metadata to RESTful Interface
[ https://issues.apache.org/jira/browse/DRILL-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686436#comment-16686436 ] ASF GitHub Bot commented on DRILL-6847: --- arina-ielchiieva commented on a change in pull request #1539: DRILL-6847: Add Query Metadata to RESTful Interface URL: https://github.com/apache/drill/pull/1539#discussion_r233423505 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebUserConnection.java ## @@ -106,7 +110,10 @@ public void sendData(RpcOutcomeListener listener, QueryWritableBatch result // TODO: Clean: DRILL-2933: That load(...) no longer throws // SchemaChangeException, so check/clean catch clause below. for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) { - columns.add(loader.getSchema().getColumn(i).getName()); + + MaterializedField col = loader.getSchema().getColumn(i); + columns.add(col.getName()); + metadata.add(col.getType().getMinorType().name()); Review comment: I see, but even though it is not your use case, I think we should consider shipping not only the string type but also precision and scale information for those who might need it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Query Metadata to RESTful Interface > --- > > Key: DRILL-6847 > URL: https://issues.apache.org/jira/browse/DRILL-6847 > Project: Apache Drill > Issue Type: Improvement > Components: Metadata >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Minor > > The Drill RESTful interface does not return the structure of the query > results. 
[jira] [Commented] (DRILL-6847) Add Query Metadata to RESTful Interface
[ https://issues.apache.org/jira/browse/DRILL-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686379#comment-16686379 ] ASF GitHub Bot commented on DRILL-6847: --- cgivre commented on a change in pull request #1539: DRILL-6847: Add Query Metadata to RESTful Interface URL: https://github.com/apache/drill/pull/1539#discussion_r233405972 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebUserConnection.java ## @@ -106,7 +110,10 @@ public void sendData(RpcOutcomeListener listener, QueryWritableBatch result // TODO: Clean: DRILL-2933: That load(...) no longer throws // SchemaChangeException, so check/clean catch clause below. for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) { - columns.add(loader.getSchema().getColumn(i).getName()); + + MaterializedField col = loader.getSchema().getColumn(i); + columns.add(col.getName()); + metadata.add(col.getType().getMinorType().name()); Review comment: Hi @arina-ielchiieva , The use case I had in mind was integrating Drill with SQLPad and Apache Superset. In these instances basically, the UI needed to know if a field was numeric, temporal of any sort, or text so that it could render visualizations properly. I'm sure there are other use cases out there, but I know that for me at least, this was a major blocker in getting Drill to work with the various BI tools. The JDBC interface provided this information, but the RESTful interface did not, so I had to resort to hackery. So to answer your question, it might be useful for other use cases to provide precision and scale, but for the one I had in mind, that would not be helpful. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
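The use case described in the comment above boils down to mapping each minor-type name in the metadata array to a coarse category (numeric, temporal, text) that a BI front end can act on when rendering visualizations. A minimal sketch, covering only an illustrative subset of Drill's minor-type names:

```java
public class TypeCategorySketch {

  // Coarse classification of minor-type names; the set of names handled here
  // is a small illustrative subset, not Drill's full type system.
  public static String categoryOf(String minorType) {
    switch (minorType) {
      case "INT":
      case "BIGINT":
      case "SMALLINT":
      case "FLOAT4":
      case "FLOAT8":
        return "numeric";
      case "DATE":
      case "TIME":
      case "TIMESTAMP":
        return "temporal";
      default:
        return "text"; // VARCHAR and anything unrecognized renders as text
    }
  }

  public static void main(String[] args) {
    // The metadata array from the example query in this issue:
    String[] metadata = {"INT", "VARCHAR", "VARCHAR", "VARCHAR", "BIGINT", "VARCHAR"};
    for (String type : metadata) {
      System.out.println(type + " -> " + categoryOf(type));
    }
  }
}
```

A client of the RESTful interface would zip this classification with the columns array to decide, per column, which chart types apply.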
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686353#comment-16686353 ] ASF GitHub Bot commented on DRILL-6744: --- vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down URL: https://github.com/apache/drill/pull/1537#discussion_r233135431 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java ## @@ -132,62 +129,163 @@ public ParquetMetaStatCollector(ParquetTableMetadataBase parquetTableMetadata, } /** - * Builds column statistics using given primitiveType, originalType, scale, - * precision, numNull, min and max values. + * Helper class that creates parquet {@link ColumnStatistics} based on given + * min and max values, type, number of nulls, precision and scale. * - * @param min min value for statistics - * @param max max value for statistics - * @param numNullsnum_nulls for statistics - * @param primitiveType type that determines statistics class - * @param originalTypetype that determines statistics class - * @param scale scale value (used for DECIMAL type) - * @param precision precision value (used for DECIMAL type) - * @return column statistics */ - private ColumnStatistics getStat(Object min, Object max, long numNulls, - PrimitiveType.PrimitiveTypeName primitiveType, OriginalType originalType, - int scale, int precision) { -Statistics stat = Statistics.getStatsBasedOnType(primitiveType); -Statistics convertedStat = stat; - -TypeProtos.MajorType type = ParquetReaderUtility.getType(primitiveType, originalType, scale, precision); -stat.setNumNulls(numNulls); - -if (min != null && max != null ) { - switch (type.getMinorType()) { - case INT : - case TIME: -((IntStatistics) stat).setMinMax(Integer.parseInt(min.toString()), Integer.parseInt(max.toString())); -break; - case BIGINT: - case TIMESTAMP: -((LongStatistics) stat).setMinMax(Long.parseLong(min.toString()), 
Long.parseLong(max.toString())); -break; - case FLOAT4: -((FloatStatistics) stat).setMinMax(Float.parseFloat(min.toString()), Float.parseFloat(max.toString())); -break; - case FLOAT8: -((DoubleStatistics) stat).setMinMax(Double.parseDouble(min.toString()), Double.parseDouble(max.toString())); -break; - case DATE: -convertedStat = new LongStatistics(); -convertedStat.setNumNulls(stat.getNumNulls()); -final long minMS = convertToDrillDateValue(Integer.parseInt(min.toString())); -final long maxMS = convertToDrillDateValue(Integer.parseInt(max.toString())); -((LongStatistics) convertedStat ).setMinMax(minMS, maxMS); -break; - case BIT: -((BooleanStatistics) stat).setMinMax(Boolean.parseBoolean(min.toString()), Boolean.parseBoolean(max.toString())); -break; - default: - } + private static class ColumnStatisticsBuilder { + +private Object min; +private Object max; +private long numNulls; +private PrimitiveType.PrimitiveTypeName primitiveType; +private OriginalType originalType; +private int scale; +private int precision; + +static ColumnStatisticsBuilder builder() { + return new ColumnStatisticsBuilder(); } -return new ColumnStatistics(convertedStat, type); - } +ColumnStatisticsBuilder setMin(Object min) { + this.min = min; + return this; +} + +ColumnStatisticsBuilder setMax(Object max) { + this.max = max; + return this; +} + +ColumnStatisticsBuilder setNumNulls(long numNulls) { + this.numNulls = numNulls; + return this; +} + +ColumnStatisticsBuilder setPrimitiveType(PrimitiveType.PrimitiveTypeName primitiveType) { + this.primitiveType = primitiveType; + return this; +} + +ColumnStatisticsBuilder setOriginalType(OriginalType originalType) { + this.originalType = originalType; + return this; +} - private static long convertToDrillDateValue(int dateValue) { +ColumnStatisticsBuilder setScale(int scale) { + this.scale = scale; + return this; +} + +ColumnStatisticsBuilder setPrecision(int precision) { + this.precision = precision; + return this; +} + + +/** + * Builds column 
statistics using given primitive and original types, + * scale, precision, number of nulls, min and max values. + * Min and max values for binary statistics are set only if allowed. + * + * @return column statistics + */ +ColumnStatistics build() { + Statistics stat = Statistics.getStatsBasedOnType(primitiveType); + Statistics convertedStat
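The refactoring quoted above replaces a seven-argument getStat(...) call with a fluent builder. A simplified stand-in — plain objects instead of parquet Statistics, and a string result instead of ColumnStatistics — shows the resulting call pattern:

```java
import java.util.Objects;

public class ColumnStatisticsBuilderSketch {

  // Each setter stores a value and returns `this`, so a statistics object is
  // assembled in one readable chain instead of a long positional call.
  private Object min;
  private Object max;
  private long numNulls;

  public static ColumnStatisticsBuilderSketch builder() {
    return new ColumnStatisticsBuilderSketch();
  }

  public ColumnStatisticsBuilderSketch setMin(Object min) { this.min = min; return this; }
  public ColumnStatisticsBuilderSketch setMax(Object max) { this.max = max; return this; }
  public ColumnStatisticsBuilderSketch setNumNulls(long n) { this.numNulls = n; return this; }

  // Stand-in for build(): the real code creates type-specific parquet
  // Statistics here; this sketch just reports the collected values.
  public String build() {
    return "stats[min=" + Objects.toString(min) + ", max=" + Objects.toString(max)
        + ", nulls=" + numNulls + "]";
  }

  public static void main(String[] args) {
    System.out.println(builder().setMin(1).setMax(42).setNumNulls(0).build());
  }
}
```

The builder also makes optional attributes (scale, precision, original type) easy to skip without overloads, which is the main win over the replaced method.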
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686349#comment-16686349 ] ASF GitHub Bot commented on DRILL-6744: --- vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down URL: https://github.com/apache/drill/pull/1537#discussion_r233097319 ## File path: common/src/main/java/org/apache/drill/common/VersionUtil.java ## @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.common; + +import org.apache.maven.artifact.versioning.DefaultArtifactVersion; + +/** + * Utility class for project version. + */ +public class VersionUtil { Review comment: Looks like `hadoop-common` has the class which does similar things: `org.apache.hadoop.util.VersionUtil`. Can we replace this class with hadoop one? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
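Whichever utility class is kept, the job is ordering dotted version strings numerically so that files written by parquet-mr 1.9.1 or later can be trusted. A minimal sketch of that comparison — not Hadoop's or Drill's implementation, and assuming purely numeric segments (no -SNAPSHOT qualifiers):

```java
public class VersionCompareSketch {

  // Compare dotted versions segment by segment; a missing segment counts as 0,
  // so "1.10" and "1.10.0" compare equal.
  public static int compareVersions(String a, String b) {
    String[] pa = a.split("\\.");
    String[] pb = b.split("\\.");
    int n = Math.max(pa.length, pb.length);
    for (int i = 0; i < n; i++) {
      int xa = i < pa.length ? Integer.parseInt(pa[i]) : 0;
      int xb = i < pb.length ? Integer.parseInt(pb[i]) : 0;
      if (xa != xb) {
        return Integer.compare(xa, xb);
      }
    }
    return 0;
  }

  public static void main(String[] args) {
    // Lexicographic comparison would get this wrong ("1.10.0" < "1.9.1");
    // numeric comparison orders it correctly.
    System.out.println(compareVersions("1.10.0", "1.9.1")); // 1
  }
}
```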
For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support filter push down for varchar / decimal data types > - > > Key: DRILL-6744 > URL: https://issues.apache.org/jira/browse/DRILL-6744 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.14.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.15.0 > > > Since now Drill is using Apache Parquet 1.10.0 where issue with incorrectly > stored varchar / decimal min / max statistics is resolved, we should add > support for varchar / decimal filter push down. Only files created with > parquet lib 1.9.1 (1.10.0)) and later will be subjected to push down. In > cases if user knows that prior created files have correct min / max > statistics (i.e. user exactly knows that data in binary columns in ASCII (not > UTF-8)) than parquet.strings.signed-min-max.enabled can be set to true to > enable filter push down. > *Description* > _Note: Drill is using Parquet 1.10.0 library since 1.13.0 version._ > *Varchar Partition Pruning* > Varchar Pruning will work for files generated prior and after Parquet 1.10.0 > version, since to enable partition pruning both min and max values should be > the same and there are no issues with incorrectly stored statistics for > binary data for the same min and max values. Partition pruning using Drill > metadata files will also work, no matter when metadata file was created > (prior or after Drill 1.15.0). > Partition pruning won't work for files where partition is null due to > PARQUET-1341, issue will be fixed in Parquet 1.11.0. > *Varchar Filter Push Down* > Varchar filter push down will work for parquet files created with Parquet > 1.10.0 and later. > There are two options how to enable push down for files generated with prior > Parquet versions, when user exactly knows that binary data is in ASCII (not > UTF-8): > 1. 
set configuration {{enableStringsSignedMinMax}} to true (false by default) > for parquet format plugin: > {noformat} > "parquet" : { > type: "parquet", > enableStringsSignedMinMax: true > } > {noformat} > This would apply to all parquet files of a given file plugin, including all > workspaces. > 2. If user wants to enable / disable allowing reading binary statistics for > old parquet files per session, session option > {{store.parquet.reader.strings_signed_min_max}} can be used. By default, it > has empty string value. Setting such option will take priority over config in > parquet format plugin. Option allows three values: 'true', 'false', '' (empty > string). > _Note: store.parquet.reader.strings_signed_min_max also can be set at system > level, thus it will apply to all parquet files in the system._ > The same config / session option will apply to allow reading binary >
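The precedence described above — a non-empty store.parquet.reader.strings_signed_min_max session option overriding the format plugin's enableStringsSignedMinMax flag — can be sketched as a single resolution function (method and parameter names are illustrative):

```java
public class SignedMinMaxOptionSketch {

  // A non-empty session option value ('true' or 'false') takes priority;
  // the empty-string default falls back to the format plugin config.
  public static boolean resolveStringsSignedMinMax(String sessionOption,
                                                   boolean formatConfigValue) {
    if (sessionOption != null && !sessionOption.isEmpty()) {
      return Boolean.parseBoolean(sessionOption);
    }
    return formatConfigValue;
  }

  public static void main(String[] args) {
    System.out.println(resolveStringsSignedMinMax("", true));      // config wins: true
    System.out.println(resolveStringsSignedMinMax("false", true)); // session wins: false
  }
}
```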
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686351#comment-16686351 ] ASF GitHub Bot commented on DRILL-6744: --- vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down URL: https://github.com/apache/drill/pull/1537#discussion_r233127110 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java ## @@ -0,0 +1,175 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.store.parquet; + +import com.fasterxml.jackson.annotation.JsonInclude; +import org.apache.drill.exec.ExecConstants; +import org.apache.drill.exec.server.options.OptionManager; +import org.apache.hadoop.conf.Configuration; +import org.apache.parquet.ParquetReadOptions; + +import java.util.Objects; + +import static org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER; + +/** + * Stores consolidated parquet reading configuration. 
Can obtain config values from various sources: + * Assignment priority of configuration values is the following: + * parquet format config + * Hadoop configuration + * session options + * + * During serialization does not deserialize the default values in keep serialized object smaller. + * Should be initialized using {@link Builder}, constructor is made public only for ser / de purposes. + */ +@JsonInclude(JsonInclude.Include.NON_DEFAULT) +public class ParquetReaderConfig { + + public static final String ENABLE_BYTES_READ_COUNTER = "parquet.benchmark.bytes.read"; + public static final String ENABLE_BYTES_TOTAL_COUNTER = "parquet.benchmark.bytes.total"; + public static final String ENABLE_TIME_READ_COUNTER = "parquet.benchmark.time.read"; + + // keep variables public for ser / de to avoid creating getters and constructor with params for all variables + // add defaults to keep deserialized object smaller + public boolean enableBytesReadCounter = false; + public boolean enableBytesTotalCounter = false; + public boolean enableTimeReadCounter = false; + public boolean autoCorrectCorruptedDates = true; + public boolean enableStringsSignedMinMax = false; + + public static ParquetReaderConfig.Builder builder() { +return new ParquetReaderConfig.Builder(); + } + + public static ParquetReaderConfig getDefaultInstance() { +return new ParquetReaderConfig(); + } + + // default constructor should be used only for ser / de and testing + public ParquetReaderConfig() { } + + public boolean autoCorrectCorruptedDates() { +return autoCorrectCorruptedDates; + } + + public boolean enableStringsSignedMinMax() { +return enableStringsSignedMinMax; + } + + public ParquetReadOptions toReadOptions() { +return ParquetReadOptions.builder() + .withMetadataFilter(NO_FILTER) Review comment: `NO_FILTER` is set in `ParquetReadOptions.Builder` by default, so it may be removed. This is an automated message from the Apache Git Service. 
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686354#comment-16686354 ] ASF GitHub Bot commented on DRILL-6744: --- vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down URL: https://github.com/apache/drill/pull/1537#discussion_r233394516 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java ## @@ -132,62 +129,163 @@ public ParquetMetaStatCollector(ParquetTableMetadataBase parquetTableMetadata, } /** - * Builds column statistics using given primitiveType, originalType, scale, - * precision, numNull, min and max values. + * Helper class that creates parquet {@link ColumnStatistics} based on given + * min and max values, type, number of nulls, precision and scale. * - * @param min min value for statistics - * @param max max value for statistics - * @param numNullsnum_nulls for statistics - * @param primitiveType type that determines statistics class - * @param originalTypetype that determines statistics class - * @param scale scale value (used for DECIMAL type) - * @param precision precision value (used for DECIMAL type) - * @return column statistics */ - private ColumnStatistics getStat(Object min, Object max, long numNulls, - PrimitiveType.PrimitiveTypeName primitiveType, OriginalType originalType, - int scale, int precision) { -Statistics stat = Statistics.getStatsBasedOnType(primitiveType); -Statistics convertedStat = stat; - -TypeProtos.MajorType type = ParquetReaderUtility.getType(primitiveType, originalType, scale, precision); -stat.setNumNulls(numNulls); - -if (min != null && max != null ) { - switch (type.getMinorType()) { - case INT : - case TIME: -((IntStatistics) stat).setMinMax(Integer.parseInt(min.toString()), Integer.parseInt(max.toString())); -break; - case BIGINT: - case TIMESTAMP: -((LongStatistics) stat).setMinMax(Long.parseLong(min.toString()), 
Long.parseLong(max.toString())); -break; - case FLOAT4: -((FloatStatistics) stat).setMinMax(Float.parseFloat(min.toString()), Float.parseFloat(max.toString())); -break; - case FLOAT8: -((DoubleStatistics) stat).setMinMax(Double.parseDouble(min.toString()), Double.parseDouble(max.toString())); -break; - case DATE: -convertedStat = new LongStatistics(); -convertedStat.setNumNulls(stat.getNumNulls()); -final long minMS = convertToDrillDateValue(Integer.parseInt(min.toString())); -final long maxMS = convertToDrillDateValue(Integer.parseInt(max.toString())); -((LongStatistics) convertedStat ).setMinMax(minMS, maxMS); -break; - case BIT: -((BooleanStatistics) stat).setMinMax(Boolean.parseBoolean(min.toString()), Boolean.parseBoolean(max.toString())); -break; - default: - } + private static class ColumnStatisticsBuilder { + +private Object min; +private Object max; +private long numNulls; +private PrimitiveType.PrimitiveTypeName primitiveType; +private OriginalType originalType; +private int scale; +private int precision; + +static ColumnStatisticsBuilder builder() { + return new ColumnStatisticsBuilder(); } -return new ColumnStatistics(convertedStat, type); - } +ColumnStatisticsBuilder setMin(Object min) { + this.min = min; + return this; +} + +ColumnStatisticsBuilder setMax(Object max) { + this.max = max; + return this; +} + +ColumnStatisticsBuilder setNumNulls(long numNulls) { + this.numNulls = numNulls; + return this; +} + +ColumnStatisticsBuilder setPrimitiveType(PrimitiveType.PrimitiveTypeName primitiveType) { + this.primitiveType = primitiveType; + return this; +} + +ColumnStatisticsBuilder setOriginalType(OriginalType originalType) { + this.originalType = originalType; + return this; +} - private static long convertToDrillDateValue(int dateValue) { +ColumnStatisticsBuilder setScale(int scale) { + this.scale = scale; + return this; +} + +ColumnStatisticsBuilder setPrecision(int precision) { + this.precision = precision; + return this; +} + + +/** + * Builds column 
statistics using given primitive and original types, + * scale, precision, number of nulls, min and max values. + * Min and max values for binary statistics are set only if allowed. + * + * @return column statistics + */ +ColumnStatistics build() { + Statistics stat = Statistics.getStatsBasedOnType(primitiveType); + Statistics convertedStat
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686355#comment-16686355 ] ASF GitHub Bot commented on DRILL-6744: --- vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down URL: https://github.com/apache/drill/pull/1537#discussion_r233397050 ## File path: exec/java-exec/src/test/java/org/apache/drill/exec/store/parquet/TestParquetReaderConfig.java ## @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.store.parquet; + +import com.fasterxml.jackson.databind.ObjectMapper; +import org.apache.drill.common.config.DrillConfig; +import org.apache.drill.exec.ExecConstants; +import org.apache.drill.exec.server.options.SystemOptionManager; +import org.apache.hadoop.conf.Configuration; +import org.apache.parquet.ParquetReadOptions; +import org.junit.Test; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; + +public class TestParquetReaderConfig { + + @Test + public void testDefaultsDeserialization() throws Exception { +ObjectMapper mapper = new ObjectMapper(); +ParquetReaderConfig readerConfig = ParquetReaderConfig.builder().build(); // all defaults +String value = mapper.writeValueAsString(readerConfig); +assertEquals("{}", value); + +readerConfig = mapper.readValue(value, ParquetReaderConfig.class); +assertTrue(readerConfig.autoCorrectCorruptedDates); // check that default value is restored + +readerConfig.autoCorrectCorruptedDates = false; // change the default +readerConfig.enableStringsSignedMinMax = false; // update to the default + +value = mapper.writeValueAsString(readerConfig); +assertEquals("{\"autoCorrectCorruptedDates\":false}", value); + } + + @Test + public void testAddConfigToConf() { +Configuration conf = new Configuration(); +conf.setBoolean(ParquetReaderConfig.ENABLE_BYTES_READ_COUNTER, true); +conf.setBoolean(ParquetReaderConfig.ENABLE_BYTES_TOTAL_COUNTER, true); +conf.setBoolean(ParquetReaderConfig.ENABLE_TIME_READ_COUNTER, true); + +ParquetReaderConfig readerConfig = ParquetReaderConfig.builder().withConf(conf).build(); +Configuration newConf = readerConfig.addCountersToConf(new Configuration()); +checkConfigValue(newConf, ParquetReaderConfig.ENABLE_BYTES_READ_COUNTER, "true"); +checkConfigValue(newConf, ParquetReaderConfig.ENABLE_BYTES_TOTAL_COUNTER, "true"); 
+checkConfigValue(newConf, ParquetReaderConfig.ENABLE_TIME_READ_COUNTER, "true"); + +conf = new Configuration(); +conf.setBoolean(ParquetReaderConfig.ENABLE_BYTES_READ_COUNTER, false); +conf.setBoolean(ParquetReaderConfig.ENABLE_BYTES_TOTAL_COUNTER, false); +conf.setBoolean(ParquetReaderConfig.ENABLE_TIME_READ_COUNTER, false); + +readerConfig = ParquetReaderConfig.builder().withConf(conf).build(); +newConf = readerConfig.addCountersToConf(new Configuration()); +checkConfigValue(newConf, ParquetReaderConfig.ENABLE_BYTES_READ_COUNTER, "false"); +checkConfigValue(newConf, ParquetReaderConfig.ENABLE_BYTES_TOTAL_COUNTER, "false"); +checkConfigValue(newConf, ParquetReaderConfig.ENABLE_TIME_READ_COUNTER, "false"); + } + + @Test + public void testReadOptions() { +ParquetReaderConfig readerConfig = new ParquetReaderConfig(); Review comment: Looks like we contravene the recommendation from its Javadoc This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Support filter push down for varchar / decimal data types > - > > Key: DRILL-6744 > URL: https://issues.apache.org/jira/browse/DRILL-6744 > Project: Apache Drill >
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686352#comment-16686352 ] ASF GitHub Bot commented on DRILL-6744:
---
vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down
URL: https://github.com/apache/drill/pull/1537#discussion_r233390608

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetReaderConfig.java
@@ -0,0 +1,175 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.store.parquet;
+
+import com.fasterxml.jackson.annotation.JsonInclude;
+import org.apache.drill.exec.ExecConstants;
+import org.apache.drill.exec.server.options.OptionManager;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.parquet.ParquetReadOptions;
+
+import java.util.Objects;
+
+import static org.apache.parquet.format.converter.ParquetMetadataConverter.NO_FILTER;
+
+/**
+ * Stores consolidated parquet reading configuration.
+ * Can obtain config values from various sources.
+ * Assignment priority of configuration values is the following:
+ * 1. parquet format config
+ * 2. Hadoop configuration
+ * 3. session options
+ *
+ * During serialization the default values are not serialized, to keep the serialized object smaller.
+ * Should be initialized using {@link Builder}; the constructor is made public only for ser / de purposes.
+ */
+@JsonInclude(JsonInclude.Include.NON_DEFAULT)
+public class ParquetReaderConfig {

Review comment: Is it possible to make this class immutable, so that in the `getDefaultInstance()` method we will be able to use the same object instead of instantiating a new one?

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Support filter push down for varchar / decimal data types
> -
>
> Key: DRILL-6744
> URL: https://issues.apache.org/jira/browse/DRILL-6744
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.14.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: doc-impacting
> Fix For: 1.15.0
>
> Since Drill is now using Apache Parquet 1.10.0, where the issue with incorrectly stored varchar / decimal min / max statistics is resolved, we should add support for varchar / decimal filter push down. Only files created with parquet lib 1.9.1 (1.10.0) and later will be subjected to push down. In cases when the user knows for certain that prior created files have correct min / max statistics (i.e. that the data in binary columns is in ASCII, not UTF-8), then parquet.strings.signed-min-max.enabled can be set to true to enable filter push down.
> *Description*
> _Note: Drill has been using the Parquet 1.10.0 library since version 1.13.0._
> *Varchar Partition Pruning*
> Varchar pruning will work for files generated both prior to and after Parquet 1.10.0, since to enable partition pruning both min and max values should be the same, and there are no issues with incorrectly stored statistics for binary data when min and max values are equal. Partition pruning using Drill metadata files will also work, no matter when the metadata file was created (prior to or after Drill 1.15.0).
> Partition pruning won't work for files where the partition is null due to PARQUET-1341; the issue will be fixed in Parquet 1.11.0.
> *Varchar Filter Push Down*
> Varchar filter push down will work for parquet files created with Parquet 1.10.0 and later.
> There are two options to enable push down for files generated with prior Parquet versions, when the user knows for certain that the binary data is in ASCII (not UTF-8):
> 1. Set configuration {{enableStringsSignedMinMax}} to true (false by default) for the parquet format plugin:
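The immutability question raised in this review thread can be sketched as follows. This is a hedged illustration (class and field names assumed, not Drill's actual code) of why an immutable config lets `getDefaultInstance()` hand out one shared object instead of allocating a new one per call:

```java
// Illustrative sketch only: with no setters and final fields, the default
// instance can never be mutated, so sharing a single object is safe.
public final class ImmutableConfigSketch {

  private static final ImmutableConfigSketch DEFAULT = new ImmutableConfigSketch(true);

  private final boolean autoCorrectCorruptedDates;

  private ImmutableConfigSketch(boolean autoCorrectCorruptedDates) {
    this.autoCorrectCorruptedDates = autoCorrectCorruptedDates;
  }

  // safe to return the shared instance because it is immutable
  public static ImmutableConfigSketch getDefaultInstance() {
    return DEFAULT;
  }

  public boolean autoCorrectCorruptedDates() {
    return autoCorrectCorruptedDates;
  }

  public static void main(String[] args) {
    // both calls return the very same object, avoiding repeated allocation
    System.out.println(ImmutableConfigSketch.getDefaultInstance()
        == ImmutableConfigSketch.getDefaultInstance());
  }
}
```

The trade-off is that the Jackson-facing public constructor and mutable fields used for ser / de would need to move behind a builder or a creator method.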
[jira] [Commented] (DRILL-6744) Support filter push down for varchar / decimal data types
[ https://issues.apache.org/jira/browse/DRILL-6744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686350#comment-16686350 ] ASF GitHub Bot commented on DRILL-6744:
---
vvysotskyi commented on a change in pull request #1537: DRILL-6744: Support varchar and decimal push down
URL: https://github.com/apache/drill/pull/1537#discussion_r233132202

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/stat/ParquetMetaStatCollector.java
@@ -86,30 +87,26 @@ public ParquetMetaStatCollector(ParquetTableMetadataBase parquetTableMetadata,
       columnMetadataMap.put(schemaPath, columnMetadata);
     }
-    for (final SchemaPath field : fields) {
-      final PrimitiveType.PrimitiveTypeName primitiveType;
-      final OriginalType originalType;
-
-      final ColumnMetadata columnMetadata = columnMetadataMap.get(field.getUnIndexed());
-
+    for (SchemaPath field : fields) {
+      ColumnMetadata columnMetadata = columnMetadataMap.get(field.getUnIndexed());
       if (columnMetadata != null) {
-        final Object min = columnMetadata.getMinValue();
-        final Object max = columnMetadata.getMaxValue();
-        final long numNulls = columnMetadata.getNulls() == null ? -1 : columnMetadata.getNulls();
-
-        primitiveType = this.parquetTableMetadata.getPrimitiveType(columnMetadata.getName());
-        originalType = this.parquetTableMetadata.getOriginalType(columnMetadata.getName());
-        int precision = 0;
-        int scale = 0;
+        ColumnStatisticsBuilder statisticsBuilder = ColumnStatisticsBuilder.builder()
+          .setMin(columnMetadata.getMinValue())
+          .setMax(columnMetadata.getMaxValue())
+          .setNumNulls(columnMetadata.getNulls() == null ? -1 : columnMetadata.getNulls())

Review comment: Please replace -1 with `GroupScan.NO_COLUMN_STATS`.

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Support filter push down for varchar / decimal data types
> -
>
> Key: DRILL-6744
> URL: https://issues.apache.org/jira/browse/DRILL-6744
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.14.0
> Reporter: Arina Ielchiieva
> Assignee: Arina Ielchiieva
> Priority: Major
> Labels: doc-impacting
> Fix For: 1.15.0
>
> Since Drill is now using Apache Parquet 1.10.0, where the issue with incorrectly stored varchar / decimal min / max statistics is resolved, we should add support for varchar / decimal filter push down. Only files created with parquet lib 1.9.1 (1.10.0) and later will be subjected to push down. In cases when the user knows for certain that prior created files have correct min / max statistics (i.e. that the data in binary columns is in ASCII, not UTF-8), then parquet.strings.signed-min-max.enabled can be set to true to enable filter push down.
> *Description*
> _Note: Drill has been using the Parquet 1.10.0 library since version 1.13.0._
> *Varchar Partition Pruning*
> Varchar pruning will work for files generated both prior to and after Parquet 1.10.0, since to enable partition pruning both min and max values should be the same, and there are no issues with incorrectly stored statistics for binary data when min and max values are equal. Partition pruning using Drill metadata files will also work, no matter when the metadata file was created (prior to or after Drill 1.15.0).
> Partition pruning won't work for files where the partition is null due to PARQUET-1341; the issue will be fixed in Parquet 1.11.0.
> *Varchar Filter Push Down*
> Varchar filter push down will work for parquet files created with Parquet 1.10.0 and later.
> There are two options to enable push down for files generated with prior Parquet versions, when the user knows for certain that the binary data is in ASCII (not UTF-8):
> 1. Set configuration {{enableStringsSignedMinMax}} to true (false by default) for the parquet format plugin:
> {noformat}
> "parquet" : {
>   type: "parquet",
>   enableStringsSignedMinMax: true
> }
> {noformat}
> This applies to all parquet files of a given file plugin, including all workspaces.
> 2. If the user wants to enable / disable reading binary statistics for old parquet files per session, the session option {{store.parquet.reader.strings_signed_min_max}} can be used. By default, it has an empty string value. Setting this option takes priority over the config in the parquet format plugin. The option allows three values: 'true', 'false', '' (empty
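The precedence rule described above (a non-empty session option wins over the format plugin config, while the empty default defers to it) can be sketched as a small decision function. This is a hypothetical illustration, not Drill's actual implementation:

```java
// Illustrative sketch of the described priority rule between the session
// option store.parquet.reader.strings_signed_min_max and the format plugin
// config enableStringsSignedMinMax.
public class SignedMinMaxPrioritySketch {

  static boolean readBinaryStatistics(String sessionOption, boolean pluginEnableStringsSignedMinMax) {
    if (sessionOption != null && !sessionOption.isEmpty()) {
      // 'true' or 'false' in the session option takes priority
      return Boolean.parseBoolean(sessionOption);
    }
    // '' (empty, the default) defers to the parquet format plugin config
    return pluginEnableStringsSignedMinMax;
  }

  public static void main(String[] args) {
    System.out.println(readBinaryStatistics("", true));       // plugin config applies
    System.out.println(readBinaryStatistics("false", true));  // session option overrides
    System.out.println(readBinaryStatistics("true", false));  // session option overrides
  }
}
```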
[jira] [Commented] (DRILL-6847) Add Query Metadata to RESTful Interface
[ https://issues.apache.org/jira/browse/DRILL-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686189#comment-16686189 ] ASF GitHub Bot commented on DRILL-6847:
---
arina-ielchiieva commented on a change in pull request #1539: DRILL-6847: Add Query Metadata to RESTful Interface
URL: https://github.com/apache/drill/pull/1539#discussion_r233348011

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/server/rest/WebUserConnection.java
@@ -106,7 +110,10 @@ public void sendData(RpcOutcomeListener listener, QueryWritableBatch result
       // TODO: Clean: DRILL-2933: That load(...) no longer throws
       // SchemaChangeException, so check/clean catch clause below.
       for (int i = 0; i < loader.getSchema().getFieldCount(); ++i) {
-        columns.add(loader.getSchema().getColumn(i).getName());
+
+        MaterializedField col = loader.getSchema().getColumn(i);
+        columns.add(col.getName());
+        metadata.add(col.getType().getMinorType().name());

Review comment: Some types can have precision and scale, is this information needed?

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Add Query Metadata to RESTful Interface
> ---
>
> Key: DRILL-6847
> URL: https://issues.apache.org/jira/browse/DRILL-6847
> Project: Apache Drill
> Issue Type: Improvement
> Components: Metadata
> Reporter: Charles Givre
> Assignee: Charles Givre
> Priority: Minor
>
> The Drill RESTful interface does not return the structure of the query results. This makes integrating Drill with other BI tools difficult because they do not know what kind of data to expect.
> This PR adds a new section to the results called Metadata which contains a list of the minor types of all the columns returned.
> The query below will now return the following in the RESTful interface:
> {code:sql}
> SELECT CAST( employee_id AS INT) AS employee_id,
>        full_name,
>        first_name,
>        last_name,
>        CAST( position_id AS BIGINT) AS position_id,
>        position_title
> FROM cp.`employee.json` LIMIT 2
> {code}
> {code}
> {
>   "queryId": "2414bf3f-b4f4-d4df-825f-73dfb3a56681",
>   "columns": [
>     "employee_id",
>     "full_name",
>     "first_name",
>     "last_name",
>     "position_id",
>     "position_title"
>   ],
>   "metadata": [
>     "INT",
>     "VARCHAR",
>     "VARCHAR",
>     "VARCHAR",
>     "BIGINT",
>     "VARCHAR"
>   ],
>   "rows": [
>     {
>       "full_name": "Sheri Nowmer",
>       "employee_id": "1",
>       "last_name": "Nowmer",
>       "position_title": "President",
>       "first_name": "Sheri",
>       "position_id": "1"
>     },
>     {
>       "full_name": "Derrick Whelply",
>       "employee_id": "2",
>       "last_name": "Whelply",
>       "position_title": "VP Country Manager",
>       "first_name": "Derrick",
>       "position_id": "2"
>     }
>   ]
> }
> {code}
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
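Because the new "metadata" array is parallel to "columns" in the REST response, a client can pair the two lists up into a name-to-type map. A small consumer-side sketch (not part of the PR; the sample values come from the response above):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative client-side sketch: zip the parallel "columns" and "metadata"
// arrays from the REST response into a column-name -> minor-type map.
public class RestMetadataZipSketch {
  public static void main(String[] args) {
    // values as they would arrive in the parsed JSON response
    List<String> columns = Arrays.asList("employee_id", "full_name", "position_id");
    List<String> metadata = Arrays.asList("INT", "VARCHAR", "BIGINT");

    Map<String, String> columnTypes = new LinkedHashMap<>();
    for (int i = 0; i < columns.size(); i++) {
      columnTypes.put(columns.get(i), metadata.get(i));
    }
    System.out.println(columnTypes);
  }
}
```

A BI tool integrating with Drill could build such a map once per result set and use it to choose column renderers or casts.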