[jira] [Commented] (DRILL-2282) Eliminate spaces, special characters from names in function templates
[ https://issues.apache.org/jira/browse/DRILL-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140298#comment-15140298 ]

Mehant Baid commented on DRILL-2282:
------------------------------------

[~parthc] There was a specific issue with the 'similar' function, noted in [DRILL-1496|https://issues.apache.org/jira/browse/DRILL-1496], that was fixed; this is a more generic JIRA to make sure we don't run into a similar issue elsewhere. If I recall correctly, there was a problem deserializing the plan fragment if the serialized expression contained a space.

> Eliminate spaces, special characters from names in function templates
> ----------------------------------------------------------------------
>
>                 Key: DRILL-2282
>                 URL: https://issues.apache.org/jira/browse/DRILL-2282
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>            Reporter: Mehant Baid
>            Assignee: Vitalii Diravka
>             Fix For: 1.6.0
>
>         Attachments: DRILL-2282.patch
>
>
> Having spaces in function names causes issues while deserializing such
> expressions when we try to read the plan fragment. As part of this JIRA we
> would like to clean up all the templates so that their names contain no
> special characters.
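To make the naming rule concrete, here is a minimal sketch of a function template whose registered name contains no spaces or special characters. It uses Drill's UDF annotations, but the function name (similar_to), class, and eval body are hypothetical stand-ins, not the actual 'similar' implementation:

{code}
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.BitHolder;
import org.apache.drill.exec.expr.holders.VarCharHolder;

@FunctionTemplate(
    name = "similar_to",   // underscore instead of a space in the registered name
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)
public class SimilarToFunc implements DrillSimpleFunc {
  @Param VarCharHolder input;
  @Param VarCharHolder pattern;
  @Output BitHolder out;

  public void setup() { }

  public void eval() {
    // A real implementation would evaluate the SQL SIMILAR TO pattern; the
    // body is elided since only the name handling matters for this JIRA.
    out.value = 1;
  }
}
{code}

Because the template name is serialized into expressions inside the plan fragment, a name with no spaces round-trips through serialization and deserialization without ambiguity.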
[jira] [Updated] (DRILL-4235) Hit IllegalStateException when exec.queue.enable=true
[ https://issues.apache.org/jira/browse/DRILL-4235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse updated DRILL-4235:
-----------------------------------
    Fix Version/s:     (was: 1.6.0)
                   1.5.0

> Hit IllegalStateException when exec.queue.enable=true
> ------------------------------------------------------
>
>                 Key: DRILL-4235
>                 URL: https://issues.apache.org/jira/browse/DRILL-4235
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.5.0
>         Environment: git.commit.id=6dea429949a3d6a68aefbdb3d78de41e0955239b
>            Reporter: Dechang Gu
>            Assignee: Deneche A. Hakim
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> 0: jdbc:drill:schema=dfs.parquet> select * from sys.options;
> Error: SYSTEM ERROR: IllegalStateException: Failure trying to change states:
> ENQUEUED --> RUNNING
> [Error Id: 6ac8167c-6fb7-4274-9e5c-bf62a195c06e on ucs-node5.perf.lab:31010]
> (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception
> during fragment initialization: Exceptions caught during event processing
>     org.apache.drill.exec.work.foreman.Foreman.run():261
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745
>   Caused By (java.lang.RuntimeException) Exceptions caught during event processing
>     org.apache.drill.common.EventProcessor.sendEvent():93
>     org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState():792
>     org.apache.drill.exec.work.foreman.Foreman.moveToState():909
>     org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan():420
>     org.apache.drill.exec.work.foreman.Foreman.runSQL():926
>     org.apache.drill.exec.work.foreman.Foreman.run():250
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745
>   Caused By (java.lang.IllegalStateException) Failure trying to change states: ENQUEUED --> RUNNING
>     org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent():896
>     org.apache.drill.exec.work.foreman.Foreman$StateSwitch.processEvent():790
>     org.apache.drill.common.EventProcessor.sendEvent():73
>     org.apache.drill.exec.work.foreman.Foreman$StateSwitch.moveToState():792
>     org.apache.drill.exec.work.foreman.Foreman.moveToState():909
>     org.apache.drill.exec.work.foreman.Foreman.runPhysicalPlan():420
>     org.apache.drill.exec.work.foreman.Foreman.runSQL():926
>     org.apache.drill.exec.work.foreman.Foreman.run():250
>     java.util.concurrent.ThreadPoolExecutor.runWorker():1145
>     java.util.concurrent.ThreadPoolExecutor$Worker.run():615
>     java.lang.Thread.run():745 (state=,code=0)
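For readers unfamiliar with the error, below is a simplified, hypothetical model of the kind of table-driven state check that produces this message. It is not Foreman's actual state machine; the states and allowed transitions are illustrative only:

{code}
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class StateDemo {
    enum State { ENQUEUED, STARTING, RUNNING, COMPLETED, FAILED }

    // Transition table: in this model ENQUEUED must pass through STARTING
    // before RUNNING, so a direct ENQUEUED --> RUNNING move is rejected.
    static final Map<State, EnumSet<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.ENQUEUED, EnumSet.of(State.STARTING, State.FAILED));
        ALLOWED.put(State.STARTING, EnumSet.of(State.RUNNING, State.FAILED));
        ALLOWED.put(State.RUNNING, EnumSet.of(State.COMPLETED, State.FAILED));
    }

    static State moveTo(State from, State to) {
        if (!ALLOWED.getOrDefault(from, EnumSet.noneOf(State.class)).contains(to)) {
            throw new IllegalStateException(
                "Failure trying to change states: " + from + " --> " + to);
        }
        return to;
    }

    public static void main(String[] args) {
        try {
            moveTo(State.ENQUEUED, State.RUNNING);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // mirrors the error in the report
        }
    }
}
{code}

In such a model the bug would be in the caller attempting the disallowed transition when queuing is enabled, not in the check itself.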
[jira] [Closed] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)
[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse closed DRILL-4349.
----------------------------------
    Resolution: Fixed
      Reviewer:   (was: Chun Chang)

> parquet reader returns wrong results when reading a nullable column that
> starts with a large number of nulls (>30k)
> --------------------------------------------------------------------------
>
>                 Key: DRILL-4349
>                 URL: https://issues.apache.org/jira/browse/DRILL-4349
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.4.0
>            Reporter: Deneche A. Hakim
>            Assignee: Jason Altekruse
>            Priority: Critical
>             Fix For: 1.5.0
>
>         Attachments: drill4349.tar.gz
>
>
> While reading a nullable column, if in a single pass we only read null
> values, the parquet reader resets the value of pageReader.readPosInBytes,
> which leads to wrong data being read from the file.
> To reproduce the issue, create a csv file (repro.csv) with 2 columns (id,
> val) and 50100 rows, where id equals the row number and val is empty for
> the first 50k rows and equal to id for the remaining rows.
> Create a parquet table from the csv file:
> {noformat}
> CREATE TABLE `repro_parquet` AS SELECT CAST(columns[0] AS INT) AS id,
> CAST(NULLIF(columns[1], '') AS DOUBLE) AS val from `repro.csv`;
> {noformat}
> Now if you query any of the non-null values you will get wrong results:
> {noformat}
> 0: jdbc:drill:zk=local> select * from `repro_parquet` where id>=50000 limit 10;
> +--------+---------------------------+
> |   id   |            val            |
> +--------+---------------------------+
> | 50000  | 9.11337776337441E-309     |
> | 50001  | 3.26044E-319              |
> | 50002  | 1.4916681476489723E-154   |
> | 50003  | 2.18890676                |
> | 50004  | 2.681561588521345E154     |
> | 50005  | -2.1016574E-317           |
> | 50006  | -1.4916681476489723E-154  |
> | 50007  | -2.18890676               |
> | 50008  | -2.681561588521345E154    |
> | 50009  | 2.1016574E-317            |
> +--------+---------------------------+
> 10 rows selected (0.238 seconds)
> {noformat}
> and here are the expected values:
> {noformat}
> 0: jdbc:drill:zk=local> select * from `repro.csv` where cast(columns[0] as int)>=50000 limit 10;
> +--------------------+
> |      columns       |
> +--------------------+
> | ["50000","50000"]  |
> | ["50001","50001"]  |
> | ["50002","50002"]  |
> | ["50003","50003"]  |
> | ["50004","50004"]  |
> | ["50005","50005"]  |
> | ["50006","50006"]  |
> | ["50007","50007"]  |
> | ["50008","50008"]  |
> | ["50009","50009"]  |
> +--------------------+
> {noformat}
> I confirmed that the file is written correctly and the issue is in the
> parquet reader (I already have a fix for it).
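A small generator for the repro.csv described above may save readers a step; the layout (50100 rows, val empty through the first 50k) follows the report, while the generator code itself is just an illustrative sketch:

{code}
import java.io.IOException;
import java.io.PrintWriter;

public class ReproCsv {
    public static void main(String[] args) throws IOException {
        try (PrintWriter out = new PrintWriter("repro.csv")) {
            for (int id = 1; id <= 50100; id++) {
                // val is empty for the first 50k rows, then equal to id.
                out.println(id + "," + (id <= 50000 ? "" : String.valueOf(id)));
            }
        }
    }
}
{code}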
[jira] [Updated] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)
[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse updated DRILL-4349:
-----------------------------------
    Fix Version/s:     (was: 1.6.0)
                   1.5.0
[jira] [Reopened] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)
[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse reopened DRILL-4349:
------------------------------------
    Assignee: Jason Altekruse  (was: Deneche A. Hakim)
[jira] [Commented] (DRILL-4349) parquet reader returns wrong results when reading a nullable column that starts with a large number of nulls (>30k)
[ https://issues.apache.org/jira/browse/DRILL-4349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15140236#comment-15140236 ]

Jason Altekruse commented on DRILL-4349:
----------------------------------------

As I was rolling the rc3 release candidate for 1.5.0, I decided to apply this fix to the release branch, as it seemed useful to get into the release. The commit hash will be different, but the patch applied cleanly and represents an identical diff.
[jira] [Resolved] (DRILL-4230) NullReferenceException when SELECTing from empty mongo collection
[ https://issues.apache.org/jira/browse/DRILL-4230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse resolved DRILL-4230.
------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.5.0

Fixed in ed2f1ca8ed3c0ebac7e33494db6749851fc2c970

This was applied separately to the 1.5 release branch, so the commit there has identical content and the same commit message, but will have a different hash.

> NullReferenceException when SELECTing from empty mongo collection
> ------------------------------------------------------------------
>
>                 Key: DRILL-4230
>                 URL: https://issues.apache.org/jira/browse/DRILL-4230
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - MongoDB
>    Affects Versions: 1.3.0
>            Reporter: Brick Shitting Bird Jr.
>            Assignee: Jason Altekruse
>             Fix For: 1.5.0
[jira] [Updated] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Altekruse updated DRILL-4380:
-----------------------------------
    Fix Version/s: 1.5.0

> Fix performance regression: in creation of FileSelection in
> ParquetFormatPlugin to not set files if metadata cache is available.
> --------------------------------------------------------------------------
>
>                 Key: DRILL-4380
>                 URL: https://issues.apache.org/jira/browse/DRILL-4380
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Parth Chandra
>             Fix For: 1.5.0
>
>
> The regression was caused by the changes in
> 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over
> empty folders consistently so that they report table not found rather than
> failing.)
> In ParquetFormatPlugin, the original code created a FileSelection object in
> the following code:
> {code}
> return new FileSelection(fileNames, metaRootPath.toString(), metadata,
>     selection.getFileStatusList(fs));
> {code}
> The selection.getFileStatusList call made an inexpensive call to
> FileSelection.init(). The call was inexpensive because the
> FileSelection.files member was not set, so the code did not need to make an
> expensive call to get the file statuses corresponding to the files in the
> FileSelection.files member.
> In the new code, this is replaced by
> {code}
> final FileSelection newSelection = FileSelection.create(null, fileNames,
>     metaRootPath.toString());
> return ParquetFileSelection.create(newSelection, metadata);
> {code}
> This sets the FileSelection.files member but not the FileSelection.statuses
> member. A subsequent call to FileSelection.getStatuses (in
> ParquetGroupScan()) then makes an expensive call to get all the statuses.
> It appears that there was an implicit assumption that the
> FileSelection.statuses member should be set before the FileSelection.files
> member is set. This assumption is no longer true.
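To see why which members are set at construction time matters, here is a simplified, self-contained model of the lazy-status behaviour described in this report. It is a hypothetical stand-in, not Drill's FileSelection class:

{code}
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LazySelectionDemo {
    private List<String> statuses;       // cached "file statuses" (stand-in type)
    private final List<String> files;

    LazySelectionDemo(List<String> statuses, List<String> files) {
        this.statuses = statuses;
        this.files = files;
    }

    List<String> getStatuses() {
        if (statuses == null) {
            // Expensive path: recompute a status for every file. In the real
            // code this means one filesystem stat per file, which is the
            // regression when thousands of parquet files are selected.
            statuses = files.stream()
                            .map(f -> "status:" + f)
                            .collect(Collectors.toList());
        }
        // Cheap path: statuses were handed in at construction time.
        return statuses;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.parquet", "b.parquet");
        System.out.println(new LazySelectionDemo(null, files).getStatuses());
    }
}
{code}

The fix discussed in the review comments below follows the cheap path: pass the already-known statuses into the new selection instead of null.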
[jira] [Commented] (DRILL-4184) Drill does not support Parquet DECIMAL values in variable length BINARY fields
[ https://issues.apache.org/jira/browse/DRILL-4184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139970#comment-15139970 ]

ASF GitHub Bot commented on DRILL-4184:
---------------------------------------

GitHub user daveoshinsky opened a pull request:

    https://github.com/apache/drill/pull/372

    DRILL-4184: support variable length decimal fields in parquet

    Support decimal fields in parquet that are stored as variable length BINARY. Parquet files that store decimal values this way are often significantly smaller than ones storing decimal values as FIXED_LEN_BYTE_ARRAY's (full precision).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/daveoshinsky/drill master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/372.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #372

commit 9a47ca52125139d88adf39b5d894a02f870f37d9
Author: U-COMMVAULT-NJ\doshinsky
Date:   2016-02-09T22:37:47Z

    DRILL-4184: support variable length decimal fields in parquet

commit dec00a808c99554f008e23fd21b944b858aa9ae0
Author: daveoshinsky
Date:   2016-02-09T22:56:28Z

    DRILL-4184: changes to support variable length decimal fields in parquet

> Drill does not support Parquet DECIMAL values in variable length BINARY fields
> --------------------------------------------------------------------------------
>
>                 Key: DRILL-4184
>                 URL: https://issues.apache.org/jira/browse/DRILL-4184
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.4.0
>         Environment: Windows 7 Professional, Java 1.8.0_66
>            Reporter: Dave Oshinsky
>
> Encoding a DECIMAL logical type in Parquet using the variable length BINARY
> primitive type is not supported by Drill as of versions 1.3.0 and 1.4.0. The
> problem first surfaces with the ClassCastException shown below, but fixing
> the immediate cause of the exception is not sufficient to support this
> combination (DECIMAL, BINARY) in a Parquet file.
> In Drill, DECIMAL is currently assumed to be INT32, INT64, INT96, or
> FIXED_LEN_BYTE_ARRAY. Are there any plans to support DECIMAL with variable
> length BINARY? Avro definitely supports encoding DECIMAL in variable length
> bytes (see https://avro.apache.org/docs/current/spec.html#Decimal), but this
> support in Parquet is less clear.
> Selecting on a BINARY DECIMAL field in a parquet file throws an exception as
> shown below (java.lang.ClassCastException:
> org.apache.drill.exec.vector.Decimal28SparseVector cannot be cast to
> org.apache.drill.exec.vector.VariableWidthVector). The successful query at
> bottom selected on a string field in the same file.
> 0: jdbc:drill:zk=local> select count(*) from
> dfs.`c:/dao/DBArchivePredictor/tenrows.parquet` where acct_no=7020;
> org.apache.drill.common.exceptions.DrillRuntimeException: Error in parquet
> record reader.
> Message: Failure in setting up reader
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message sbi.acct_mstr {
>   required binary ACCT_NO (DECIMAL(20,0));
>   optional binary SF_NO (UTF8);
>   optional binary LF_NO (UTF8);
>   optional binary BRANCH_NO (DECIMAL(20,0));
>   optional binary INTRO_CUST_NO (DECIMAL(20,0));
>   optional binary INTRO_ACCT_NO (DECIMAL(20,0));
>   optional binary INTRO_SIGN (UTF8);
>   optional binary TYPE (UTF8);
>   optional binary OPR_MODE (UTF8);
>   optional binary CUR_ACCT_TYPE (UTF8);
>   optional binary TITLE (UTF8);
>   optional binary CORP_CUST_NO (DECIMAL(20,0));
>   optional binary APLNDT (UTF8);
>   optional binary OPNDT (UTF8);
>   optional binary VERI_EMP_NO (DECIMAL(20,0));
>   optional binary VERI_SIGN (UTF8);
>   optional binary MANAGER_SIGN (UTF8);
>   optional binary CURBAL (DECIMAL(8,2));
>   optional binary STATUS (UTF8);
> }
> , metadata:
> {parquet.avro.schema={"type":"record","name":"acct_mstr","namespace":"sbi","fields":[{"name":"ACCT_NO","type":{"type":"bytes","logicalType":"decimal","precision":20,"scale":0,"cv_auto_incr":false,"cv_case_sensitive":false,"cv_column_class":"java.math.BigDecimal","cv_connection":"oracle.jdbc.driver.T4CConnection","cv_currency":true,"cv_def_writable":false,"cv_nullable":0,"cv_precision":20,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript":1,"cv_type":2,"cv_typename":"NUMBER","cv_writable":true}},{"name":"SF_NO","type":["null",{"type":"string","cv_auto_incr":false,"cv_case_sensitive":true,"cv_column_class":"java.lang.String","cv_currency":false,"cv_def_writable":false,"cv_nullable":1,"cv_precision":10,"cv_read_only":false,"cv_scale":0,"cv_searchable":true,"cv_signed":true,"cv_subscript":2,"cv_type":12,"cv_typena
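For background on the encoding at issue, below is a minimal, self-contained sketch of how a DECIMAL value maps to a variable-length BINARY: the unscaled value stored as a minimal-width big-endian two's-complement byte array, per the Parquet/Avro decimal specs linked above. This illustrates the format only; it is not Drill's reader code:

{code}
import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalBinaryDemo {
    public static void main(String[] args) {
        BigDecimal value = new BigDecimal("1234.56"); // precision 6, scale 2
        // Unscaled value 123456 encoded in as few bytes as possible (3 here),
        // versus a FIXED_LEN_BYTE_ARRAY sized for the declared precision.
        byte[] binary = value.unscaledValue().toByteArray();
        System.out.println("variable-length bytes: " + binary.length);
        // Decoding reverses the process, reapplying the declared scale.
        BigDecimal decoded = new BigDecimal(new BigInteger(binary), 2);
        System.out.println("decoded: " + decoded);
    }
}
{code}

This is why such files are often smaller: small values take only a few bytes instead of a fixed-width slot sized for the worst case.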
[jira] [Assigned] (DRILL-4281) Drill should support inbound impersonation
[ https://issues.apache.org/jira/browse/DRILL-4281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudheesh Katkam reassigned DRILL-4281:
--------------------------------------
    Assignee: Sudheesh Katkam

> Drill should support inbound impersonation
> --------------------------------------------
>
>                 Key: DRILL-4281
>                 URL: https://issues.apache.org/jira/browse/DRILL-4281
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Keys Botzum
>            Assignee: Sudheesh Katkam
>              Labels: security
>
> Today Drill supports impersonation *to* external sources. For example, I can
> authenticate to Drill as myself and then Drill will access HDFS using
> impersonation.
> In many scenarios we also need impersonation to Drill. For example, I might
> use some front-end tool (such as Tableau) and authenticate to it as myself.
> That tool (server version) then needs to access Drill to perform queries, and
> I want those queries to run as myself, not as the Tableau user. While in
> theory the intermediate tool could store the userid & password of every user
> of Drill, this isn't a scalable or very secure solution.
> Note that HS2 today does support inbound impersonation, as described here:
> https://issues.apache.org/jira/browse/HIVE-5155
> The above is not the best approach, as it is tied to the connection object,
> which is very coarse-grained and potentially expensive. It would be better if
> there were a call on the ODBC/JDBC driver to switch the identity on an
> existing connection. Most modern SQL databases (Oracle, DB2) support such a
> function.
[jira] [Commented] (DRILL-4345) Hive Native Reader reporting wrong results for timestamp column in hive generated parquet file
[ https://issues.apache.org/jira/browse/DRILL-4345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139927#comment-15139927 ]

Rahul Challapalli commented on DRILL-4345:
------------------------------------------

The underlying parquet file is generated using hive

> Hive Native Reader reporting wrong results for timestamp column in hive
> generated parquet file
> --------------------------------------------------------------------------
>
>                 Key: DRILL-4345
>                 URL: https://issues.apache.org/jira/browse/DRILL-4345
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Hive, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Priority: Critical
>         Attachments: hive1_fewtypes_null.parquet
>
>
> git.commit.id.abbrev=1b96174
> Below you can see different results returned from the hive plugin and the
> native reader for the same table.
> {code}
> 0: jdbc:drill:zk=10.10.100.190:5181> use hive;
> +-------+-----------------------------------+
> |  ok   |              summary              |
> +-------+-----------------------------------+
> | true  | Default schema changed to [hive]  |
> +-------+-----------------------------------+
> 1 row selected (0.415 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select int_col, timestamp_col from hive1_fewtypes_null_parquet;
> +----------+------------------------+
> | int_col  |     timestamp_col      |
> +----------+------------------------+
> | 1        | null                   |
> | null     | 1997-01-02 00:00:00.0  |
> | 3        | null                   |
> | 4        | null                   |
> | 5        | 1997-02-10 17:32:00.0  |
> | 6        | 1997-02-11 17:32:01.0  |
> | 7        | 1997-02-12 17:32:01.0  |
> | 8        | 1997-02-13 17:32:01.0  |
> | 9        | null                   |
> | 10       | 1997-02-15 17:32:01.0  |
> | null     | 1997-02-16 17:32:01.0  |
> | 12       | 1897-02-18 17:32:01.0  |
> | 13       | 2002-02-14 17:32:01.0  |
> | 14       | 1991-02-10 17:32:01.0  |
> | 15       | 1900-02-16 17:32:01.0  |
> | 16       | null                   |
> | null     | 1897-02-16 17:32:01.0  |
> | 18       | 1997-02-16 17:32:01.0  |
> | null     | null                   |
> | 20       | 1996-02-28 17:32:01.0  |
> | null     | null                   |
> +----------+------------------------+
> 21 rows selected (0.368 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> alter session set `store.hive.optimize_scan_with_native_readers` = true;
> +-------+---------------------------------------------------------+
> |  ok   |                         summary                         |
> +-------+---------------------------------------------------------+
> | true  | store.hive.optimize_scan_with_native_readers updated.  |
> +-------+---------------------------------------------------------+
> 1 row selected (0.213 seconds)
> 0: jdbc:drill:zk=10.10.100.190:5181> select int_col, timestamp_col from hive1_fewtypes_null_parquet;
> +----------+------------------------+
> | int_col  |     timestamp_col      |
> +----------+------------------------+
> | 1        | null                   |
> | null     | 1997-01-02 00:00:00.0  |
> | 3        | 1997-02-10 17:32:00.0  |
> | 4        | null                   |
> | 5        | 1997-02-11 17:32:01.0  |
> | 6        | 1997-02-12 17:32:01.0  |
> | 7        | 1997-02-13 17:32:01.0  |
> | 8        | 1997-02-15 17:32:01.0  |
> | 9        | 1997-02-16 17:32:01.0  |
> | 10       | 1900-02-16 17:32:01.0  |
> | null     | 1897-02-16 17:32:01.0  |
> | 12       | 1997-02-16 17:32:01.0  |
> | 13       | 1996-02-28 17:32:01.0  |
> | 14       | 1997-01-02 00:00:00.0  |
> | 15       | 1997-01-02 00:00:00.0  |
> | 16       | 1997-01-02 00:00:00.0  |
> | null     | 1997-01-02 00:00:00.0  |
> | 18       | 1997-01-02 00:00:00.0  |
> | null     | 1997-01-02 00:00:00.0  |
> | 20       | 1997-01-02 00:00:00.0  |
> | null     | 1997-01-02 00:00:00.0  |
> +----------+------------------------+
> 21 rows selected (0.352 seconds)
> {code}
> DDL for the hive table:
> {code}
> create external table hive1_fewtypes_null_parquet (
>   int_col int,
>   bigint_col bigint,
>   date_col string,
>   time_col string,
>   timestamp_col timestamp,
>   interval_col string,
>   varchar_col string,
>   float_col float,
>   double_col double,
>   bool_col boolean
> )
> stored as parquet
> location '/drill/testdata/hive_storage/hive1_fewtypes_null';
> {code}
> Attached is the underlying parquet file
[jira] [Commented] (DRILL-4363) Apply row count based pruning for parquet table in LIMIT n query
[ https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139905#comment-15139905 ]

Jinfeng Ni commented on DRILL-4363:
-----------------------------------

[~amansinha100], could you please review the PR for DRILL-4363? Thanks!

> Apply row count based pruning for parquet table in LIMIT n query
> ------------------------------------------------------------------
>
>                 Key: DRILL-4363
>                 URL: https://issues.apache.org/jira/browse/DRILL-4363
>             Project: Apache Drill
>          Issue Type: Improvement
>            Reporter: Jinfeng Ni
>            Assignee: Aman Sinha
>             Fix For: 1.6.0
>
>
> In the interactive data exploration use case, one common and probably first
> query that users run is "SELECT * FROM table LIMIT n", where n is a small
> number. Such a query gives the user an idea of the columns in the table.
> Normally, the user expects such a query to complete in a very short time,
> since it asks for only a small number of rows, without any sort/aggregation.
> When the table is small, there is no big problem for Drill. However, when
> the table is extremely large, Drill's response time is not as fast as users
> would expect.
> In the case of a parquet table, the query planner could do a better job by
> applying row count based pruning for such a LIMIT n query. The pruning is
> similar to what partition pruning does, except that it uses row counts
> instead of partition column values. Since the row count is available in
> parquet metadata, it's possible to do such pruning.
> The benefits of such pruning are clear: 1) for small "n", pruning ends up
> with a few parquet files, instead of thousands, or millions, of files to
> scan; 2) execution probably does not have to put the scan into multiple
> minor fragments and start reading the files concurrently, which would cause
> big IO overhead; 3) the physical plan itself is much smaller, since it does
> not include the long list of parquet files, reducing the RPC cost of sending
> the fragment plans to multiple drillbits and the overhead of
> serializing/deserializing the fragment plans.
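A minimal sketch of the pruning idea, assuming per-file row counts are available from parquet metadata; the class and method names are hypothetical, not Drill's planner code:

{code}
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LimitPruningDemo {
    // Keep only as many files as needed to satisfy LIMIT n (no filter).
    static List<String> pruneForLimit(Map<String, Long> fileRowCounts, long limit) {
        List<String> kept = new ArrayList<>();
        long rows = 0;
        for (Map.Entry<String, Long> e : fileRowCounts.entrySet()) {
            kept.add(e.getKey());
            rows += e.getValue();
            if (rows >= limit) {
                break; // the first n rows are already covered; prune the rest
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("part-0.parquet", 1000L);
        counts.put("part-1.parquet", 1000L);
        counts.put("part-2.parquet", 1000L);
        // LIMIT 1 needs only the first file; the other two are pruned.
        System.out.println(pruneForLimit(counts, 1)); // [part-0.parquet]
    }
}
{code}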
[jira] [Updated] (DRILL-4363) Apply row count based pruning for parquet table in LIMIT n query
[ https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinfeng Ni updated DRILL-4363:
------------------------------
    Assignee: Aman Sinha  (was: Jinfeng Ni)
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139888#comment-15139888 ]

ASF GitHub Bot commented on DRILL-4380:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/369
[jira] [Resolved] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Parth Chandra resolved DRILL-4380.
----------------------------------
    Resolution: Fixed

Fixed in 7bfcb40a0ffa49a1ed27e1ff1f57378aa1136bbd. Also see DRILL-4381
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139872#comment-15139872 ]

ASF GitHub Bot commented on DRILL-4380:
---------------------------------------

Github user hnfgns commented on a diff in the pull request:

    https://github.com/apache/drill/pull/369#discussion_r52383230

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
    @@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
           // /a/b/c.parquet and the format of the selection root must match that of the file names
           // otherwise downstream operations such as partition pruning can break.
           final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
    -      final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
    +      final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
    --- End diff --

    Filed DRILL-4381. Thanks.
[jira] [Created] (DRILL-4381) Replace direct uses of FileSelection c'tor with create()
Hanifi Gunes created DRILL-4381:
-----------------------------------

             Summary: Replace direct uses of FileSelection c'tor with create()
                 Key: DRILL-4381
                 URL: https://issues.apache.org/jira/browse/DRILL-4381
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Hanifi Gunes
            Assignee: Hanifi Gunes

We should avoid direct creation of FileSelection. This issue proposes either a re-design or removal of the places where the FileSelection c'tor is used directly. We also need more documentation around the FileSelection abstraction.
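For illustration, a minimal sketch of the factory-method convention this issue asks for, using hypothetical names: keeping the constructor non-public forces all creation through create(), which gives one place to validate inputs and centralize caching decisions:

{code}
public final class Selection {
    private final String root;

    // Not reachable from outside: callers must go through create().
    private Selection(String root) {
        this.root = root;
    }

    public static Selection create(String root) {
        if (root == null || root.isEmpty()) {
            throw new IllegalArgumentException("root must be non-empty");
        }
        return new Selection(root);
    }

    public String root() {
        return root;
    }
}
{code}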
[jira] [Commented] (DRILL-4363) Apply row count based pruning for parquet table in LIMIT n query
[ https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139864#comment-15139864 ]

Jinfeng Ni commented on DRILL-4363:
-----------------------------------

Did some performance comparison with two datasets:

1. Dataset containing 300 parquet files, with 46M rows in total. Each parquet file has about 2000 columns. That's the wide table use case. Without the patch, the query will crash drillbits in the cluster, since the query is executed in multiple minor fragments for the Scan operator, and each minor scan fragment will use around 500M ~ 1GB of memory. With the patch, the query completed in under 30 seconds, with warm cache.

2. Dataset containing 115 small parquet files. The files were created from the TPCH lineitem table.

Without patch:
{code}
select * from dfs.`/Users/jni/work/data/tpch-sf10/lineitem115k` limit 1;
1 row selected (34.165 seconds)
{code}

With patch:
{code}
select * from dfs.`/Users/jni/work/data/tpch-sf10/lineitem115k` limit 1;
1 row selected (14.021 seconds)
{code}

Basically, it reduces the time from 34 seconds to 14 seconds with warm cache.
[jira] [Commented] (DRILL-4363) Apply row count based pruning for parquet table in LIMIT n query
[ https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139860#comment-15139860 ]

ASF GitHub Bot commented on DRILL-4363:
---------------------------------------

GitHub user jinfengni opened a pull request:

    https://github.com/apache/drill/pull/371

    DRILL-4363: Row count based pruning for parquet table used in Limit n… … query.

    Modify two existing unit testcases:
    1) TestPartitionFilter.testMainQueryFalseCondition(): rowCount pruning applied after false condition is transformed into LIMIT 0
    2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the testcase to use a Json source, so that it does not mix with PushLimitIntoScanRule.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jinfengni/incubator-drill DRILL-4363

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/371.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #371

commit a84d61fe2b820fe8395e73347dfb0e2986ed9dd0
Author: Jinfeng Ni
Date:   2016-02-02T23:31:47Z

    DRILL-4363: Row count based pruning for parquet table used in Limit n query.

    Modify two existing unit testcases:
    1) TestPartitionFilter.testMainQueryFalseCondition(): rowCount pruning applied after false condition is transformed into LIMIT 0
    2) TestLimitWithExchanges.testPushLimitPastUnionExchange(): modify the testcase to use a Json source, so that it does not mix with PushLimitIntoScanRule.
[jira] [Updated] (DRILL-4363) Apply row count based pruning for parquet table in LIMIT n query
[ https://issues.apache.org/jira/browse/DRILL-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinfeng Ni updated DRILL-4363:
------------------------------
    Fix Version/s: 1.6.0
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139840#comment-15139840 ]

ASF GitHub Bot commented on DRILL-4380:
---------------------------------------

Github user parthchandra commented on a diff in the pull request:

    https://github.com/apache/drill/pull/369#discussion_r52379092

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
    @@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
           // /a/b/c.parquet and the format of the selection root must match that of the file names
           // otherwise downstream operations such as partition pruning can break.
           final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
    -      final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
    +      final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
    --- End diff --

    Agreed. Hanifi made the change to the API initially with exactly that in mind. This patch reverts that partially. He's logging a new JIRA to fix the API and document usage.
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139809#comment-15139809 ]

ASF GitHub Bot commented on DRILL-4380:
---------------------------------------

Github user jacques-n commented on a diff in the pull request:

    https://github.com/apache/drill/pull/369#discussion_r52376878

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java ---
    @@ -183,12 +194,16 @@ private static String buildPath(final String[] path, final int folderIndex) {
       }

       public static FileSelection create(final DrillFileSystem fs, final String parent, final String path) throws IOException {
    +    Stopwatch timer = Stopwatch.createStarted();
         final Path combined = new Path(parent, removeLeadingSlash(path));
         final FileStatus[] statuses = fs.globStatus(combined);
         if (statuses == null) {
           return null;
         }
    -    return create(Lists.newArrayList(statuses), null, combined.toUri().toString());
    +    final FileSelection fileSel = create(Lists.newArrayList(statuses), null, combined.toUri().toString());
    +    logger.info("FileSelection.create() took {} ms ", timer.elapsed(TimeUnit.MILLISECONDS));
    --- End diff --

    INFO => DEBUG
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139830#comment-15139830 ]

ASF GitHub Bot commented on DRILL-4380:
---------------------------------------

Github user hnfgns commented on a diff in the pull request:

    https://github.com/apache/drill/pull/369#discussion_r52378537

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java ---
    @@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio
           // /a/b/c.parquet and the format of the selection root must match that of the file names
           // otherwise downstream operations such as partition pruning can break.
           final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath());
    -      final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
    +      final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString());
    --- End diff --

    The whole point of making this c'tor non-public was to centralize creation via FileSelection.create(...). Looks like we need more explicit comments over here. For this patch, a public c'tor seems not required either; FileSelection.create(selections, null, root) should do the trick.
[jira] [Commented] (DRILL-3623) Hive query hangs with limit 0 clause
[ https://issues.apache.org/jira/browse/DRILL-3623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139835#comment-15139835 ] ASF GitHub Bot commented on DRILL-3623: --- Github user sudheeshkatkam commented on the pull request: https://github.com/apache/drill/pull/364#issuecomment-182089660 @StevenMPhillips This is still WIP, right? @hsuanyi and I plan to post an update soon. > Hive query hangs with limit 0 clause > > > Key: DRILL-3623 > URL: https://issues.apache.org/jira/browse/DRILL-3623 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Hive >Affects Versions: 1.1.0 > Environment: MapR cluster >Reporter: Andries Engelbrecht >Assignee: Sudheesh Katkam > Fix For: Future > > > Running a select * from hive.table limit 0 does not return (hangs). > Select * from hive.table limit 1 works fine. > The Hive table is about 6 GB with 330 Parquet files using Snappy compression. > Data types are int, bigint, string and double. > Querying the directory of Parquet files through the DFS plugin works fine: > select * from dfs.root.`/user/hive/warehouse/database/table` limit 0; -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139810#comment-15139810 ] ASF GitHub Bot commented on DRILL-4380: --- Github user jacques-n commented on the pull request: https://github.com/apache/drill/pull/369#issuecomment-182081851 Other than INFO => DEBUG, +1 > Fix performance regression: in creation of FileSelection in > ParquetFormatPlugin to not set files if metadata cache is available. > > > Key: DRILL-4380 > URL: https://issues.apache.org/jira/browse/DRILL-4380 > Project: Apache Drill > Issue Type: Bug >Reporter: Parth Chandra > > The regression has been caused by the changes in > 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over > empty folders consistently so that they report table not found rather than > failing.) > In ParquetFormatPlugin, the original code created a FileSelection object in > the following code: > {code} > return new FileSelection(fileNames, metaRootPath.toString(), metadata, > selection.getFileStatusList(fs)); > {code} > The selection.getFileStatusList call made an inexpensive call to > FileSelection.init(). The call was inexpensive because the > FileSelection.files member was not set and the code does not need to make an > expensive call to get the file statuses corresponding to the files in the > FileSelection.files member. > In the new code, this is replaced by > {code} > final FileSelection newSelection = FileSelection.create(null, fileNames, > metaRootPath.toString()); > return ParquetFileSelection.create(newSelection, metadata); > {code} > This sets the FileSelection.files member but not the FileSelection.statuses > member. A subsequent call to FileSelection.getStatuses ( in > ParquetGroupScan() ) now makes an expensive call to get all the statuses. > It appears that there was an implicit assumption that the > FileSelection.statuses member should be set before the FileSelection.files > member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139802#comment-15139802 ] ASF GitHub Bot commented on DRILL-4380: --- Github user jacques-n commented on a diff in the pull request: https://github.com/apache/drill/pull/369#discussion_r52376311 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java --- @@ -118,7 +126,10 @@ public boolean apply(@Nullable FileStatus status) { } })); -return create(nonDirectories, null, selectionRoot); +final FileSelection fileSel = create(nonDirectories, null, selectionRoot); +logger.info("FileSelection.minusDirectories() took {} ms, numFiles: {}", --- End diff -- same, DEBUG seems more appropriate. > Fix performance regression: in creation of FileSelection in > ParquetFormatPlugin to not set files if metadata cache is available. > > > Key: DRILL-4380 > URL: https://issues.apache.org/jira/browse/DRILL-4380 > Project: Apache Drill > Issue Type: Bug >Reporter: Parth Chandra > > The regression has been caused by the changes in > 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over > empty folders consistently so that they report table not found rather than > failing.) > In ParquetFormatPlugin, the original code created a FileSelection object in > the following code: > {code} > return new FileSelection(fileNames, metaRootPath.toString(), metadata, > selection.getFileStatusList(fs)); > {code} > The selection.getFileStatusList call made an inexpensive call to > FileSelection.init(). The call was inexpensive because the > FileSelection.files member was not set and the code does not need to make an > expensive call to get the file statuses corresponding to the files in the > FileSelection.files member. > In the new code, this is replaced by > {code} > final FileSelection newSelection = FileSelection.create(null, fileNames, > metaRootPath.toString()); > return ParquetFileSelection.create(newSelection, metadata); > {code} > This sets the FileSelection.files member but not the FileSelection.statuses > member. A subsequent call to FileSelection.getStatuses ( in > ParquetGroupScan() ) now makes an expensive call to get all the statuses. > It appears that there was an implicit assumption that the > FileSelection.statuses member should be set before the FileSelection.files > member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139805#comment-15139805 ] ASF GitHub Bot commented on DRILL-4380: --- Github user jacques-n commented on a diff in the pull request: https://github.com/apache/drill/pull/369#discussion_r52376432 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetFormatPlugin.java --- @@ -233,7 +233,7 @@ private FileSelection expandSelection(DrillFileSystem fs, FileSelection selectio // /a/b/c.parquet and the format of the selection root must match that of the file names // otherwise downstream operations such as partition pruning can break. final Path metaRootPath = Path.getPathWithoutSchemeAndAuthority(metaRootDir.getPath()); -final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString()); +final FileSelection newSelection = new FileSelection(selection.getStatuses(fs), fileNames, metaRootPath.toString()); --- End diff -- It seems like we keep having issues with misuse of this interface which causes planning regressions. Do you think it makes sense to either change the api or add additional comments to make sure people aren't doing the wrong thing? > Fix performance regression: in creation of FileSelection in > ParquetFormatPlugin to not set files if metadata cache is available. > > > Key: DRILL-4380 > URL: https://issues.apache.org/jira/browse/DRILL-4380 > Project: Apache Drill > Issue Type: Bug >Reporter: Parth Chandra > > The regression has been caused by the changes in > 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over > empty folders consistently so that they report table not found rather than > failing.) > In ParquetFormatPlugin, the original code created a FileSelection object in > the following code: > {code} > return new FileSelection(fileNames, metaRootPath.toString(), metadata, > selection.getFileStatusList(fs)); > {code} > The selection.getFileStatusList call made an inexpensive call to > FileSelection.init(). The call was inexpensive because the > FileSelection.files member was not set and the code does not need to make an > expensive call to get the file statuses corresponding to the files in the > FileSelection.files member. > In the new code, this is replaced by > {code} > final FileSelection newSelection = FileSelection.create(null, fileNames, > metaRootPath.toString()); > return ParquetFileSelection.create(newSelection, metadata); > {code} > This sets the FileSelection.files member but not the FileSelection.statuses > member. A subsequent call to FileSelection.getStatuses ( in > ParquetGroupScan() ) now makes an expensive call to get all the statuses. > It appears that there was an implicit assumption that the > FileSelection.statuses member should be set before the FileSelection.files > member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139801#comment-15139801 ] ASF GitHub Bot commented on DRILL-4380: --- Github user jacques-n commented on a diff in the pull request: https://github.com/apache/drill/pull/369#discussion_r52376249 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java --- @@ -73,13 +75,18 @@ public String getSelectionRoot() { } public List getStatuses(final DrillFileSystem fs) throws IOException { -if (statuses == null) { +Stopwatch timer = Stopwatch.createStarted(); + +if (statuses == null) { final List newStatuses = Lists.newArrayList(); for (final String pathStr:files) { newStatuses.add(fs.getFileStatus(new Path(pathStr))); } statuses = newStatuses; } +logger.info("FileSelection.getStatuses() took {} ms, numFiles: {}", --- End diff -- DEBUG? > Fix performance regression: in creation of FileSelection in > ParquetFormatPlugin to not set files if metadata cache is available. > > > Key: DRILL-4380 > URL: https://issues.apache.org/jira/browse/DRILL-4380 > Project: Apache Drill > Issue Type: Bug >Reporter: Parth Chandra > > The regression has been caused by the changes in > 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over > empty folders consistently so that they report table not found rather than > failing.) > In ParquetFormatPlugin, the original code created a FileSelection object in > the following code: > {code} > return new FileSelection(fileNames, metaRootPath.toString(), metadata, > selection.getFileStatusList(fs)); > {code} > The selection.getFileStatusList call made an inexpensive call to > FileSelection.init(). The call was inexpensive because the > FileSelection.files member was not set and the code does not need to make an > expensive call to get the file statuses corresponding to the files in the > FileSelection.files member. > In the new code, this is replaced by > {code} > final FileSelection newSelection = FileSelection.create(null, fileNames, > metaRootPath.toString()); > return ParquetFileSelection.create(newSelection, metadata); > {code} > This sets the FileSelection.files member but not the FileSelection.statuses > member. A subsequent call to FileSelection.getStatuses ( in > ParquetGroupScan() ) now makes an expensive call to get all the statuses. > It appears that there was an implicit assumption that the > FileSelection.statuses member should be set before the FileSelection.files > member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
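For context, the method under review is a lazy memoizer: the statuses list is computed only when it was not supplied at construction time, and computing it costs one filesystem round trip per file, which is exactly the expense described in this issue. With the generic type parameters that the mail formatting stripped restored, and the Stopwatch instrumentation from the patch elided, it looks roughly like:
{code}
public List<FileStatus> getStatuses(final DrillFileSystem fs) throws IOException {
  if (statuses == null) {
    final List<FileStatus> newStatuses = Lists.newArrayList();
    for (final String pathStr : files) {
      // One RPC per file: cheap for a handful of files, very expensive for a
      // large table whose statuses were already in the metadata cache.
      newStatuses.add(fs.getFileStatus(new Path(pathStr)));
    }
    statuses = newStatuses;
  }
  return statuses;
}
{code}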
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139816#comment-15139816 ] ASF GitHub Bot commented on DRILL-4380: --- Github user adeneche commented on a diff in the pull request: https://github.com/apache/drill/pull/369#discussion_r52377202 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java --- @@ -118,7 +126,10 @@ public boolean apply(@Nullable FileStatus status) { } })); -return create(nonDirectories, null, selectionRoot); +final FileSelection fileSel = create(nonDirectories, null, selectionRoot); +logger.info("FileSelection.minusDirectories() took {} ms, numFiles: {}", --- End diff -- minusDirectories() ? > Fix performance regression: in creation of FileSelection in > ParquetFormatPlugin to not set files if metadata cache is available. > > > Key: DRILL-4380 > URL: https://issues.apache.org/jira/browse/DRILL-4380 > Project: Apache Drill > Issue Type: Bug >Reporter: Parth Chandra > > The regression has been caused by the changes in > 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over > empty folders consistently so that they report table not found rather than > failing.) > In ParquetFormatPlugin, the original code created a FileSelection object in > the following code: > {code} > return new FileSelection(fileNames, metaRootPath.toString(), metadata, > selection.getFileStatusList(fs)); > {code} > The selection.getFileStatusList call made an inexpensive call to > FileSelection.init(). The call was inexpensive because the > FileSelection.files member was not set and the code does not need to make an > expensive call to get the file statuses corresponding to the files in the > FileSelection.files member. > In the new code, this is replaced by > {code} > final FileSelection newSelection = FileSelection.create(null, fileNames, > metaRootPath.toString()); > return ParquetFileSelection.create(newSelection, metadata); > {code} > This sets the FileSelection.files member but not the FileSelection.statuses > member. A subsequent call to FileSelection.getStatuses ( in > ParquetGroupScan() ) now makes an expensive call to get all the statuses. > It appears that there was an implicit assumption that the > FileSelection.statuses member should be set before the FileSelection.files > member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4268) Possible resource leak leading to SocketException: Too many open files
[ https://issues.apache.org/jira/browse/DRILL-4268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139822#comment-15139822 ] Ian Maloney commented on DRILL-4268: I've upgraded to Drill 1.4 and have not seen this issue yet. If it does not show up in the next few weeks, I think this has been fixed in Drill 1.4. > Possible resource leak leading to SocketException: Too many open files > -- > > Key: DRILL-4268 > URL: https://issues.apache.org/jira/browse/DRILL-4268 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: RHEL 6 running against Hive storage type >Reporter: Ian Maloney > > I have a Java app accessing Drill 1.2 via JDBC, which runs 100s of counts on > various tables. No concurrency is being used. The JDBC URL uses the format: > jdbc:drill:drillbit=a-bits-hostname > Hanifi suggested I check for open file descriptors using: > lsof -a -p DRILL_PID | wc -l > which I did on the two nodes I currently have running Drill, both before and > after restarting. > Node from JDBC connection string (which had been previously restarted): >Before: 396 >After: 396 > Other node: >Before: 14 >After: 395 > The error, "Too many open files", persists after restarting the bits. > Opened as a result of this thread: > http://mail-archives.apache.org/mod_mbox/drill-user/201601.mbox/browser -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
[ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139780#comment-15139780 ] ASF GitHub Bot commented on DRILL-4380: --- GitHub user parthchandra opened a pull request: https://github.com/apache/drill/pull/369 DRILL-4380: Fix performance regression: in creation of FileSelection … …in ParquetFormatPlugin to not set files if metadata cache is available. You can merge this pull request into a Git repository by running: $ git pull https://github.com/parthchandra/incubator-drill DRILL-4380 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/369.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #369 commit be374c12992ef581a285b0a260bb9ad037d6df92 Author: Parth Chandra Date: 2015-12-18T00:30:42Z DRILL-4380: Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available. > Fix performance regression: in creation of FileSelection in > ParquetFormatPlugin to not set files if metadata cache is available. > > > Key: DRILL-4380 > URL: https://issues.apache.org/jira/browse/DRILL-4380 > Project: Apache Drill > Issue Type: Bug >Reporter: Parth Chandra > > The regression has been caused by the changes in > 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over > empty folders consistently so that they report table not found rather than > failing.) > In ParquetFormatPlugin, the original code created a FileSelection object in > the following code: > {code} > return new FileSelection(fileNames, metaRootPath.toString(), metadata, > selection.getFileStatusList(fs)); > {code} > The selection.getFileStatusList call made an inexpensive call to > FileSelection.init(). The call was inexpensive because the > FileSelection.files member was not set and the code does not need to make an > expensive call to get the file statuses corresponding to the files in the > FileSelection.files member. > In the new code, this is replaced by > {code} > final FileSelection newSelection = FileSelection.create(null, fileNames, > metaRootPath.toString()); > return ParquetFileSelection.create(newSelection, metadata); > {code} > This sets the FileSelection.files member but not the FileSelection.statuses > member. A subsequent call to FileSelection.getStatuses ( in > ParquetGroupScan() ) now makes an expensive call to get all the statuses. > It appears that there was an implicit assumption that the > FileSelection.statuses member should be set before the FileSelection.files > member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
Parth Chandra created DRILL-4380: Summary: Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available. Key: DRILL-4380 URL: https://issues.apache.org/jira/browse/DRILL-4380 Project: Apache Drill Issue Type: Bug Reporter: Parth Chandra The regression has been caused by the changes in 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over empty folders consistently so that they report table not found rather than failing.) In ParquetFormatPlugin, the original code created a FileSelection object in the following code: {code} return new FileSelection(fileNames, metaRootPath.toString(), metadata, selection.getFileStatusList(fs)); {code} The selection.getFileStatusList call made an inexpensive call to FileSelection.init(). The call was inexpensive because the FileSelection.files member was not set and the code does not need to make an expensive call to get the file statuses corresponding to the files in the FileSelection.files member. In the new code, this is replaced by {code} final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString()); return ParquetFileSelection.create(newSelection, metadata); {code} This sets the FileSelection.files member but not the FileSelection.statuses member. A subsequent call to FileSelection.getStatuses ( in ParquetGroupScan() ) now makes an expensive call to get all the statuses. It appears that there was an implicit assumption that the FileSelection.statuses member should be set before the FileSelection.files member is set. This assumption is no longer true. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
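Putting the review comments together, the remedy is to hand the already-known statuses to the factory method and leave the file list null, so a later getStatuses() call returns immediately instead of fetching a status per file. Roughly, following the create(statuses, files, root) signature used in the thread (this reflects the reviewers' suggestion, not necessarily the final merged code):
{code}
// Cheap path: pass the statuses that the expanded selection already holds
// and leave files null, so FileSelection.getStatuses() will not refetch.
final FileSelection newSelection = FileSelection.create(selection.getStatuses(fs), null, metaRootPath.toString());
return ParquetFileSelection.create(newSelection, metadata);
{code}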
[jira] [Commented] (DRILL-3522) IllegalStateException from Mongo storage plugin
[ https://issues.apache.org/jira/browse/DRILL-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139462#comment-15139462 ] Jason Altekruse commented on DRILL-3522: This applied cleanly to master and did not cause any Mongo test failures. I am planning to merge it to master shortly; there should be no need for a new patch to be uploaded. > IllegalStateException from Mongo storage plugin > --- > > Key: DRILL-3522 > URL: https://issues.apache.org/jira/browse/DRILL-3522 > Project: Apache Drill > Issue Type: Bug > Components: Storage - MongoDB >Affects Versions: 1.1.0 >Reporter: Adam Gilmore >Assignee: Adam Gilmore >Priority: Critical > Attachments: DRILL-3522.1.patch.txt > > > With a Mongo storage plugin enabled, we are sporadically getting the > following exception when running queries (even ones not against the Mongo storage > plugin): > {code} > SYSTEM ERROR: IllegalStateException: state should be: open > (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception > during fragment initialization: > org.apache.drill.common.exceptions.DrillRuntimeException: state should be: > open > org.apache.drill.exec.work.foreman.Foreman.run():253 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (com.google.common.util.concurrent.UncheckedExecutionException) > org.apache.drill.common.exceptions.DrillRuntimeException: state should be: > open > com.google.common.cache.LocalCache$Segment.get():2263 > com.google.common.cache.LocalCache.get():4000 > com.google.common.cache.LocalCache.getOrLoad():4004 > com.google.common.cache.LocalCache$LocalLoadingCache.get():4874 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.getSubSchemaNames():172 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.setHolder():159 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory.registerSchemas():127 > org.apache.drill.exec.store.mongo.MongoStoragePlugin.registerSchemas():86 > > org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():328 > org.apache.drill.exec.ops.QueryContext.getRootSchema():165 > org.apache.drill.exec.ops.QueryContext.getRootSchema():154 > org.apache.drill.exec.ops.QueryContext.getRootSchema():142 > org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():128 > org.apache.drill.exec.planner.sql.DrillSqlWorker.():91 > org.apache.drill.exec.work.foreman.Foreman.runSQL():901 > org.apache.drill.exec.work.foreman.Foreman.run():242 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (org.apache.drill.common.exceptions.DrillRuntimeException) state > should be: open > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$DatabaseLoader.load():98 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$DatabaseLoader.load():82 > com.google.common.cache.LocalCache$LoadingValueReference.loadFuture():3599 > com.google.common.cache.LocalCache$Segment.loadSync():2379 > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad():2342 > com.google.common.cache.LocalCache$Segment.get():2257 > com.google.common.cache.LocalCache.get():4000 > com.google.common.cache.LocalCache.getOrLoad():4004 > com.google.common.cache.LocalCache$LocalLoadingCache.get():4874 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.getSubSchemaNames():172 
> > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.setHolder():159 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory.registerSchemas():127 > org.apache.drill.exec.store.mongo.MongoStoragePlugin.registerSchemas():86 > > org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():328 > org.apache.drill.exec.ops.QueryContext.getRootSchema():165 > org.apache.drill.exec.ops.QueryContext.getRootSchema():154 > org.apache.drill.exec.ops.QueryContext.getRootSchema():142 > org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():128 > org.apache.drill.exec.planner.sql.DrillSqlWorker.():91 > org.apache.drill.exec.work.foreman.Foreman.runSQL():901 > org.apache.drill.exec.work.foreman.Foreman.run():242 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (java.lang.IllegalStateException) state should be: open > com.mongodb.assertions.Assertions
[jira] [Commented] (DRILL-4373) Drill and Hive have incompatible timestamp representations in parquet
[ https://issues.apache.org/jira/browse/DRILL-4373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139445#comment-15139445 ] Rahul Challapalli commented on DRILL-4373: -- Hive itself fails to read the file with the below error (Same error from drill's hive plugin as well) {code} 2016-02-09 19:12:21,980 ERROR [main]: CliDriver (SessionState.java:printError(833)) - Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:153) at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:1707) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:221) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:153) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:364) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:712) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:631) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:570) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable at org.apache.hadoop.hive.ql.exec.ListSinkOperator.processOp(ListSinkOperator.java:90) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:95) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:584) at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:576) at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:139) ... 12 more Caused by: java.lang.UnsupportedOperationException: Cannot inspect org.apache.hadoop.io.LongWritable at org.apache.hadoop.hive.ql.io.parquet.serde.primitive.ParquetStringInspector.getPrimitiveWritableObject(ParquetStringInspector.java:52) at org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:225) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:485) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:438) at org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJSONSerDe.java:71) at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.doSerialize(LazySimpleSerDe.java:422) at org.apache.hadoop.hive.serde2.AbstractEncodingAwareSerDe.serialize(AbstractEncodingAwareSerDe.java:50) at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:71) at org.apache.hadoop.hive.ql.exec.DefaultFetchFormatter.convert(DefaultFetchFormatter.java:40) at org.apache.hadoop.hive.ql.exec.ListSinkOperator.processOp(ListSinkOperator.java:87) ... 
19 more {code} > Drill and Hive have incompatible timestamp representations in parquet > - > > Key: DRILL-4373 > URL: https://issues.apache.org/jira/browse/DRILL-4373 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Hive, Storage - Parquet >Reporter: Rahul Challapalli > > git.commit.id.abbrev=83d460c > I created a parquet file with a timestamp type using Drill. Now if I define a > hive table on top of the parquet file and use "timestamp" as the column type, > drill fails to read the hive table through the hive storage plugin -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (DRILL-4378) CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE
[ https://issues.apache.org/jira/browse/DRILL-4378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jinfeng Ni reassigned DRILL-4378: - Assignee: Jinfeng Ni > CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE > --- > > Key: DRILL-4378 > URL: https://issues.apache.org/jira/browse/DRILL-4378 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization, Storage - HBase >Affects Versions: 1.4.0 >Reporter: John Omernik >Assignee: Jinfeng Ni > > I created a view to avoid forcing users to write queries that always > included the CONVERT_FROM statements. (I am a huge advocate of making things > easy for the users, and writing queries with CONVERT_FROM statements isn't > easy). > I ran a query the other day on one of these views and noticed that a query > that took 30 seconds really shouldn't take 30 seconds. What do I mean? Well, > I wanted to get part of a record by looking up the MapR-DB row key (equiv. to > HBASE row key). That should be an instant lookup. Sure enough, when I tried > it in the hbase shell it returned instantly. So why did Drill take 30 > seconds? I shot an email to Ted and Jim at MapR to ask this very question. > Ted suggested that I try the query without a view. Sure enough, if I use the > convert_from in a direct query, it's an instant (sub-second) return. Thus it > appears something in the view is not allowing the query to short-circuit the > read. > Ted suggests I post here (I am curious whether anyone who has HBASE set up is > seeing this same issue with views) but also include the EXPLAIN plan. > Basically, using my very limited ability to read EXPLAIN plans (if someone > has a pointer to a blog post or docs on how to read EXPLAIN I would love > that!), it looks like in the view the startRow and stopRow in the > hbaseScanSpec are not set, seeming to cause a scan. Is there any way to > assist the planner when running this through a view so that we can get the > performance of the query without the view but with the ease of > use/readability of using the view? > Thanks!!! 
> John > View Creation > CREATE VIEW view_testpaste as > SELECT > CONVERT_FROM(row_key, 'UTF8') AS pasteid, > CONVERT_FROM(pastes.pdata.lang, 'UTF8') AS lang, > CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste > FROM dfs.`pastes`.`/pastes` pastes; > Select from view takes 32 seconds (seems to be a scan) > > select paste from view_testpaste where pasteid = 'djHEHcPM' > 1 row selected (32.302 seconds) > Just a direct select returns very fast (0.486 seconds) > > select CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste > FROM dfs.`pastes`.`/pastes` pastes where > CONVERT_FROM(row_key, 'UTF8') = 'djHEHcPM'; > 1 row selected (0.486 seconds) > EXPLAIN PLAN FOR select paste from view_testpaste where pasteid = 'djHEHcPM' > +--+--+ > | text | json | > +--+--+ > | 00-00Screen > 00-01 UnionExchange > 01-01Project(paste=[CONVERT_FROMUTF8($1)]) > 01-02 SelectionVectorRemover > 01-03Filter(condition=[=(CONVERT_FROMUTF8($0), 'djHEHcPM')]) > 01-04 Project(row_key=[$1], ITEM=[ITEM($0, 'paste')]) > 01-05Scan(groupscan=[MapRDBGroupScan > [HBaseScanSpec=HBaseScanSpec [tableName=maprfs:///data/pastebiner/pastes, > startRow=null, stopRow=null, filter=null], columns=[`row_key`, > `raw`.`paste`]]]) > | { > "head" : { > "version" : 1, > "generator" : { > "type" : "ExplainHandler", > "info" : "" > }, > "type" : "APACHE_DRILL_PHYSICAL", > "options" : [ ], > "queue" : 0, > "resultMode" : "EXEC" > }, > "graph" : [ { > "pop" : "maprdb-scan", > "@id" : 65541, > "userName" : "darkness", > "hbaseScanSpec" : { > "tableName" : "maprfs:///data/pastebiner/pastes", > "startRow" : "", > "stopRow" : "", > "serializedFilter" : null > }, > "storage" : { > "type" : "file", > "enabled" : true, > "connection" : "maprfs:///", > "workspaces" : { > "root" : { > "location" : "/", > "writable" : false, > "defaultInputFormat" : null > }, > "pastes" : { > "location" : "/data/pastebiner", > "writable" : true, > "defaultInputFormat" : null > }, > "dev" : { > "location" : "/data/dev", > "writable" : true, > "defaultInputFormat" : null > }, > "hive" : { > "location" : "/user/hive", > "writable" : true, > "defaultInputFormat" : null > }, > "tmp" : { > "location" : "/tmp", > "writable" : true, > "defaultInputFormat" : null > } > }, > "formats" :
[jira] [Commented] (DRILL-4378) CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE
[ https://issues.apache.org/jira/browse/DRILL-4378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139319#comment-15139319 ] Jinfeng Ni commented on DRILL-4378: --- Some comments copied from the dev list. The view case seems to have CONVERT_FROMUTF8() in the filter, which prevents the HBase filter pushdown logic from working properly. Both the query against the view and the direct query used the FILTER_ON_PROJECT rule to try to push the filter into HBaseScan. However, the difference between them is the CONVERT_FROM() function. In the case of the direct query, CONVERT_FROM() remains unchanged. However, in the case of querying against the view, CONVERT_FROM() is transformed into CONVERT_FROMUTF8(). Apparently, the HBase pushdown logic assumes the function to be CONVERT_FROM() only, and thus does not work for the view case. Here is the internal log for the direct scan case: {code} FINE: Pop match: rule [HBasePushFilterIntoScan:Filter_On_Project] rels [rel#124:FilterPrel.PHYSICAL.RANDOM_DISTRIBUTED([]).[](input=rel#113:Subset#7.PHYSICAL.RANDOM_DISTRIBUTED([]).[],condition=>(CONVERT_FROM($0, 'UTF8'), 'b4')), rel#120:ProjectPrel.PHYSICAL.RANDOM_DISTRIBUTED([]).[](input=rel#107:Subset#6.PHYSICAL.ANY([]).[],row_key=$1,ITEM=ITEM($0, 'c3')), rel#103:ScanPrel.PHYSICAL.RANDOM_DISTRIBUTED([]).[](groupscan=HBaseGroupScan [HBaseScanSpec=HBaseScanSpec [tableName=TestTable1, startRow=null, stopRow=null, filter=null], columns=[`row_key`, `f2`.`c3`]])] {code} Here is the internal log for the view case: {code} FINE: Pop match: rule [HBasePushFilterIntoScan:Filter_On_Project] rels [rel#163:FilterPrel.PHYSICAL.RANDOM_DISTRIBUTED([]).[](input=rel#152:Subset#9.PHYSICAL.RANDOM_DISTRIBUTED([]).[],condition=>(CONVERT_FROMUTF8($0), 'b4')), rel#159:ProjectPrel.PHYSICAL.RANDOM_DISTRIBUTED([]).[](input=rel#146:Subset#8.PHYSICAL.ANY([]).[],row_key=$1,ITEM=ITEM($0, 'c3')), rel#142:ScanPrel.PHYSICAL.RANDOM_DISTRIBUTED([]).[](groupscan=HBaseGroupScan [HBaseScanSpec=HBaseScanSpec [tableName=TestTable1, startRow=null, stopRow=null, filter=null], columns=[`row_key`, `f2`.`c3`]])] {code} Looks like there is some timing issue in rule firing: for the view case, some other logic rewrote CONVERT_FROM() before this rule kicked in. > CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE > --- > > Key: DRILL-4378 > URL: https://issues.apache.org/jira/browse/DRILL-4378 > Project: Apache Drill > Issue Type: Bug > Components: Query Planning & Optimization, Storage - HBase >Affects Versions: 1.4.0 >Reporter: John Omernik > > I created a view to avoid forcing users to write queries that always > included the CONVERT_FROM statements. (I am a huge advocate of making things > easy for the users, and writing queries with CONVERT_FROM statements isn't > easy). > I ran a query the other day on one of these views and noticed that a query > that took 30 seconds really shouldn't take 30 seconds. What do I mean? Well, > I wanted to get part of a record by looking up the MapR-DB row key (equiv. to > HBASE row key). That should be an instant lookup. Sure enough, when I tried > it in the hbase shell it returned instantly. So why did Drill take 30 > seconds? I shot an email to Ted and Jim at MapR to ask this very question. > Ted suggested that I try the query without a view. Sure enough, if I use the > convert_from in a direct query, it's an instant (sub-second) return. Thus it > appears something in the view is not allowing the query to short-circuit the > read. 
> Ted suggests I post here (I am curious whether anyone who has HBASE set up is > seeing this same issue with views) but also include the EXPLAIN plan. > Basically, using my very limited ability to read EXPLAIN plans (if someone > has a pointer to a blog post or docs on how to read EXPLAIN I would love > that!), it looks like in the view the startRow and stopRow in the > hbaseScanSpec are not set, seeming to cause a scan. Is there any way to > assist the planner when running this through a view so that we can get the > performance of the query without the view but with the ease of > use/readability of using the view? > Thanks!!! > John > View Creation > CREATE VIEW view_testpaste as > SELECT > CONVERT_FROM(row_key, 'UTF8') AS pasteid, > CONVERT_FROM(pastes.pdata.lang, 'UTF8') AS lang, > CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste > FROM dfs.`pastes`.`/pastes` pastes; > Select from view takes 32 seconds (seems to be a scan) > > select paste from view_testpaste where pasteid = 'djHEHcPM' > 1 row selected (32.302 seconds) > Just a direct select returns very fast (0.486 seconds) > > select CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste >
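Given the analysis in the comment above — the pushdown rule matches only the CONVERT_FROM() spelling, while view expansion rewrites the call to CONVERT_FROMUTF8() — one natural shape for a fix is to have the rule accept both spellings when it inspects the filter expression. A hypothetical sketch (the helper name is invented and this is not Drill's actual rule code; RexCall is Calcite's function-call expression node):
{code}
import org.apache.calcite.rex.RexCall;

// Hypothetical predicate for the pushdown rule: treat the post-rewrite
// CONVERT_FROMUTF8 form the same as the original CONVERT_FROM call.
private static boolean isConvertFromCall(final RexCall call) {
  final String name = call.getOperator().getName().toUpperCase();
  return name.equals("CONVERT_FROM") || name.equals("CONVERT_FROMUTF8");
}
{code}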
[jira] [Commented] (DRILL-3581) Google Guava version is so old it causes incompatibilities with other libs
[ https://issues.apache.org/jira/browse/DRILL-3581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139235#comment-15139235 ] Vincent Uribe commented on DRILL-3581: -- Won't this be available before 1.6? > Google Guava version is so old it causes incompatibilities with other libs > -- > > Key: DRILL-3581 > URL: https://issues.apache.org/jira/browse/DRILL-3581 > Project: Apache Drill > Issue Type: Bug > Components: Client - JDBC >Affects Versions: 1.1.0 > Environment: Linux, JDK 1.8 >Reporter: Joseph Barefoot >Assignee: Steven Phillips > Fix For: 1.6.0 > > > Drill is currently using Guava version 14.0.1, which was released in March 2013. > https://github.com/apache/drill/blob/master/pom.xml > Many other Java projects use newer versions; however, this conflicts with the > Drill JDBC driver since a couple of APIs it uses are incompatible with the > newer Guava versions. In particular: > https://github.com/apache/drill/blob/master/common/src/main/java/org/apache/drill/common/util/PathScanner.java > (The public Stopwatch class constructor has been removed in favor of factory > methods.) > Although this seems minor, it prevents easily using Drill from a Java > application, since again many other open-source libs will be using the latest > Guava version (18). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
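For the record, the incompatibility called out here is concrete: Guava 15 deprecated Stopwatch's public constructors and later releases removed them, so code compiled against 14.0.1 fails once a newer Guava is on the classpath. The migration itself is a one-line change from constructor to factory method:
{code}
import java.util.concurrent.TimeUnit;

import com.google.common.base.Stopwatch;

public class StopwatchMigration {
  public static void main(String[] args) throws InterruptedException {
    // Guava 14 style, removed in later releases:
    //   Stopwatch sw = new Stopwatch().start();
    // Guava 15+ style, factory methods only:
    Stopwatch sw = Stopwatch.createStarted();
    Thread.sleep(50);  // stand-in for real work
    System.out.println("took " + sw.elapsed(TimeUnit.MILLISECONDS) + " ms");
  }
}
{code}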
[jira] [Commented] (DRILL-3522) IllegalStateException from Mongo storage plugin
[ https://issues.apache.org/jira/browse/DRILL-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139178#comment-15139178 ] Jason Altekruse commented on DRILL-3522: +1 Thanks for the fix, Adam; sorry this sat out for so long. Please feel free to reach out on the list if you are waiting for a review. I am trying to get the JIRA cleaned up so that the list of REVIEWABLE jiras actually reflects what currently needs to be reviewed, and hope this will be less of an issue going forward. > IllegalStateException from Mongo storage plugin > --- > > Key: DRILL-3522 > URL: https://issues.apache.org/jira/browse/DRILL-3522 > Project: Apache Drill > Issue Type: Bug > Components: Storage - MongoDB >Affects Versions: 1.1.0 >Reporter: Adam Gilmore >Assignee: Adam Gilmore >Priority: Critical > Attachments: DRILL-3522.1.patch.txt > > > With a Mongo storage plugin enabled, we are sporadically getting the > following exception when running queries (even ones not against the Mongo storage > plugin): > {code} > SYSTEM ERROR: IllegalStateException: state should be: open > (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception > during fragment initialization: > org.apache.drill.common.exceptions.DrillRuntimeException: state should be: > open > org.apache.drill.exec.work.foreman.Foreman.run():253 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (com.google.common.util.concurrent.UncheckedExecutionException) > org.apache.drill.common.exceptions.DrillRuntimeException: state should be: > open > com.google.common.cache.LocalCache$Segment.get():2263 > com.google.common.cache.LocalCache.get():4000 > com.google.common.cache.LocalCache.getOrLoad():4004 > com.google.common.cache.LocalCache$LocalLoadingCache.get():4874 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.getSubSchemaNames():172 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.setHolder():159 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory.registerSchemas():127 > org.apache.drill.exec.store.mongo.MongoStoragePlugin.registerSchemas():86 > > org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():328 > org.apache.drill.exec.ops.QueryContext.getRootSchema():165 > org.apache.drill.exec.ops.QueryContext.getRootSchema():154 > org.apache.drill.exec.ops.QueryContext.getRootSchema():142 > org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():128 > org.apache.drill.exec.planner.sql.DrillSqlWorker.():91 > org.apache.drill.exec.work.foreman.Foreman.runSQL():901 > org.apache.drill.exec.work.foreman.Foreman.run():242 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (org.apache.drill.common.exceptions.DrillRuntimeException) state > should be: open > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$DatabaseLoader.load():98 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$DatabaseLoader.load():82 > com.google.common.cache.LocalCache$LoadingValueReference.loadFuture():3599 > com.google.common.cache.LocalCache$Segment.loadSync():2379 > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad():2342 > com.google.common.cache.LocalCache$Segment.get():2257 > com.google.common.cache.LocalCache.get():4000 > com.google.common.cache.LocalCache.getOrLoad():4004 > 
com.google.common.cache.LocalCache$LocalLoadingCache.get():4874 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.getSubSchemaNames():172 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.setHolder():159 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory.registerSchemas():127 > org.apache.drill.exec.store.mongo.MongoStoragePlugin.registerSchemas():86 > > org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():328 > org.apache.drill.exec.ops.QueryContext.getRootSchema():165 > org.apache.drill.exec.ops.QueryContext.getRootSchema():154 > org.apache.drill.exec.ops.QueryContext.getRootSchema():142 > org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():128 > org.apache.drill.exec.planner.sql.DrillSqlWorker.():91 > org.apache.drill.exec.work.foreman.Foreman.runSQL():901 > org.apache.drill.exec.work.foreman.Foreman.run():242 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():
[jira] [Commented] (DRILL-3759) Make partition pruning multi-phased to reduce the working set kept in memory
[ https://issues.apache.org/jira/browse/DRILL-3759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138971#comment-15138971 ] John Omernik commented on DRILL-3759: - I am tossing my +1 on this JIRA. Any sort of larger table in Parquet, where you are running across multiple subdirectories (consider a year's worth of data) with multiple files (even say 25 files per day), would result in excessive query planning due to the parser reviewing all files, even when a prune operation is requested by the user, causing a very poor user experience on these tables. > Make partition pruning multi-phased to reduce the working set kept in memory > > > Key: DRILL-3759 > URL: https://issues.apache.org/jira/browse/DRILL-3759 > Project: Apache Drill > Issue Type: Improvement > Components: Query Planning & Optimization >Affects Versions: 1.1.0 >Reporter: Aman Sinha >Assignee: Mehant Baid > Fix For: Future > > > Currently, partition pruning gets all file names in the table and applies the > pruning. Suppose the files are spread out over several directories and there > is a filter on dirN; this is not efficient, both in terms of elapsed time > and memory usage. This has been seen in a few use cases recently. > Wherever possible, we should ideally perform the pruning in N steps (where N > is the number of directory levels referenced in the filter conditions), as sketched after this issue: > 1. Get the directory and filenames at level i > 2. Materialize into the in-memory table > 3. Apply interpreter-based evaluation of filter condition > 4. Determine qualifying directories, increment i and repeat from step 1 > > This multi-phase approach may not be possible for certain types of filters - > e.g., for disjunctions. This analysis needs to be done. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
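The four numbered steps in the issue above amount to a level-by-level walk of the directory tree, keeping only the entries that survive the filter at each level. A sketch of that control flow, with hypothetical helper names standing in for the planner machinery (directory listing and interpreter-based filter evaluation):
{code}
import java.util.ArrayList;
import java.util.List;

// Illustrative only: prune one directory level at a time so the working set
// is the entries at the current level, never every file in the table.
List<String> candidates = listChildren(tableRoot);            // entries at level 0
for (int level = 0; level <= maxDirLevelInFilter; level++) {
  final List<String> qualifying = new ArrayList<>();
  for (final String dir : candidates) {
    // Steps 2-3: materialize this level's value and evaluate the filter
    // conjuncts that mention dir<level> with the interpreter.
    if (filterAcceptsAtLevel(dir, level)) {
      qualifying.add(dir);
    }
  }
  candidates = listChildrenOfAll(qualifying);                 // step 4: descend, repeat
}
{code}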
[jira] [Created] (DRILL-4379) Unexpected Table Behavior with only one subdirectory vs. Many
John Omernik created DRILL-4379: --- Summary: Unexpected Table Behavior with only one subdirectory vs. Many Key: DRILL-4379 URL: https://issues.apache.org/jira/browse/DRILL-4379 Project: Apache Drill Issue Type: Bug Components: Query Planning & Optimization Affects Versions: 1.4.0 Reporter: John Omernik A common practice is to use directories below a main directory as a partitioning device. Say you have a table named "myawesomedata" and you get data into that table every day; it would be valuable to create the main directory, then subdirectories per day, to help optimize queries running against only certain days of data. /myawesomedata/ /myawesomedata/2016-02-01 /myawesomedata/2016-02-02 /myawesomedata/2016-02-03 /myawesomedata/2016-02-04 I have identified a condition where, if there is ONLY one subdirectory, queries do not return the results a user expects. Example: In the above, if I run the query select count(1) from `myawesomedata`; I get an accurate count across all subdirectories. If I run: select count(1) from `myawesomedata` where dir0 = '2016-02-01'; I get an accurate count for only the subdirectory 2016-02-01. However, if I delete subdirectories 2016-02-02, 2016-02-03, and 2016-02-04 and am left with: /myawesomedata/ /myawesomedata/2016-02-01 then if I run select count(1) from `myawesomedata`; it returns the accurate count (which is just that of the 2016-02-01 directory). However, if I run select count(1) from `myawesomedata` where dir0 = '2016-02-01'; it takes much longer (15 seconds vs. instant for the other queries) and returns no results, even though this is the same query as above that worked with two or more subdirectories. Basically, when there is only one subdirectory, a query asking for only that directory does not work the same way as when there are more subdirectories. This is an unexpected user experience and something I believe could cause user frustration and unexpected results from Drill usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DRILL-4378) CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE
John Omernik created DRILL-4378: --- Summary: CONVERT_FROM in View results in table scan of MapR-DB and perhaps HBASE Key: DRILL-4378 URL: https://issues.apache.org/jira/browse/DRILL-4378 Project: Apache Drill Issue Type: Bug Components: Query Planning & Optimization, Storage - HBase Affects Versions: 1.4.0 Reporter: John Omernik I created a view to avoid forcing users to write queries that always included the CONVERT_FROM statements. (I am a huge advocate of making things easy for the users, and writing queries with CONVERT_FROM statements isn't easy). I ran a query the other day on one of these views and noticed that a query that took 30 seconds really shouldn't take 30 seconds. What do I mean? Well, I wanted to get part of a record by looking up the MapR-DB row key (equiv. to HBASE row key). That should be an instant lookup. Sure enough, when I tried it in the hbase shell it returned instantly. So why did Drill take 30 seconds? I shot an email to Ted and Jim at MapR to ask this very question. Ted suggested that I try the query without a view. Sure enough, if I use the convert_from in a direct query, it's an instant (sub-second) return. Thus it appears something in the view is not allowing the query to short-circuit the read. Ted suggests I post here (I am curious whether anyone who has HBASE set up is seeing this same issue with views) but also include the EXPLAIN plan. Basically, using my very limited ability to read EXPLAIN plans (if someone has a pointer to a blog post or docs on how to read EXPLAIN I would love that!), it looks like in the view the startRow and stopRow in the hbaseScanSpec are not set, seeming to cause a scan. Is there any way to assist the planner when running this through a view so that we can get the performance of the query without the view but with the ease of use/readability of using the view? Thanks!!! 
John View Creation CREATE VIEW view_testpaste as SELECT CONVERT_FROM(row_key, 'UTF8') AS pasteid, CONVERT_FROM(pastes.pdata.lang, 'UTF8') AS lang, CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste FROM dfs.`pastes`.`/pastes` pastes; Select from view takes 32 seconds (seems to be a scan) > select paste from view_testpaste where pasteid = 'djHEHcPM' 1 row selected (32.302 seconds) Just a direct select returns very fast (0.486 seconds) > select CONVERT_FROM(pastes.raw.paste, 'UTF8') AS paste FROM dfs.`pastes`.`/pastes` pastes where CONVERT_FROM(row_key, 'UTF8') = 'djHEHcPM'; 1 row selected (0.486 seconds) EXPLAIN PLAN FOR select paste from view_testpaste where pasteid = 'djHEHcPM' +--+--+ | text | json | +--+--+ | 00-00Screen 00-01 UnionExchange 01-01Project(paste=[CONVERT_FROMUTF8($1)]) 01-02 SelectionVectorRemover 01-03Filter(condition=[=(CONVERT_FROMUTF8($0), 'djHEHcPM')]) 01-04 Project(row_key=[$1], ITEM=[ITEM($0, 'paste')]) 01-05Scan(groupscan=[MapRDBGroupScan [HBaseScanSpec=HBaseScanSpec [tableName=maprfs:///data/pastebiner/pastes, startRow=null, stopRow=null, filter=null], columns=[`row_key`, `raw`.`paste`]]]) | { "head" : { "version" : 1, "generator" : { "type" : "ExplainHandler", "info" : "" }, "type" : "APACHE_DRILL_PHYSICAL", "options" : [ ], "queue" : 0, "resultMode" : "EXEC" }, "graph" : [ { "pop" : "maprdb-scan", "@id" : 65541, "userName" : "darkness", "hbaseScanSpec" : { "tableName" : "maprfs:///data/pastebiner/pastes", "startRow" : "", "stopRow" : "", "serializedFilter" : null }, "storage" : { "type" : "file", "enabled" : true, "connection" : "maprfs:///", "workspaces" : { "root" : { "location" : "/", "writable" : false, "defaultInputFormat" : null }, "pastes" : { "location" : "/data/pastebiner", "writable" : true, "defaultInputFormat" : null }, "dev" : { "location" : "/data/dev", "writable" : true, "defaultInputFormat" : null }, "hive" : { "location" : "/user/hive", "writable" : true, "defaultInputFormat" : null }, "tmp" : { "location" : "/tmp", "writable" : true, "defaultInputFormat" : null } }, "formats" : { "psv" : { "type" : "text", "extensions" : [ "tbl" ], "delimiter" : "|" }, "csv" : { "type" : "text", "extensions" : [ "csv" ], "escape" : "`", "delimiter" : "," }, "tsv" : { "type" : "text", "extensions" : [ "tsv" ], "delimiter" : "\t" }, "parquet" : { "type" : "parquet" },
[jira] [Commented] (DRILL-3522) IllegalStateException from Mongo storage plugin
[ https://issues.apache.org/jira/browse/DRILL-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15138868#comment-15138868 ] B Anil Kumar commented on DRILL-3522: - +1 on this patch. [~dragoncurve] Can you please rebase this patch? > IllegalStateException from Mongo storage plugin > --- > > Key: DRILL-3522 > URL: https://issues.apache.org/jira/browse/DRILL-3522 > Project: Apache Drill > Issue Type: Bug > Components: Storage - MongoDB >Affects Versions: 1.1.0 >Reporter: Adam Gilmore >Assignee: Adam Gilmore >Priority: Critical > Attachments: DRILL-3522.1.patch.txt > > > With a Mongo storage plugin enabled, we are sporadically getting the > following exception when running queries (even not against the Mongo storage > plugin): > {code} > SYSTEM ERROR: IllegalStateException: state should be: open > (org.apache.drill.exec.work.foreman.ForemanException) Unexpected exception > during fragment initialization: > org.apache.drill.common.exceptions.DrillRuntimeException: state should be: > open > org.apache.drill.exec.work.foreman.Foreman.run():253 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (com.google.common.util.concurrent.UncheckedExecutionException) > org.apache.drill.common.exceptions.DrillRuntimeException: state should be: > open > com.google.common.cache.LocalCache$Segment.get():2263 > com.google.common.cache.LocalCache.get():4000 > com.google.common.cache.LocalCache.getOrLoad():4004 > com.google.common.cache.LocalCache$LocalLoadingCache.get():4874 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.getSubSchemaNames():172 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.setHolder():159 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory.registerSchemas():127 > org.apache.drill.exec.store.mongo.MongoStoragePlugin.registerSchemas():86 > > org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():328 > org.apache.drill.exec.ops.QueryContext.getRootSchema():165 > org.apache.drill.exec.ops.QueryContext.getRootSchema():154 > org.apache.drill.exec.ops.QueryContext.getRootSchema():142 > org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():128 > org.apache.drill.exec.planner.sql.DrillSqlWorker.():91 > org.apache.drill.exec.work.foreman.Foreman.runSQL():901 > org.apache.drill.exec.work.foreman.Foreman.run():242 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (org.apache.drill.common.exceptions.DrillRuntimeException) state > should be: open > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$DatabaseLoader.load():98 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$DatabaseLoader.load():82 > com.google.common.cache.LocalCache$LoadingValueReference.loadFuture():3599 > com.google.common.cache.LocalCache$Segment.loadSync():2379 > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad():2342 > com.google.common.cache.LocalCache$Segment.get():2257 > com.google.common.cache.LocalCache.get():4000 > com.google.common.cache.LocalCache.getOrLoad():4004 > com.google.common.cache.LocalCache$LocalLoadingCache.get():4874 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.getSubSchemaNames():172 > > org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory$MongoSchema.setHolder():159 > > 
org.apache.drill.exec.store.mongo.schema.MongoSchemaFactory.registerSchemas():127 > org.apache.drill.exec.store.mongo.MongoStoragePlugin.registerSchemas():86 > > org.apache.drill.exec.store.StoragePluginRegistry$DrillSchemaFactory.registerSchemas():328 > org.apache.drill.exec.ops.QueryContext.getRootSchema():165 > org.apache.drill.exec.ops.QueryContext.getRootSchema():154 > org.apache.drill.exec.ops.QueryContext.getRootSchema():142 > org.apache.drill.exec.ops.QueryContext.getNewDefaultSchema():128 > org.apache.drill.exec.planner.sql.DrillSqlWorker.():91 > org.apache.drill.exec.work.foreman.Foreman.runSQL():901 > org.apache.drill.exec.work.foreman.Foreman.run():242 > java.util.concurrent.ThreadPoolExecutor.runWorker():1145 > java.util.concurrent.ThreadPoolExecutor$Worker.run():615 > java.lang.Thread.run():745 > Caused By (java.lang.IllegalStateException) state should be: open > com.mongodb.assertions.Assertions.isTrue():70 > com.mongodb.connection.BaseCluster.selectServer():79 > com.mongodb.binding.ClusterBinding$Cl
[jira] [Assigned] (DRILL-2282) Eliminate spaces, special characters from names in function templates
[ https://issues.apache.org/jira/browse/DRILL-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vitalii Diravka reassigned DRILL-2282: -- Assignee: Vitalii Diravka (was: Mehant Baid) > Eliminate spaces, special characters from names in function templates > - > > Key: DRILL-2282 > URL: https://issues.apache.org/jira/browse/DRILL-2282 > Project: Apache Drill > Issue Type: Bug > Components: Functions - Drill >Reporter: Mehant Baid >Assignee: Vitalii Diravka > Fix For: 1.6.0 > > Attachments: DRILL-2282.patch > > > Having spaces in the name of the functions causes issues while deserializing > such expressions when we try to read the plan fragment. As part of this JIRA > would like to clean up all the templates to not include special characters in > their names. -- This message was sent by Atlassian JIRA (v6.3.4#6332)