I won't be the best source for explaining why the flag worked, but this thread should help explain why performance is expected to be better when using the Native Parquet reader.
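For reference, the reader option discussed further down in this thread is a session/system option in Drill, so it can be toggled per session from SQLLine or the Web UI rather than cluster-wide. A minimal sketch (option name taken from the messages below; Drill 1.10 `ALTER SESSION` syntax):

```sql
-- Switch the current session to the "new" (complex) Parquet reader.
ALTER SESSION SET `store.parquet.use_new_reader` = true;

-- Switch back to the default native Parquet reader when done,
-- since the native reader is the one expected to perform better.
ALTER SESSION SET `store.parquet.use_new_reader` = false;
```

Setting it at session scope keeps the slower reader confined to the queries that actually need it.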
https://lists.apache.org/thread.html/6429051d5babb87d3b03494524c1802f75d572d630cb5690fd616741@<user.drill.apache.org>

That said, there is work in progress to improve performance for Parquet readers in general, including complex data.

-----Original Message-----
From: Anup Tiwari [mailto:[email protected]]
Sent: Wednesday, February 14, 2018 2:09 AM
To: [email protected]
Subject: Re: Reading drill(1.10.0) created parquet table in hive(2.1.1) using external table

Hi Kunal,

That issue was related to container size; it is resolved and now working. However, I was trying the reverse: a table created in Hive (2.1.1)/Hadoop (2.7.3) and stored on S3, which I am trying to read via Drill (1.10.0).

Initially, when querying the Parquet data stored on S3, I was getting the error below, which I resolved by setting `store.parquet.use_new_reader` = true:

ERROR o.a.d.e.w.fragment.FragmentExecutor - SYSTEM ERROR: TProtocolException: don't know what type: 15
Fragment 1:2
[Error Id: 43369db3-532a-4004-b966-7fbf42b84cc8 on prod-hadoop-102.bom-prod.aws.games24x7.com:31010]
org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: TProtocolException: don't know what type: 15
Fragment 1:2
[Error Id: 43369db3-532a-4004-b966-7fbf42b84cc8 on prod-hadoop-102.bom-prod.aws.games24x7.com:31010]
    at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.10.0.jar:1.10.0]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:293) [drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) [drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:262) [drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.10.0.jar:1.10.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in parquet record reader.

While searching for the above issue, I found somewhere that setting `store.parquet.use_new_reader` = true impacts query performance. Can you provide any details on this?

Also, after setting this, I am able to query the files created by Hive. But when I execute a big query on the files, I get the error below:

org.apache.drill.common.exceptions.UserException: SYSTEM ERROR: ConnectionPoolTimeoutException: Timeout waiting for connection from pool
Fragment 3:14
[Error Id: 0564e2e4-c917-489c-8a54-2a623401563c on prod-hadoop-102.bom-prod.aws.games24x7.com:31010]
    at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.10.0.jar:1.10.0]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.sendFinalState(FragmentExecutor.java:293) [drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.cleanup(FragmentExecutor.java:160) [drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:262) [drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.10.0.jar:1.10.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_72]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_72]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_72]
Caused by: org.apache.drill.common.exceptions.DrillRuntimeException: Error in drill parquet reader (complex). Message: Failure in setting up reader
Caused by: com.amazonaws.AmazonClientException: Unable to execute HTTP
request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454) ~[aws-java-sdk-1.7.4.jar:na]
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232) ~[aws-java-sdk-1.7.4.jar:na]
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528) ~[aws-java-sdk-1.7.4.jar:na]
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111) ~[aws-java-sdk-1.7.4.jar:na]
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:91) ~[hadoop-aws-2.7.1.jar:na]
    at org.apache.hadoop.fs.s3a.S3AInputStream.seek(S3AInputStream.java:115) ~[hadoop-aws-2.7.1.jar:na]
    at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62) ~[hadoop-common-2.7.1.jar:na]
    at org.apache.drill.exec.store.dfs.DrillFSDataInputStream.seek(DrillFSDataInputStream.java:57) ~[drill-java-exec-1.10.0.jar:1.10.0]
    at org.apache.parquet.hadoop.ColumnChunkIncReadStore.addColumn(ColumnChunkIncReadStore.java:245) ~[drill-java-exec-1.10.0.jar:1.8.1-drill-r0]
    at org.apache.drill.exec.store.parquet2.DrillParquetReader.setup(DrillParquetReader.java:261) ~[drill-java-exec-1.10.0.jar:1.10.0]
    ... 16 common frames omitted
Caused by: org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:232) ~[httpclient-4.2.5.jar:4.2.5]
    at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:199) ~[httpclient-4.2.5.jar:4.2.5]
    at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[na:na]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_72]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_72]
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) ~[aws-java-sdk-1.7.4.jar:na]
    at com.amazonaws.http.conn.$Proxy79.getConnection(Unknown Source) ~[na:na]
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:456) ~[httpclient-4.2.5.jar:4.2.5]
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906) ~[httpclient-4.2.5.jar:4.2.5]
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805) ~[httpclient-4.2.5.jar:4.2.5]
    at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384) ~[aws-java-sdk-1.7.4.jar:na]
    ... 25 common frames omitted

Note: the Parquet file I want to access contains 43 columns. All columns are of the type "optional binary col1 (UTF8);", except one, which is "optional int32 col2;".

On Tue, Feb 13, 2018 10:59 PM, Kunal Khatua <[email protected]> wrote:

Can you share what the error is? Without that, it is anybody's guess on what the issue is.
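An aside on the ConnectionPoolTimeoutException above: the Hadoop s3a client caps its HTTP connection pool with `fs.s3a.connection.maximum` (a small default in Hadoop 2.7), and a wide parallel scan such as the 43-column query described here can exhaust that pool. One common mitigation, sketched below as an assumption rather than a confirmed fix for this thread (the values are illustrative, not tuned for this cluster), is to raise the cap in core-site.xml:

```xml
<!-- core-site.xml: raise the S3A HTTP connection pool cap so that many
     concurrent fragments reading from S3 do not starve each other. -->
<property>
  <name>fs.s3a.connection.maximum</name>
  <value>100</value>
</property>
<property>
  <!-- socket timeout, in milliseconds -->
  <name>fs.s3a.connection.timeout</name>
  <value>200000</value>
</property>
```

A restart of the Drillbits (or whichever service holds the Hadoop configuration) would be needed for the change to take effect.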
-----Original Message-----
From: Anup Tiwari [mailto:[email protected]]
Sent: Tuesday, February 13, 2018 6:19 AM
To: [email protected]
Subject: Reading drill(1.10.0) created parquet table in hive(2.1.1) using external table

Hi Team,

I am trying to read a Drill (1.10.0)-created Parquet table in Hive (2.1.1) using an external table, and I am getting an error that does not seem related to Drill. Has anyone tried this? If yes, are there any best practices or links for this?

Regards,
Anup Tiwari
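For the original question of exposing a Drill-written Parquet directory to Hive, the usual approach is a Hive external table whose column definitions match the Parquet schema and whose location points at the directory Drill wrote. A minimal sketch (table name, columns, and path are hypothetical, not from the thread):

```sql
-- HiveQL: map an existing Parquet directory as an external table.
-- Column names and types must match the Parquet schema Drill produced;
-- a mismatch is a frequent cause of read errors on the Hive side.
CREATE EXTERNAL TABLE drill_events (
  col1 STRING,
  col2 INT
)
STORED AS PARQUET
LOCATION 's3a://my-bucket/path/to/drill/output/';
```

Because the table is external, dropping it in Hive leaves the underlying Parquet files untouched.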
