Hello everyone, I am currently using apache drill to query a directory of parquet files (total size 64gb with each file averaging 6.4Gb per size). I am running the following query
select * from dfs.rj.vw_reports_201805 where unique_key in ( select unique_key from dfs.rj.vw_reports_201805 group by unique_key having count(unique_key) > 1 ) limit 40 Now this query takes about 42 minutes after which it fails because one of the drillbits in the drillbit cluster crashes. I check the logs and I see several errors summarized as follows 20:35:18.006 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (172.35.15.129:5181,172.35.15.151:5181,172.35.15.152:5181) and timeout (5000) / elapsed (13347) org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) [curator-client-2.7.1.jar:na] at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) [curator-client-2.7.1.jar:na] at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115) [curator-client-2.7.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806) [curator-framework-2.7.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792) [curator-framework-2.7.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62) [curator-framework-2.7.1.jar:na] at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257) [curator-framework-2.7.1.jar:na] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_171] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_171] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_171] at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171] AND Aug 1, 2018 7:54:18 AM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) using format: (.+) version ((.*) )?\(build ?(.*)\) at org.apache.parquet.VersionParser.parse(VersionParser.java:112) at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:66) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:264) at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:568) at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:545) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:455) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:404) at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:390) at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:117) at org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:53) at org.apache.drill.exec.physical.impl.ImplCreator$2.run(ImplCreator.java:146) at org.apache.drill.exec.physical.impl.ImplCreator$2.run(ImplCreator.java:142) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1633) at org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:142) at org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182) at org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:110) at org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87) at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:206) at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) The Schema for the data I am querying in question is as follows index_no ANY YES bank_code ANY YES trxn_category ANY YES Debit_Account_Type ANY YES Credit_Account_Type ANY YES trxn_amount ANY YES trxn_fee ANY YES trxn_date ANY YES Currency ANY YES Late_reversal ANY YES card_type ANY YES terminal_type ANY YES Acquirer ANY YES Issuer ANY YES Value_SettleAmount ANY YES Value_RequestedAmount ANY YES Volume ANY YES unique_key ANY YES source_node_name ANY YES business_date ANY YES PT_datetime_tran_local ANY YES PTC_pan ANY YES PTC_terminal_id ANY YES PTC_card_acceptor_id_code ANY YES PTC_card_acceptor_name_loc ANY YES PT_system_trace_audit_nr ANY YES PT_message_type ANY YES PT_from_account_id ANY YES PT_to_account_id ANY YES PT_tran_type ANY YES PT_rsp_code_req ANY YES PT_rsp_code_rsp ANY YES PT_settle_amount_req ANY YES PT_settle_amount_rsp ANY YES PTC_merchant_type ANY YES PT_settle_amount_impact ANY YES PT_settle_tran_fee_req ANY YES PT_settle_tran_fee_rsp ANY YES PT_auth_id_rsp ANY YES PT_retrieval_reference_nr ANY YES PTC_totals_group ANY YES PTC_card_product ANY YES PT_tran_currency_code ANY YES PT_payee ANY YES PT_tran_amount_req ANY YES PT_tran_amount_rsp ANY YES PT_extended_tran_type ANY YES PT_tran_postilion_originated ANY YES PT_tran_completed ANY YES PT_tran_nr ANY YES PT_retention_data ANY YES PT_acquiring_inst_id_code ANY YES PT_message_reason_code ANY YES PT_sponsor_bank ANY YES PT_datetime_tran_gmt ANY YES PT_datetime_req ANY YES PT_datetime_rsp ANY YES PT_realtime_business_date ANY YES PT_recon_business_date ANY YES PT_from_account_type ANY YES PT_to_account_type ANY YES PT_tran_cash_req ANY YES PT_tran_cash_rsp ANY YES PT_tran_tran_fee_req ANY YES PT_tran_tran_fee_rsp ANY YES PT_tran_tran_fee_currency_code ANY YES PT_tran_proc_fee_req ANY YES PT_tran_proc_fee_rsp ANY YES PT_tran_proc_fee_currency_code ANY YES PT_settle_currency_code ANY YES PT_pos_entry_mode ANY YES PT_pos_condition_code ANY YES PT_tran_reversed ANY YES PT_card_verification_result ANY YES PT_online_system_id ANY YES PT_participant_id ANY YES PT_receiving_inst_id_code ANY YES PT_routing_type ANY YES PT_pt_pos_card_input_mode ANY YES PT_source_node_key ANY YES PT_proc_online_system_id ANY YES PTC_post_tran_cust_id ANY YES PTC_source_node_name ANY YES PTC_card_seq_nr ANY YES PTC_expiry_date ANY YES PTC_service_restriction_code ANY YES PTC_terminal_owner ANY YES PTC_mapped_card_acceptor_id_code ANY YES PTC_pos_terminal_type ANY YES PTC_pan_encrypted ANY YES PTSP_Account_Nr ANY YES PTSP_Code ANY YES account_PTSP_Code ANY YES PTSP_Name ANY YES Sort_Code ANY YES rdm_amt ANY YES Reward_Code ANY YES Reward_discount ANY YES ptsp_terminal_id ANY YES reward_terminal_id ANY YES txn_id ANY YES web_category_code ANY YES web_category_name ANY YES web_fee_type ANY YES web_merchant_disc ANY YES web_fee_cap ANY YES Account_Name ANY YES account_nr ANY YES Acquiring_bank ANY YES journal_amount ANY YES xls_amount ANY YES merch_cat_amount_cap ANY YES merch_cat_visa_amount_cap ANY YES reward_amount_cap ANY YES Amount_config_state ANY YES Amount_description ANY YES Authorized_Person ANY YES ACC_BANK_CODE ANY YES merch_cat_bearer ANY YES merch_cat_visa_bearer ANY YES merch_cat_visa_category_code ANY YES merch_cat_category_code ANY YES merch_cat_visa_category_name ANY YES merch_cat_category_name ANY YES credit_acc_id ANY YES credit_acc_nr_id ANY YES credit_cardholder_acc_id ANY YES credit_cardholder_acc_type ANY YES CreditAccNr_acc_id ANY YES CreditAccNr_acc_nr ANY YES CreditAccNr_acc_nr_id ANY YES CreditAccNr_se_id ANY YES CreditAccNr_state ANY YES Date_Modified ANY YES debit_acc_id ANY YES debit_acc_nr_id ANY YES debit_cardholder_acc_id ANY YES debit_cardholder_acc_type ANY YES DebitAccNr_acc_id ANY YES DebitAccNr_acc_nr ANY YES DebitAccNr_acc_nr_id ANY YES extended_trans_type ANY YES fee ANY YES Fee_amount_id ANY YES merch_cat_fee_cap ANY YES merch_cat_visa_fee_cap ANY YES reward_fee_cap ANY YES Fee_description ANY YES Fee_Discount ANY YES Fee_fee_id ANY YES fee_id ANY YES Fee_name ANY YES Fee_se_id ANY YES merch_cat_category_fee_type ANY YES merch_cat_category_visa_fee_type ANY YES merch_cat_category_merch_discount ANY YES merch_cat_category_visa_merch_discount ANY YES PT_sink_node_name ANY YES post_tran_id ANY YES Miscellaneous ANY YES terminal_owner ANY YES terminal_owner_name ANY YES REF_rpt_account_type ANY YES REF_IsPoolAccount ANY YES From_Account_Type ANY YES To_Account_Type ANY YES Tran_type_description ANY YES Region ANY YES Transaction_Status ANY YES Transaction_type_Impact ANY YES Message_Type_Desc ANY YES Response_Code_description ANY YES REF_BatchId ANY YES Beneficiary_Account ANY YES Rate ANY YES final_fee ANY YES final_fee_type ANY YES settlement_date ANY YES Please what can I do to improve performance of this query