Hello everyone,

I am currently using apache drill to query a directory of parquet files (total 
size 64gb with each file averaging 6.4Gb per size).  I am running the following 
query

select * from dfs.rj.vw_reports_201805 where unique_key in
    (
         select unique_key from dfs.rj.vw_reports_201805 group by unique_key 
having count(unique_key) > 1
    )
limit 40

Now this query takes about 42 minutes after which it fails because one of the 
drillbits in the drillbit cluster crashes. I check the logs and I see several 
errors summarized as follows

20:35:18.006 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - 
Connection timed out for connection string 
(172.35.15.129:5181,172.35.15.151:5181,172.35.15.152:5181) and timeout (5000) / 
elapsed (13347)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = 
ConnectionLoss
        at 
org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197) 
[curator-client-2.7.1.jar:na]
        at 
org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87) 
[curator-client-2.7.1.jar:na]
        at 
org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
 [curator-client-2.7.1.jar:na]
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:806)
 [curator-framework-2.7.1.jar:na]
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:792)
 [curator-framework-2.7.1.jar:na]
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:62)
 [curator-framework-2.7.1.jar:na]
        at 
org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:257)
 [curator-framework-2.7.1.jar:na]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
[na:1.8.0_171]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[na:1.8.0_171]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[na:1.8.0_171]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_171]
AND


Aug 1, 2018 7:54:18 AM WARNING: org.apache.parquet.CorruptStatistics: Ignoring 
statistics because created_by could not be parsed (see PARQUET-251): parquet-mr 
(build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb)

org.apache.parquet.VersionParser$VersionParseException: Could not parse 
created_by: parquet-mr (build 6aa21f8776625b5fa6b18059cfebe7549f2e00cb) using 
format: (.+) version ((.*) )?\(build ?(.*)\)

        at org.apache.parquet.VersionParser.parse(VersionParser.java:112)

        at 
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:66)

        at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:264)

        at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:568)

        at 
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:545)

        at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:455)

        at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:404)

        at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:390)

        at 
org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:117)

        at 
org.apache.drill.exec.store.parquet.ParquetScanBatchCreator.getBatch(ParquetScanBatchCreator.java:53)

        at 
org.apache.drill.exec.physical.impl.ImplCreator$2.run(ImplCreator.java:146)

        at 
org.apache.drill.exec.physical.impl.ImplCreator$2.run(ImplCreator.java:142)

        at java.security.AccessController.doPrivileged(Native Method)

        at javax.security.auth.Subject.doAs(Subject.java:422)

        at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1633)

        at 
org.apache.drill.exec.physical.impl.ImplCreator.getRecordBatch(ImplCreator.java:142)

        at 
org.apache.drill.exec.physical.impl.ImplCreator.getChildren(ImplCreator.java:182)

        at 
org.apache.drill.exec.physical.impl.ImplCreator.getRootExec(ImplCreator.java:110)

        at 
org.apache.drill.exec.physical.impl.ImplCreator.getExec(ImplCreator.java:87)

        at 
org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:206)

        at 
org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38)

        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)

        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)

        at java.lang.Thread.run(Thread.java:748)

The Schema for the data I am querying in question is as follows

index_no

ANY

YES

bank_code

ANY

YES

trxn_category

ANY

YES

Debit_Account_Type

ANY

YES

Credit_Account_Type

ANY

YES

trxn_amount

ANY

YES

trxn_fee

ANY

YES

trxn_date

ANY

YES

Currency

ANY

YES

Late_reversal

ANY

YES

card_type

ANY

YES

terminal_type

ANY

YES

Acquirer

ANY

YES

Issuer

ANY

YES

Value_SettleAmount

ANY

YES

Value_RequestedAmount

ANY

YES

Volume

ANY

YES

unique_key

ANY

YES

source_node_name

ANY

YES

business_date

ANY

YES

PT_datetime_tran_local

ANY

YES

PTC_pan

ANY

YES

PTC_terminal_id

ANY

YES

PTC_card_acceptor_id_code

ANY

YES

PTC_card_acceptor_name_loc

ANY

YES

PT_system_trace_audit_nr

ANY

YES

PT_message_type

ANY

YES

PT_from_account_id

ANY

YES

PT_to_account_id

ANY

YES

PT_tran_type

ANY

YES

PT_rsp_code_req

ANY

YES

PT_rsp_code_rsp

ANY

YES

PT_settle_amount_req

ANY

YES

PT_settle_amount_rsp

ANY

YES

PTC_merchant_type

ANY

YES

PT_settle_amount_impact

ANY

YES

PT_settle_tran_fee_req

ANY

YES

PT_settle_tran_fee_rsp

ANY

YES

PT_auth_id_rsp

ANY

YES

PT_retrieval_reference_nr

ANY

YES

PTC_totals_group

ANY

YES

PTC_card_product

ANY

YES

PT_tran_currency_code

ANY

YES

PT_payee

ANY

YES

PT_tran_amount_req

ANY

YES

PT_tran_amount_rsp

ANY

YES

PT_extended_tran_type

ANY

YES

PT_tran_postilion_originated

ANY

YES

PT_tran_completed

ANY

YES

PT_tran_nr

ANY

YES

PT_retention_data

ANY

YES

PT_acquiring_inst_id_code

ANY

YES

PT_message_reason_code

ANY

YES

PT_sponsor_bank

ANY

YES

PT_datetime_tran_gmt

ANY

YES

PT_datetime_req

ANY

YES

PT_datetime_rsp

ANY

YES

PT_realtime_business_date

ANY

YES

PT_recon_business_date

ANY

YES

PT_from_account_type

ANY

YES

PT_to_account_type

ANY

YES

PT_tran_cash_req

ANY

YES

PT_tran_cash_rsp

ANY

YES

PT_tran_tran_fee_req

ANY

YES

PT_tran_tran_fee_rsp

ANY

YES

PT_tran_tran_fee_currency_code

ANY

YES

PT_tran_proc_fee_req

ANY

YES

PT_tran_proc_fee_rsp

ANY

YES

PT_tran_proc_fee_currency_code

ANY

YES

PT_settle_currency_code

ANY

YES

PT_pos_entry_mode

ANY

YES

PT_pos_condition_code

ANY

YES

PT_tran_reversed

ANY

YES

PT_card_verification_result

ANY

YES

PT_online_system_id

ANY

YES

PT_participant_id

ANY

YES

PT_receiving_inst_id_code

ANY

YES

PT_routing_type

ANY

YES

PT_pt_pos_card_input_mode

ANY

YES

PT_source_node_key

ANY

YES

PT_proc_online_system_id

ANY

YES

PTC_post_tran_cust_id

ANY

YES

PTC_source_node_name

ANY

YES

PTC_card_seq_nr

ANY

YES

PTC_expiry_date

ANY

YES

PTC_service_restriction_code

ANY

YES

PTC_terminal_owner

ANY

YES

PTC_mapped_card_acceptor_id_code

ANY

YES

PTC_pos_terminal_type

ANY

YES

PTC_pan_encrypted

ANY

YES

PTSP_Account_Nr

ANY

YES

PTSP_Code

ANY

YES

account_PTSP_Code

ANY

YES

PTSP_Name

ANY

YES

Sort_Code

ANY

YES

rdm_amt

ANY

YES

Reward_Code

ANY

YES

Reward_discount

ANY

YES

ptsp_terminal_id

ANY

YES

reward_terminal_id

ANY

YES


txn_id

ANY

YES

web_category_code

ANY

YES

web_category_name

ANY

YES

web_fee_type

ANY

YES

web_merchant_disc

ANY

YES

web_fee_cap

ANY

YES

Account_Name

ANY

YES

account_nr

ANY

YES

Acquiring_bank

ANY

YES

journal_amount

ANY

YES

xls_amount

ANY

YES

merch_cat_amount_cap

ANY

YES

merch_cat_visa_amount_cap

ANY

YES

reward_amount_cap

ANY

YES

Amount_config_state

ANY

YES

Amount_description

ANY

YES

Authorized_Person

ANY

YES

ACC_BANK_CODE

ANY

YES

merch_cat_bearer

ANY

YES

merch_cat_visa_bearer

ANY

YES

merch_cat_visa_category_code

ANY

YES

merch_cat_category_code

ANY

YES

merch_cat_visa_category_name

ANY

YES

merch_cat_category_name

ANY

YES

credit_acc_id

ANY

YES

credit_acc_nr_id

ANY

YES

credit_cardholder_acc_id

ANY

YES

credit_cardholder_acc_type

ANY

YES

CreditAccNr_acc_id

ANY

YES

CreditAccNr_acc_nr

ANY

YES

CreditAccNr_acc_nr_id

ANY

YES

CreditAccNr_se_id

ANY

YES

CreditAccNr_state

ANY

YES

Date_Modified

ANY

YES

debit_acc_id

ANY

YES

debit_acc_nr_id

ANY

YES

debit_cardholder_acc_id

ANY

YES

debit_cardholder_acc_type

ANY

YES

DebitAccNr_acc_id

ANY

YES

DebitAccNr_acc_nr

ANY

YES

DebitAccNr_acc_nr_id

ANY

YES

extended_trans_type

ANY

YES

fee

ANY

YES

Fee_amount_id

ANY

YES

merch_cat_fee_cap

ANY

YES

merch_cat_visa_fee_cap

ANY

YES

reward_fee_cap

ANY

YES

Fee_description

ANY

YES

Fee_Discount

ANY

YES

Fee_fee_id

ANY

YES

fee_id

ANY

YES

Fee_name

ANY

YES

Fee_se_id

ANY

YES

merch_cat_category_fee_type

ANY

YES

merch_cat_category_visa_fee_type

ANY

YES

merch_cat_category_merch_discount

ANY

YES

merch_cat_category_visa_merch_discount

ANY

YES

PT_sink_node_name

ANY

YES

post_tran_id

ANY

YES

Miscellaneous

ANY

YES

terminal_owner

ANY

YES

terminal_owner_name

ANY

YES

REF_rpt_account_type

ANY

YES

REF_IsPoolAccount

ANY

YES

From_Account_Type

ANY

YES

To_Account_Type

ANY

YES

Tran_type_description

ANY

YES

Region

ANY

YES

Transaction_Status

ANY

YES

Transaction_type_Impact

ANY

YES

Message_Type_Desc

ANY

YES

Response_Code_description

ANY

YES

REF_BatchId

ANY

YES

Beneficiary_Account

ANY

YES

Rate

ANY

YES

final_fee

ANY

YES

final_fee_type

ANY

YES

settlement_date

ANY

YES


Please what can I do to improve performance of this query

Reply via email to