[ https://issues.apache.org/jira/browse/IMPALA-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163457#comment-17163457 ]
Peter Vary commented on IMPALA-9923:
------------------------------------

Here is what I have found in the namenode logs:
{code:java}
[petervary:~/Downloads/impala-private-parameterized-DEBUG-7389] 12s $ grep "/test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451213/base_0000002/_orc_acid_version" ./logs/cluster/cdh7-node-1/hadoop-hdfs/hdfs-namenode.log
2020-07-20 07:50:51,170 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073747151_6327, replicas=127.0.0.1:31002, 127.0.0.1:31000, 127.0.0.1:31001 for /test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451213/base_0000002/_orc_acid_version
2020-07-20 07:50:51,180 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073747151_6327 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451213/base_0000002/_orc_acid_version
{code}
Also, there are plenty of things like this in the HS2 log:
{code:java}
[petervary:~/Downloads/impala-private-parameterized-DEBUG-7389] 1 $ grep "that was not committed" ./logs/cluster/hive-server2.log
2020-07-20T07:50:56,752 INFO [HiveServer2-Background-Pool: Thread-4612] FileOperations: Deleting hdfs://localhost:20500/test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451179/base_0000002/bucket_00001_0 that was not committed
2020-07-20T07:50:56,755 INFO [HiveServer2-Background-Pool: Thread-4612] FileOperations: Deleting hdfs://localhost:20500/test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451179/base_0000002/bucket_00001_1 that was not committed
{code}
My understanding is that the above should happen only if there are multiple attempts for a task.
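The namenode log search above can also be scripted. What follows is a minimal sketch, assuming the FSNamesystem message keeps the exact wording quoted above; `find_incomplete_blocks` and the regular expression are illustrative names, not part of Hadoop or Impala:

```python
import re
from collections import defaultdict

# Pattern modelled on the FSNamesystem line quoted above. Other Hadoop
# versions may word this message differently (assumption).
NOT_COMPLETE = re.compile(
    r"BLOCK\* (?P<block>blk_\d+_\d+) is COMMITTED but not COMPLETE"
    r"\(numNodes= (?P<nodes>\d+) < minimum = (?P<minimum>\d+)\)"
    r" in file (?P<path>\S+)"
)


def find_incomplete_blocks(lines):
    """Map each file path to the blocks logged as COMMITTED but not COMPLETE."""
    hits = defaultdict(list)
    for line in lines:
        match = NOT_COMPLETE.search(line)
        if match:
            hits[match["path"]].append(match["block"])
    return dict(hits)
```

Passing an open log file, for example `find_incomplete_blocks(open("hdfs-namenode.log"))`, would list every affected file, which may help to see whether only the {{_orc_acid_version}} files are hit.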
([~kuczoram] can help me here with some more detailed explanation, I think)

So based on the above, I think the cluster is overloaded/badly configured/whatever, and Hive recovers from many of the retries with [~kuczoram]'s code, but finally, when the HDFS write of the {{_orc_acid_version}} succeeds but the state is {{COMMITTED but not COMPLETE}}, this kills us.
The status was introduced by https://issues.apache.org/jira/browse/HDFS-8999, and there was a follow-up issue, https://issues.apache.org/jira/browse/HDFS-14429, which might not be related.

What I would do is:
* check the config of the test cluster and try to optimize it, or
* contact someone from the HDFS team to further analyze the issue.

Thanks,
Peter

> Data loading of TPC-DS ORC fails with "Fail to get checksum"
> ------------------------------------------------------------
>
> Key: IMPALA-9923
> URL: https://issues.apache.org/jira/browse/IMPALA-9923
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Reporter: Tim Armstrong
> Assignee: Zoltán Borók-Nagy
> Priority: Critical
> Labels: broken-build, flaky
> Attachments: load-tpcds-core-hive-generated-orc-def-block.sql,
> load-tpcds-core-hive-generated-orc-def-block.sql.log
>
>
> {noformat}
> INFO : Loading data to table tpcds_orc_def.store_sales partition
> (ss_sold_date_sk=null) from
> hdfs://localhost:20500/test-warehouse/managed/tpcds.store_sales_orc_def
> INFO :
> ERROR : FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.MoveTask. java.io.IOException: Fail to get
> checksum, since file
> /test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451646/base_0000003/_orc_acid_version
> is under construction.
> INFO : Completed executing
> command(queryId=ubuntu_20200707055650_a1958916-1e85-4db5-b1bc-cc63d80b3537);
> Time taken: 14.512 seconds
> INFO : OK
> Error: Error while compiling statement: FAILED: Execution Error, return code
> 1 from org.apache.hadoop.hive.ql.exec.MoveTask.
java.io.IOException: Fail to
> get checksum, since file
> /test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451646/base_0000003/_orc_acid_version
> is under construction. (state=08S01,code=1)
> java.sql.SQLException: Error while compiling statement: FAILED: Execution
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask.
> java.io.IOException: Fail to get checksum, since file
> /test-warehouse/managed/tpcds.store_sales_orc_def/ss_sold_date_sk=2451646/base_0000003/_orc_acid_version
> is under construction.
> at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:401)
> at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:266)
> at org.apache.hive.beeline.Commands.executeInternal(Commands.java:1007)
> at org.apache.hive.beeline.Commands.execute(Commands.java:1217)
> at org.apache.hive.beeline.Commands.sql(Commands.java:1146)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1497)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1355)
> at org.apache.hive.beeline.BeeLine.executeFile(BeeLine.java:1329)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1127)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1082)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:546)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:528)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:318)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:232)
> Closing: 0: jdbc:hive2://localhost:11050/default;auth=none
> {noformat}
> https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/11223/

--
This message was sent by Atlassian Jira
(v8.3.4#803005)