[ https://issues.apache.org/jira/browse/HIVE-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sankar Hariappan updated HIVE-19248:
------------------------------------
Description:
Hive replication uses Hadoop distcp to copy files from the primary to the replica warehouse. If the HDFS block size differs across clusters, it causes file copy failures.
{code}
2018-04-09 14:32:06,690 ERROR [main] org.apache.hadoop.tools.mapred.CopyMapper: Failure in copying hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 to hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
java.io.IOException: File copy failed: hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 --> hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
    at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:299)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:266)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 to hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
    at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
    at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:296)
    ... 10 more
Caused by: java.io.IOException: Check-sum mismatch between hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 and hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/.distcp.tmp.attempt_1522833620762_4416_m_000000_0. Source and target differ in block-size. Use -pb to preserve block-sizes during copy. Alternatively, skip checksum-checks altogether, using -skipCrc. (NOTE: By skipping checksums, one runs the risk of masking data-corruption during file-transfer.)
    at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.compareCheckSums(RetriableFileCopyCommand.java:212)
    at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:130)
    at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
    at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
    ... 11 more
{code}
Also, REPL LOAD returns success even if the distcp jobs failed.
So, two things need to be done:
# Set proper options for distcp to preserve the block size and skip the CRC check. Use options such as *-pugpbx*, *-update* and *-skipcrccheck*.
# If the copy of multiple files fails for some reason, check whether any files were copied completely by verifying the checksum and file size, and skip those files on retry.

was:
This is the case where events were deleted on the source because of old-event purging, and hence min(source event id) > target event id (last replicated event id). Repl dump should fail in this case so that the user can drop the database and bootstrap again. The cleaner thread concurrently removes expired events from the NOTIFICATION_LOG table, so it is necessary to check whether the current dump missed any event while dumping.
After fetching events in batches, we shall check that they were fetched in a contiguous sequence of event ids. If the sequence is not contiguous, some events were likely missed in the dump, so an error is thrown.

> Hive replication cause file copy failures if HDFS block size differs across clusters
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-19248
>                 URL: https://issues.apache.org/jira/browse/HIVE-19248
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, repl
>    Affects Versions: 3.0.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>            Priority: Major
>              Labels: DR, pull-request-available, replication
>             Fix For: 3.1.0

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
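
A minimal sketch of the retry filter proposed in the description (skip files that were already copied completely), written in Python for brevity. It is not the Hive implementation: `FileInfo` and `files_to_retry` are hypothetical names, and the lengths, checksum strings, and file names other than 000259_0 are made up for illustration. The sketch assumes the checksums reported by both clusters are comparable, which, going by the distcp error above, requires the block size to be preserved during copy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for the length and checksum a file system reports."""
    length: int
    checksum: str


def files_to_retry(source: Dict[str, FileInfo],
                   target: Dict[str, FileInfo]) -> List[str]:
    """Return the source paths that still need to be copied, in sorted order."""
    pending = []
    for path, src in sorted(source.items()):
        dst = target.get(path)
        # A file counts as completely copied only if it exists on the target
        # and both its size and its checksum match the source.
        if dst is not None and dst.length == src.length and dst.checksum == src.checksum:
            continue
        pending.append(path)
    return pending


# Illustrative data: one complete copy, one partial copy, one missing file.
source = {
    "customer/000258_0": FileInfo(1024, "aa11"),
    "customer/000259_0": FileInfo(2048, "bb22"),
    "customer/000260_0": FileInfo(4096, "cc33"),
}
target = {
    "customer/000258_0": FileInfo(1024, "aa11"),  # copied completely
    "customer/000259_0": FileInfo(1000, "zz99"),  # partial copy
}
print(files_to_retry(source, target))
```

With the data above, only `customer/000259_0` and `customer/000260_0` are returned for retry. Checking the size alone would be cheaper, but it would not catch a corrupted copy of the right length, which is why the description asks for both checks.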