[jira] [Updated] (HIVE-19248) Hive replication cause file copy failures if HDFS block size differs across clusters

Sankar Hariappan (JIRA) Thu, 19 Apr 2018 10:53:18 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-19248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sankar Hariappan updated HIVE-19248:
------------------------------------
    Description: 
Hive replication uses Hadoop distcp to copy files from primary to replica 
warehouse. If the HDFS block size is different across clusters, it cause file 
copy failures.

{code}
2018-04-09 14:32:06,690 ERROR [main] org.apache.hadoop.tools.mapred.CopyMapper: 
Failure in copying 
hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 to 
hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
java.io.IOException: File copy failed: 
hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 --> 
hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
 at 
org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:299)
 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:266)
 at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
 at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
 at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
Caused by: java.io.IOException: Couldn't run retriable-command: Copying 
hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 to 
hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
 at 
org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
 at 
org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:296)
 ... 10 more
Caused by: java.io.IOException: Check-sum mismatch between 
hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 and 
hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/.distcp.tmp.attempt_1522833620762_4416_m_000000_0.
 Source and target differ in block-size. Use -pb to preserve block-sizes during 
copy. Alternatively, skip checksum-checks altogether, using -skipCrc. (NOTE: By 
skipping checksums, one runs the risk of masking data-corruption during 
file-transfer.)
 at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.compareCheckSums(RetriableFileCopyCommand.java:212)
 at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:130)
 at 
org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
 at 
org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
 ... 11 more
{code}

Also, REPL LOAD returns success even if distcp jobs failed.

So, need to perform 2 things.
 # Set proper options for distcp to preserve the block size and skip CRC check. 
Use options such as *-pugpbx, -update* and *-skipcrccheck.*
 # If copy of multiple files fail for some reason, need to check if any files 
completely copied by verifying the checksum and file size and skip those from 
retry.

 

  was:
This is the case where the events were deleted on source because of old event 
purging and hence min(source event id) > target event id (last replicated event 
id).

Repl dump should fail in this case so that user can drop the database and 
bootstrap again.

Cleaner thread is concurrently removing the expired events from 
NOTIFICATION_LOG table. So, it is necessary to check if the current dump missed 
any event while dumping. After fetching events in batches, we shall check if it 
is fetched in contiguous sequence of event id. If it is not in contiguous 
sequence, then likely some events missed in the dump and hence throw error.


> Hive replication cause file copy failures if HDFS block size differs across 
> clusters
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-19248
>                 URL: https://issues.apache.org/jira/browse/HIVE-19248
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2, repl
>    Affects Versions: 3.0.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>            Priority: Major
>              Labels: DR, pull-request-available, replication
>             Fix For: 3.1.0
>
>
> Hive replication uses Hadoop distcp to copy files from primary to replica 
> warehouse. If the HDFS block size is different across clusters, it cause file 
> copy failures.
> {code}
> 2018-04-09 14:32:06,690 ERROR [main] 
> org.apache.hadoop.tools.mapred.CopyMapper: Failure in copying 
> hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 to 
> hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
> java.io.IOException: File copy failed: 
> hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 
> --> 
> hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:299)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:266)
>  at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:52)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1869)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
> Caused by: java.io.IOException: Couldn't run retriable-command: Copying 
> hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 to 
> hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/000259_0
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
>  at 
> org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:296)
>  ... 10 more
> Caused by: java.io.IOException: Check-sum mismatch between 
> hdfs://chelsea/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/000259_0 
> and 
> hdfs://marilyn/apps/hive/warehouse/tpch_flat_orc_1000.db/customer/.hive-staging_hive_2018-04-09_14-30-45_723_7153496419225102220-2/-ext-10001/.distcp.tmp.attempt_1522833620762_4416_m_000000_0.
>  Source and target differ in block-size. Use -pb to preserve block-sizes 
> during copy. Alternatively, skip checksum-checks altogether, using -skipCrc. 
> (NOTE: By skipping checksums, one runs the risk of masking data-corruption 
> during file-transfer.)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.compareCheckSums(RetriableFileCopyCommand.java:212)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:130)
>  at 
> org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:99)
>  at 
> org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
>  ... 11 more
> {code}
> Also, REPL LOAD returns success even if distcp jobs failed.
> So, need to perform 2 things.
>  # Set proper options for distcp to preserve the block size and skip CRC 
> check. Use options such as *-pugpbx, -update* and *-skipcrccheck.*
>  # If copy of multiple files fail for some reason, need to check if any files 
> completely copied by verifying the checksum and file size and skip those from 
> retry.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Updated] (HIVE-19248) Hive replication cause file copy failures if HDFS block size differs across clusters

Reply via email to