[ 
https://issues.apache.org/jira/browse/HADOOP-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Douglas resolved HADOOP-2032.
-----------------------------------

    Resolution: Duplicate

Fixed by HADOOP-2033

> distcp split generation does not work correctly
> -----------------------------------------------
>
>                 Key: HADOOP-2032
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2032
>             Project: Hadoop
>          Issue Type: Bug
>          Components: util
>            Reporter: Runping Qi
>
> With the current implementation, distcp will always assign multiple files to
> one mapper to copy, no matter how large the files are. This is because the
> CopyFiles class uses a sequence file to store the list of files to be copied,
> one record per file. CopyFiles correctly generates one split per record in
> the sequence file. However, due to the way the sequence file record reader
> works, the minimum unit for splits is the segment between two "sync marks"
> in the sequence file.
> This results in the strange behavior that some mappers get zero records (zero
> files to copy) even though their split lengths are non-zero, while other
> mappers get multiple records (multiple files to copy) from their split (and
> beyond, up to the next sync mark).
> When CopyFiles creates the sequence file, it does try to place a sync mark
> between splittable segments by calling the sync() function of the sequence
> file record writer.
> Unfortunately, the sync() function is a no-op for files that are not block
> compressed.
> After I changed the compression type for the sequence file to block
> compression, mappers got the correct records from their splits.
> So a simple fix is to change the compression type to CompressionType.BLOCK:
> {code}
>     // create src list
>     SequenceFile.Writer writer = SequenceFile.createWriter(
>         jobDirectory.getFileSystem(jobConf), jobConf, srcfilelist,
>         LongWritable.class, FilePair.class,
>         SequenceFile.CompressionType.BLOCK);
> {code}
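
To make the reported behavior concrete, here is a minimal sketch of how sync marks decide which split receives which records. It is a simplified model, not Hadoop's reader code: the class SplitSyncSketch and its methods are hypothetical, and it assumes a reader that starts at the first sync mark at or after its split start and stops at the first sync mark at or after its split end.

```java
import java.util.Arrays;

// Hypothetical, simplified model of sync-mark-aligned split reading.
// It does not use or reproduce any Hadoop class.
public class SplitSyncSketch {

    // Smallest sync position >= pos, or fileLength if there is none.
    // syncPositions is assumed to be sorted in ascending order.
    public static long nextSync(long[] syncPositions, long pos, long fileLength) {
        for (long s : syncPositions) {
            if (s >= pos) {
                return s;
            }
        }
        return fileLength;
    }

    // Counts the records each split would read. splits[i] = {start, end}.
    public static int[] recordsPerSplit(long[] recordStarts, long[] syncPositions,
                                        long[][] splits, long fileLength) {
        int[] counts = new int[splits.length];
        for (int i = 0; i < splits.length; i++) {
            // The reader skips forward to the first sync mark in or after the split...
            long from = nextSync(syncPositions, splits[i][0], fileLength);
            // ...and keeps reading until the first sync mark at or past the split end.
            long to = nextSync(syncPositions, splits[i][1], fileLength);
            for (long r : recordStarts) {
                if (r >= from && r < to) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long fileLength = 400;
        long[] records = {0, 100, 200, 300};  // one record per file to copy
        long[][] splits = {{0, 100}, {100, 200}, {200, 300}, {300, 400}};

        // Sync marks only at offsets 0 and 200 (sync() had no effect elsewhere).
        long[] sparseSync = {0, 200};
        System.out.println(Arrays.toString(
            recordsPerSplit(records, sparseSync, splits, fileLength)));

        // A sync mark before every record (the intended layout).
        long[] denseSync = {0, 100, 200, 300};
        System.out.println(Arrays.toString(
            recordsPerSplit(records, denseSync, splits, fileLength)));
    }
}
```

Under this model, the sparse layout gives [2, 0, 2, 0]: two splits read two records each and two splits read nothing, even though every split has a non-zero length. With a sync mark before every record, the result is [1, 1, 1, 1].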

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
