[ 
https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854268#action_12854268
 ] 

David Ciemiewicz commented on MAPREDUCE-469:
--------------------------------------------

Unfortunately, I did not discover that concatenated bzip2 files do not work in 
Map-Reduce until *AFTER* I had concatenated 3TB of data in over 250K compressed 
files.

A colleague suggested that I "fix" my data using the following approach:

hadoop dfs -cat X | bunzip2 | bzip2 | hadoop dfs -put - X.new
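
As a rough local illustration of what that pipeline does (not the Hadoop 
command itself), here is a minimal Python sketch using the standard bz2 
module. The function name recompress_as_single_stream is hypothetical, and 
the sketch assumes the data fits in memory, which would not hold for 
multi-GB inputs.

```python
import bz2

def recompress_as_single_stream(concatenated: bytes) -> bytes:
    """Decompress all concatenated bzip2 streams, then recompress as one stream."""
    # bz2.decompress() accepts input made of several concatenated streams
    # (Python 3.3+), so this plays the role of `bunzip2` in the pipeline.
    raw = bz2.decompress(concatenated)
    # A single compress() call yields exactly one bzip2 stream, like `bzip2`.
    return bz2.compress(raw)

# Build a small concatenation of two independently compressed parts.
part_a = bz2.compress(b"first part\n")
part_b = bz2.compress(b"second part\n")
fixed = recompress_as_single_stream(part_a + part_b)

# A decoder that only understands one stream now sees all of the data.
decoder = bz2.BZ2Decompressor()
assert decoder.decompress(fixed) == b"first part\nsecond part\n"
assert decoder.eof and decoder.unused_data == b""
```

Every byte has to pass through both the decompressor and the compressor, 
which is where the CPU cost of this approach comes from.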

I tried this approach on a single 3GB file made by concatenating multiple 
bzip2-compressed files.

This process took just over an hour, with compression taking 5-6X longer than 
decompression (as measured by CPU utilization).

By contrast, it took only several minutes to concatenate the multiple part 
files into a single file.


I think this shows that decompressing and recompressing data is not really a 
viable way to create large concatenations of smaller files.

The best performing solution is to create the smaller part files in parallel 
with a bunch of reducers, then concatenate them later into one (or several) 
larger files.

So fixing Hadoop Map-Reduce to read concatenated files is probably the 
highest return on investment for the community.
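
To make concrete what "reading concatenations" means at the stream level, 
here is a small Python sketch (again using the standard bz2 module, not 
Hadoop's codec classes). The helper names read_first_stream_only and 
read_all_streams are hypothetical. The first mimics a reader that stops at 
the end of the first stream; the second restarts a fresh decoder on the 
leftover bytes until the input is exhausted.

```python
import bz2

def read_first_stream_only(data: bytes) -> bytes:
    """Decode a single bzip2 stream; anything after it is silently ignored."""
    decoder = bz2.BZ2Decompressor()
    return decoder.decompress(data)

def read_all_streams(data: bytes) -> bytes:
    """Decode every concatenated bzip2 stream, one decoder per stream."""
    out = []
    while data:
        decoder = bz2.BZ2Decompressor()
        out.append(decoder.decompress(data))
        if not decoder.eof:
            raise ValueError("truncated bzip2 stream")
        # Bytes found after the end of this stream belong to the next one.
        data = decoder.unused_data
    return b"".join(out)

# Three separately compressed part files, concatenated byte-for-byte.
parts = [bz2.compress(b"part-%d\n" % i) for i in range(3)]
concatenated = b"".join(parts)

print(read_first_stream_only(concatenated))  # b'part-0\n'
print(read_all_streams(concatenated))        # b'part-0\npart-1\npart-2\n'
```

The second reader adds no recompression cost: it only needs to notice the 
end of one stream and carry on with the next.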




> Support concatenated gzip and bzip2 files
> -----------------------------------------
>
>                 Key: MAPREDUCE-469
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Tom White
>            Assignee: Ravi Gummadi
>
> When running MapReduce with concatenated gzip files as input only the first 
> part is read, which is confusing, to say the least. Concatenated gzip is 
> described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage 
> and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at 
> http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)
