[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13664486#comment-13664486 ]

Jay Hacker commented on MAPREDUCE-5018:
---------------------------------------

You're welcome!  

It might be easier to just split your inputs yourself before putting them in 
HDFS (see {{split(1)}}), but perhaps your files are already in HDFS.

JustBytes shouldn't modify or interpret your data at all; it reads an entire 
file in binary, gives those exact bytes to your mapper, and writes out the 
exact bytes your mapper gives back.  It does not know or care about newlines.  
I would encourage you to run {{md5sum}} on your data both outside HDFS and via 
{{mapstream}}, compare the checksums to verify your data is not being changed, 
and let me know if it is.
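
For reference, here is the shape of the input side as a minimal sketch, 
assuming the new-API ({{org.apache.hadoop.mapreduce}}) classes; the names are 
illustrative and the actual classes in the attached patch may differ.  The 
InputFormat refuses to split, and the RecordReader hands each whole file to 
the mapper as a single record of untouched bytes:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Illustrative sketch, not the attached patch: one mapper per file,
// with the whole file delivered as one record of raw bytes.
public class WholeFileInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split: programs see whole files
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    static class WholeFileRecordReader
            extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private TaskAttemptContext context;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Read the entire file verbatim (assumes it fits in memory
            // and is under 2 GB, since BytesWritable is int-indexed).
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(context.getConfiguration());
            FSDataInputStream in = fs.open(file);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() {}
    }
}
{code}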
                
> Support raw binary data with Hadoop streaming
> ---------------------------------------------
>
>                 Key: MAPREDUCE-5018
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/streaming
>            Reporter: Jay Hacker
>            Priority: Minor
>         Attachments: justbytes.jar, MAPREDUCE-5018.patch, mapstream
>
>
> People often have a need to run older programs over many files, and turn to 
> Hadoop streaming as a reliable, performant batch system.  There are good 
> reasons for this:
> 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and 
> it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
> 3. It is reasonably performant: it moves the code to the data, maintaining 
> locality, and scales with the number of nodes.
> Historically, Hadoop is of course oriented toward processing key/value pairs, 
> and so needs to interpret the data passing through it.  Unfortunately, this 
> makes it difficult to use Hadoop streaming with programs that don't deal in 
> key/value pairs, or with binary data in general.  For example, something as 
> simple as running md5sum to verify the integrity of files will not give the 
> correct result, due to Hadoop's interpretation of the data.  
> There have been several attempts at binary serialization schemes for Hadoop 
> streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed 
> at efficiently encoding key/value pairs, and not passing data through 
> unmodified.  Even the "RawBytes" serialization scheme adds length fields to 
> the data, rendering it not-so-raw.
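> To make the framing concrete: as I understand the "rawbytes" scheme, each 
> record on the mapper's stdin is a 4-byte length field followed by the 
> payload, so even a single whole-file record is not byte-identical to the 
> file.  A rough sketch of that framing (illustrative code, not Hadoop's 
> actual classes):
> {code:java}
> import java.io.ByteArrayOutputStream;
> import java.io.DataOutputStream;
> import java.io.IOException;
> 
> public class RawBytesFraming {
>     // <4-byte length><payload>: the length field is what keeps
>     // "RawBytes" from being truly raw.
>     static byte[] frame(byte[] payload) throws IOException {
>         ByteArrayOutputStream buf = new ByteArrayOutputStream();
>         DataOutputStream out = new DataOutputStream(buf);
>         out.writeInt(payload.length);  // length prefix added to the data
>         out.write(payload);            // then the payload bytes
>         out.flush();
>         return buf.toByteArray();
>     }
> }
> {code}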
> I often have a need to run a Unix filter on files stored in HDFS; currently, 
> the only way I can do this on the raw data is to copy the data out and run 
> the filter on one machine, which is inconvenient, slow, and unreliable.  It 
> would be very convenient to run the filter as a map-only job, allowing me to 
> build on existing (well-tested!) building blocks in the Unix tradition 
> instead of reimplementing them as mapreduce programs.
> However, most existing tools don't know about file splits, and so want to 
> process whole files; and of course many expect raw binary input and output.  
> The solution is to run a map-only job with an InputFormat and OutputFormat 
> that just pass raw bytes and don't split.  It turns out to be a little more 
> complicated with streaming; I have attached a patch with the simplest 
> solution I could come up with.  I call the format "JustBytes" (as "RawBytes" 
> was already taken), and it should be usable with most recent versions of 
> Hadoop.
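> To show the other half of the idea, here is a sketch of a byte-preserving 
> output format, again with illustrative names (the classes in the attached 
> patch may differ): the RecordWriter writes each value's bytes verbatim, with 
> no keys, separators, or length fields:
> {code:java}
> import java.io.IOException;
> 
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.BytesWritable;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.mapreduce.RecordWriter;
> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> 
> // Illustrative sketch: values pass through byte-for-byte.
> public class JustBytesOutputFormat
>         extends FileOutputFormat<NullWritable, BytesWritable> {
> 
>     @Override
>     public RecordWriter<NullWritable, BytesWritable> getRecordWriter(
>             TaskAttemptContext context) throws IOException {
>         Path file = getDefaultWorkFile(context, "");
>         final FSDataOutputStream out =
>                 file.getFileSystem(context.getConfiguration()).create(file, false);
>         return new RecordWriter<NullWritable, BytesWritable>() {
>             @Override
>             public void write(NullWritable key, BytesWritable value)
>                     throws IOException {
>                 // Exactly the mapper's bytes: no newlines, no
>                 // separators, no length fields.
>                 out.write(value.getBytes(), 0, value.getLength());
>             }
> 
>             @Override
>             public void close(TaskAttemptContext context) throws IOException {
>                 out.close();
>             }
>         };
>     }
> }
> {code}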
