Jay Hacker created MAPREDUCE-5018:
-------------------------------------

             Summary: Support raw binary data with Hadoop streaming
                 Key: MAPREDUCE-5018
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Components: contrib/streaming
            Reporter: Jay Hacker
            Priority: Minor


People often need to run existing programs over many files, and turn to 
Hadoop streaming as a reliable, performant batch system.  There are good 
reasons for this:

1. Hadoop is convenient: they may already be using it for MapReduce jobs, and 
it is easy to spin up a cluster in the cloud.
2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
3. It is reasonably performant: it moves the code to the data, maintaining 
locality, and scales with the number of nodes.

Historically, Hadoop is of course oriented toward processing key/value pairs, 
and so needs to interpret the data passing through it.  Unfortunately, this 
makes it difficult to use Hadoop streaming with programs that don't deal in 
key/value pairs, or with binary data in general.  For example, something as 
simple as running md5sum to verify the integrity of files will not give the 
correct result: the default input format splits the data on newlines and 
strips them, and streaming inserts its own tab and newline delimiters between 
keys and values, so arbitrary binary data does not pass through unmodified.

There have been several attempts at binary serialization schemes for Hadoop 
streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed at 
efficiently encoding key/value pairs, and not passing data through unmodified.  
Even the "RawBytes" serialization scheme adds length fields to the data, 
rendering it not-so-raw.
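
To illustrate: as I understand it, "rawbytes" frames each key and value with a 
four-byte length prefix, so even a byte-for-byte pass-through job rewrites the 
stream.  A minimal sketch of that framing (illustrative, not code from the 
streaming source):

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RawBytesFraming {
  // Sketch of the framing "rawbytes" applies to each key and value:
  // a 4-byte big-endian length prefix, then the bytes themselves.
  static byte[] frame(byte[] record) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(baos);
    out.writeInt(record.length);  // length field added to the data
    out.write(record);
    out.flush();
    return baos.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    byte[] framed = frame(new byte[] {0x01, 0x02, 0x03});
    System.out.println(framed.length);  // 7, not 3: the stream is modified
  }
}
{code}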

I often need to run a Unix filter on files stored in HDFS; currently, the only 
way I can do this on the raw data is to copy the data out of HDFS and run the 
filter on one machine, which is inconvenient, slow, and unreliable.  It would 
be very convenient to run the filter as a map-only job, allowing me to build on 
existing (well-tested!) tools in the Unix tradition instead of reimplementing 
them as MapReduce programs.

However, most existing tools don't know about file splits, and so want to 
process whole files; and of course many expect raw binary input and output.  
The solution is to run a map-only job with an InputFormat and OutputFormat that 
just pass raw bytes and don't split.  It turns out to be a little more 
complicated with streaming; I have attached a patch with the simplest solution 
I could come up with.  I call the format "JustBytes" (as "RawBytes" was already 
taken), and it should be usable with most recent versions of Hadoop.
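
To sketch the idea (an illustration of the approach, not the attached patch; 
the names here are made up), a whole-file input format for the old mapred API 
that streaming uses might look like this:

{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/**
 * Illustrative whole-file input format: each file becomes exactly one
 * record whose value is the file's raw bytes, unsplit and uninterpreted.
 */
public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // never split, so tools see whole files
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new WholeFileRecordReader((FileSplit) split, job);
  }

  static class WholeFileRecordReader
      implements RecordReader<NullWritable, BytesWritable> {
    private final FSDataInputStream in;
    private final long length;
    private boolean done = false;

    WholeFileRecordReader(FileSplit split, JobConf job) throws IOException {
      Path path = split.getPath();
      length = split.getLength();
      in = path.getFileSystem(job).open(path);
    }

    @Override
    public boolean next(NullWritable key, BytesWritable value)
        throws IOException {
      if (done) {
        return false;
      }
      byte[] buf = new byte[(int) length];  // assumes the file fits in memory
      IOUtils.readFully(in, buf, 0, buf.length);
      value.set(buf, 0, buf.length);
      done = true;
      return true;
    }

    @Override public NullWritable createKey() { return NullWritable.get(); }
    @Override public BytesWritable createValue() { return new BytesWritable(); }
    @Override public long getPos() { return done ? length : 0; }
    @Override public float getProgress() { return done ? 1.0f : 0.0f; }
    @Override public void close() throws IOException { in.close(); }
  }
}
{code}

An analogous output format would write each value's bytes verbatim, with no 
keys and no separators.  The remaining streaming-specific wrinkle, which the 
patch has to deal with, is getting the bytes to and from the external program 
without streaming inserting its usual tab and newline delimiters.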

