[ https://issues.apache.org/jira/browse/MAPREDUCE-5018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354299#comment-14354299 ]
Hadoop QA commented on MAPREDUCE-5018:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org
against trunk revision 47f7f18.

{color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5260//console

This message is automatically generated.

> Support raw binary data with Hadoop streaming
> ---------------------------------------------
>
>                 Key: MAPREDUCE-5018
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5018
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: contrib/streaming
>    Affects Versions: 1.1.2
>            Reporter: Jay Hacker
>            Assignee: Steven Willis
>            Priority: Minor
>        Attachments: MAPREDUCE-5018-branch-1.1.patch, MAPREDUCE-5018.patch, MAPREDUCE-5018.patch, justbytes.jar, mapstream
>
>
> People often have a need to run older programs over many files, and turn to Hadoop streaming as a reliable, performant batch system. There are good reasons for this:
> 1. Hadoop is convenient: they may already be using it for mapreduce jobs, and it is easy to spin up a cluster in the cloud.
> 2. It is reliable: HDFS replicates data and the scheduler retries failed jobs.
> 3. It is reasonably performant: it moves the code to the data, maintaining locality, and scales with the number of nodes.
> Historically, Hadoop is of course oriented toward processing key/value pairs, and so needs to interpret the data passing through it. Unfortunately, this makes it difficult to use Hadoop streaming with programs that don't deal in key/value pairs, or with binary data in general. For example, something as simple as running md5sum to verify the integrity of files will not give the correct result, due to Hadoop's interpretation of the data.
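The corruption the description alludes to can be illustrated with a simplified model of streaming's default text framing (newline as the record separator, the first tab as the key/value separator). This is a hypothetical sketch for illustration, not code from the patch; real streaming behavior has more configuration knobs than this:

```python
import hashlib

def streaming_roundtrip(data):
    # Simplified model of an identity mapper under Hadoop streaming's
    # default text framing: input is split into newline-terminated
    # records, each record is split into key/value at the first tab,
    # and the mapper re-emits "key \t value \n". For arbitrary binary
    # data this round-trip is lossy.
    out = bytearray()
    for record in data.split(b"\n"):
        if not record:
            continue
        key, _, value = record.partition(b"\t")
        out += key + b"\t" + value + b"\n"
    return bytes(out)

# Binary data containing bytes the framing treats as delimiters:
raw = b"\x00\x01\tpayload\n\xff\xfe\n"
print(hashlib.md5(raw).hexdigest())
print(hashlib.md5(streaming_roundtrip(raw)).hexdigest())
```

The second record contains no tab, so the identity mapper emits a tab the original data never had, and the two digests disagree; this is the md5sum failure mode described above.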
> There have been several attempts at binary serialization schemes for Hadoop streaming, such as TypedBytes (HADOOP-1722); however, these are still aimed at efficiently encoding key/value pairs, and not passing data through unmodified. Even the "RawBytes" serialization scheme adds length fields to the data, rendering it not-so-raw.
> I often have a need to run a Unix filter on files stored in HDFS; currently, the only way I can do this on the raw data is to copy the data out and run the filter on one machine, which is inconvenient, slow, and unreliable. It would be very convenient to run the filter as a map-only job, allowing me to build on existing (well-tested!) building blocks in the Unix tradition instead of reimplementing them as mapreduce programs.
> However, most existing tools don't know about file splits, and so want to process whole files; and of course many expect raw binary input and output. The solution is to run a map-only job with an InputFormat and OutputFormat that just pass raw bytes and don't split. It turns out to be a little more complicated with streaming; I have attached a patch with the simplest solution I could come up with. I call the format "JustBytes" (as "RawBytes" was already taken), and it should be usable with most recent versions of Hadoop.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)