Hi Qiming,

You might consider using Dumbo, which is a Python wrapper for Hadoop Streaming. The associated typedbytes module makes it easy for streaming programs to work with binary data:
http://wiki.github.com/klbostee/dumbo
http://wiki.github.com/klbostee/typedbytes
http://dumbotics.com/2009/03/03/indexing-typed-bytes/

If you are using an older version of Hadoop (such as 0.18.3), you will need to apply the following patches to Hadoop to make typedbytes work:

https://issues.apache.org/jira/browse/HADOOP-1722
https://issues.apache.org/jira/browse/HADOOP-5450

The commands you use to apply the patches might look something like this:

cd <HADOOP_HOME>
patch -p0 < HADOOP-1722-branch-0.18.patch
patch -p0 < HADOOP-5450.patch
ant package

The guy who put Dumbo together, Klaas Bosteels, is incredibly helpful, and he continues to improve this useful project.

Zak

On Thu, May 14, 2009 at 12:39 PM, openresearch <qiming...@openresearchinc.com> wrote:
>
> All,
>
> I have read some recommendations regarding image (binary input) processing
> using Hadoop streaming, which only accepts text out of the box for now.
> http://hadoop.apache.org/core/docs/current/streaming.html
> https://issues.apache.org/jira/browse/HADOOP-1722
> http://markmail.org/message/24woaqie2a6mrboc
>
> However, I have not got any straight answer.
>
> One recommendation is to put image data on HDFS, but we have to do "hdf
> -get" for each file/dir and process it locally, which is very expensive.
>
> Another recommendation is to "...put them in a centralized place where all
> the hadoop nodes can access them (via, e.g., NFS mount)..." Obviously, I/O
> will become a bottleneck, which defeats the purpose of distributed processing.
>
> I also noticed that an enhancement ticket is open for hadoop-core. Is it
> committed to any svn (0.21) branch? Can somebody show me an example of how to
> take *.jpg files (from HDFS) and process them in a distributed fashion
> using streaming?
>
> Many thanks
>
> -Qiming
> --
> View this message in context:
> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
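
P.S. In case it helps to see why typed bytes suit binary input such as JPEG files, here is a rough sketch of the framing idea in plain Python. It is not the typedbytes module itself and does not use its API. It assumes, from my reading of the format described in HADOOP-1722, that each value is a one-byte type code followed by a big-endian 32-bit length and the payload, with code 0 for raw bytes and code 7 for a UTF-8 string. The function names and the sample file name are made up for illustration, so please check the details against the real module before relying on them.

```python
import struct

# Type codes as I understand the typed bytes format (assumption, verify
# against the typedbytes documentation before relying on them).
TYPE_BYTES = 0   # raw byte sequence
TYPE_STRING = 7  # UTF-8 encoded string


def encode_bytes(data):
    """Frame raw bytes: 1-byte type code, 4-byte big-endian length, payload."""
    return struct.pack(">bi", TYPE_BYTES, len(data)) + data


def encode_string(text):
    """Frame a string the same way, using its UTF-8 encoding as payload."""
    raw = text.encode("utf-8")
    return struct.pack(">bi", TYPE_STRING, len(raw)) + raw


def decode(buf, offset=0):
    """Decode one framed value; return (value, offset of the next value)."""
    code, length = struct.unpack_from(">bi", buf, offset)
    start = offset + 5  # 1 byte of type code + 4 bytes of length
    payload = buf[start:start + length]
    if code == TYPE_STRING:
        value = payload.decode("utf-8")
    elif code == TYPE_BYTES:
        value = bytes(payload)
    else:
        raise ValueError("type code %d is not handled in this sketch" % code)
    return value, start + length


# A hypothetical (file name, image bytes) record. The payload contains
# newline and tab bytes, which would break line-oriented text streaming.
image = b"\xff\xd8\xff\xe0\n\t\x00binary"
record = encode_string("cat.jpg") + encode_bytes(image)

name, pos = decode(record)
data, end = decode(record, pos)
```

Because every value carries its own length, the reader never has to scan for a newline or tab delimiter, so arbitrary bytes in the image survive the trip through the streaming pipe.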