It depends on which API you use. When writing an InputSplit implementation, it is possible to specify on which nodes the data resides. I am new to Hadoop, but as far as I know, doing this should enable support for data locality. Moreover, implementing a subclass of TextInputFormat and adding some encoding on the fly should not affect any locality properties.
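The on-the-fly encoding idea can be sketched like this in Python (a hypothetical helper pair for a streaming job; the function names are illustrative, not part of any Hadoop API):

```python
import base64

def encode_record(image_bytes):
    # base64 output contains no newline or tab characters, so the record
    # survives streaming's key<TAB>value<NEWLINE> line-oriented framing.
    return base64.b64encode(image_bytes).decode("ascii")

def decode_record(line):
    # Recover the original binary bytes inside the mapper.
    return base64.b64decode(line.strip())

if __name__ == "__main__":
    raw = bytes(range(256))  # stand-in for JPEG data
    line = encode_record(raw)
    assert "\n" not in line and "\t" not in line
    assert decode_record(line) == raw
```

A custom InputFormat could emit such lines directly, so the encoded form never needs to be stored on disk.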
Piotr

2009/5/15 jason hadoop <jason.had...@gmail.com>

> A downside of this approach is that you will likely not have data
> locality for the data on shared file systems, compared with data coming
> from an input split.
> That being said, from your script, *hadoop dfs -get FILE -* will write
> the file to standard out.
>
> On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk <piotr.prac...@gmail.com>wrote:
>
>> Just in addition to my previous post...
>>
>> You don't have to store the encoded files in a file system, of course,
>> since you can write your own InputFormat which will do this on the
>> fly... the overhead should not be that big.
>>
>> Piotr
>>
>> 2009/5/14 Piotr Praczyk <piotr.prac...@gmail.com>
>>
>>> Hi
>>>
>>> If you want to read the files from HDFS and cannot pass the binary
>>> data, you can do some encoding of it (base64, for example, but you
>>> can think about something more efficient, since the range of
>>> characters acceptable in the input string is wider than that used by
>>> base64). It should solve the problem until some kind of binary input
>>> is supported (is it going to happen?).
>>>
>>> Piotr
>>>
>>> 2009/5/14 openresearch <qiming...@openresearchinc.com>
>>>
>>>> All,
>>>>
>>>> I have read some recommendations regarding image (binary input)
>>>> processing using Hadoop streaming, which only accepts text
>>>> out-of-the-box for now:
>>>> http://hadoop.apache.org/core/docs/current/streaming.html
>>>> https://issues.apache.org/jira/browse/HADOOP-1722
>>>> http://markmail.org/message/24woaqie2a6mrboc
>>>>
>>>> However, I have not got any straight answer.
>>>>
>>>> One recommendation is to put image data on HDFS, but then we have to
>>>> do "hadoop dfs -get" for each file/dir and process it locally, which
>>>> is very expensive.
>>>>
>>>> Another recommendation is to "...put them in a centralized place
>>>> where all the hadoop nodes can access them (via, e.g., NFS
>>>> mount)...". Obviously, IO will become the bottleneck and it defeats
>>>> the purpose of distributed processing.
>>>>
>>>> I also notice some enhancement ticket is open for hadoop-core. Is it
>>>> committed to any svn (0.21) branch? Can somebody show me an example
>>>> of how to take *.jpg files (from HDFS) and process them in a
>>>> distributed fashion using streaming?
>>>>
>>>> Many thanks
>>>>
>>>> -Qiming
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
>>>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
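Jason's *hadoop dfs -get FILE -* tip above could be driven from a streaming mapper along these lines (a minimal Python sketch; the fetch command comes from the thread, while the mapper shape, function names, and path<TAB>size output are assumptions for illustration):

```python
import subprocess
import sys

def fetch_command(hdfs_path):
    # Command quoted in the thread: "hadoop dfs -get FILE -" writes the
    # file to standard out, so its bytes can be captured directly.
    return ["hadoop", "dfs", "-get", hdfs_path, "-"]

def map_stream(lines, run=subprocess.check_output):
    # Hypothetical mapper: each input line is an HDFS path; fetch the
    # file's bytes and emit "path<TAB>size". The 'run' callable is
    # injectable so the fetch can be stubbed when no cluster is available.
    for line in lines:
        path = line.strip()
        if not path:
            continue
        data = run(fetch_command(path))
        sys.stdout.write("%s\t%d\n" % (path, len(data)))

if __name__ == "__main__":
    map_stream(sys.stdin)
```

Note that, as Jason points out, fetching from a shared file system this way gives up the data locality you would get from a proper input split.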