It depends on what API you use. When writing an InputSplit implementation,
you can specify on which nodes the data resides. I am new to Hadoop, but as
far as I know, doing this should enable support for data locality. Moreover,
implementing a subclass of TextInputFormat that adds some encoding on the
fly should not affect any locality properties.
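The "encoding on the fly" idea from this thread can be sketched without any
Hadoop machinery at all: serialize each binary file as one Base64 line, which
is safe for streaming's line-oriented text protocol. A minimal Python
round-trip (the record layout, tab separator, and file name here are
hypothetical, for illustration only):

```python
import base64


def encode_record(name, payload):
    # One record per line: name TAB base64(payload). Base64 output contains
    # no tabs or newlines, so the line survives streaming's text transport.
    return "%s\t%s" % (name, base64.b64encode(payload).decode("ascii"))


def decode_record(line):
    # Inverse transform, as a streaming mapper would apply to each stdin line.
    name, b64 = line.rstrip("\n").split("\t", 1)
    return name, base64.b64decode(b64)


if __name__ == "__main__":
    # Round-trip a stand-in binary payload covering all 256 byte values.
    raw = bytes(range(256))
    name, payload = decode_record(encode_record("cat.jpg", raw))
    assert (name, payload) == ("cat.jpg", raw)
```

A streaming mapper would simply apply decode_record to every line it reads
from stdin and work on the recovered bytes.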


Piotr


2009/5/15 jason hadoop <jason.had...@gmail.com>

> A downside of this approach is that you will likely not have data locality
> for the data on shared file systems, compared with data coming from an
> input split.
> That being said, from your script, *hadoop dfs -get FILE -* will write the
> file to standard out.
>
> On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk <piotr.prac...@gmail.com
> >wrote:
>
> > just in addition to my previous post...
> >
> > You don't have to store the encoded files in a file system, of course,
> > since you can write your own InputFormat which will do this on the fly;
> > the overhead should not be that big.
> >
> > Piotr
> >
> > 2009/5/14 Piotr Praczyk <piotr.prac...@gmail.com>
> >
> > > Hi
> > >
> > > If you want to read the files from HDFS and cannot pass the binary
> > > data, you can encode it (Base64, for example, though you could devise
> > > something more efficient, since the range of characters acceptable in
> > > the input string is wider than the one used by Base64). It should solve
> > > the problem until some kind of binary input is supported (is it going
> > > to happen?).
> > >
> > > Piotr
> > >
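Piotr's aside about an encoding denser than Base64 can be made concrete:
because more characters are acceptable on a streaming input line than
Base64's 64-symbol alphabet uses, a tighter packing is possible. Python's
standard library happens to ship one such scheme, Base85
(base64.b85encode), which packs 4 bytes into 5 characters instead of 3 into
4, and its alphabet still avoids tabs and newlines. A small sketch of the
size difference (the payload is a stand-in for binary image data):

```python
import base64

payload = bytes(range(256))  # stand-in for binary image data

b64 = base64.b64encode(payload)  # 3 bytes -> 4 chars: ~33% overhead
b85 = base64.b85encode(payload)  # 4 bytes -> 5 chars: ~25% overhead

assert len(b64) == 344 and len(b85) == 320
assert base64.b85decode(b85) == payload
```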
> > > 2009/5/14 openresearch <qiming...@openresearchinc.com>
> > >
> > >
> > >> All,
> > >>
> > >> I have read some recommendations regarding image (binary input)
> > >> processing using Hadoop streaming, which only accepts text out of the
> > >> box for now:
> > >> http://hadoop.apache.org/core/docs/current/streaming.html
> > >> https://issues.apache.org/jira/browse/HADOOP-1722
> > >> http://markmail.org/message/24woaqie2a6mrboc
> > >>
> > >> However, I have not gotten a straight answer.
> > >>
> > >> One recommendation is to put the image data on HDFS, but then we have
> > >> to do "hadoop dfs -get" for each file/dir and process it locally,
> > >> which is very expensive.
> > >>
> > >> Another recommendation is to "...put them in a centralized place
> > >> where all the hadoop nodes can access them (e.g. via an NFS
> > >> mount)..." Obviously, IO then becomes the bottleneck and defeats the
> > >> purpose of distributed processing.
> > >>
> > >> I also noticed that an enhancement ticket is open for hadoop-core.
> > >> Has it been committed to any svn (0.21) branch? Can somebody show me
> > >> an example of how to take *.jpg files (from HDFS) and process them in
> > >> a distributed fashion using streaming?
> > >>
> > >> Many thanks
> > >>
> > >> -Qiming
> > >> --
> > >> View this message in context:
> > >>
> >
> http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
> > >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>
> > >>
> > >
> >
>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>
