Hi everyone.

I am running C++ code through the Hadoop Pipes wrapper, and I am looking for tutorials, examples, or any other kind of help with using binary data. My problem is that I work with large chunks of binary data, and converting them to text is not an option. My first question is therefore: can I pass large chunks (>128 MB) of binary data through the Pipes interface? I have not been able to find any documentation on this.

The way I do things now is to bypass the Hadoop input path entirely: the C++ code opens and reads the data directly using the HDFS C API (libhdfs).
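Concretely, my current reader looks roughly like this (a simplified sketch; the connection settings, path, and buffer size are placeholders):

#include <cstdio>
#include <fcntl.h>   // O_RDONLY
#include <vector>

#include "hdfs.h"    // libhdfs, the HDFS C API

int main() {
  // "default"/port 0 picks up the default filesystem from the client config.
  hdfsFS fs = hdfsConnect("default", 0);
  if (fs == NULL) { std::fprintf(stderr, "hdfsConnect failed\n"); return 1; }

  // Placeholder path to one of my binary input files.
  hdfsFile in = hdfsOpenFile(fs, "/data/chunk-0001.bin", O_RDONLY, 0, 0, 0);
  if (in == NULL) { std::fprintf(stderr, "hdfsOpenFile failed\n"); return 1; }

  std::vector<char> buf(1 << 20);  // 1 MB read buffer
  tSize n;
  while ((n = hdfsRead(fs, in, &buf[0], (tSize) buf.size())) > 0) {
    // ... process n bytes of raw binary data ...
  }

  hdfsCloseFile(fs, in);
  hdfsDisconnect(fs);
  return 0;
}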
However, that bypass loses data locality, and the resulting network overhead is too high to be viable at large scale.

If passing binary data directly is not possible with Pipes, I need to write my own RecordReader: one that preserves data locality but does not actually emit the data. I just need to make sure the C++ mapper reads the same data from a local source when it is spawned. The RecordReader does not need to read the data at all; generating a config string that tells the C++ mapper code what to read would be just fine. My second question is therefore: how do I write such a RecordReader, in C++ or Java? I would also like to know how Hadoop maintains data locality between RecordReaders and the map tasks it spawns.
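To make the second question concrete, here is the kind of RecordReader I have in mind, loosely modeled on the wordcount-nopipe.cc example that ships with Hadoop. This is an untested sketch: in particular, the layout of the serialized InputSplit depends on the InputFormat the job uses, so the deserializeString() call is an assumption on my part that would have to match the actual split format.

#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/SerialUtils.hh"

// A RecordReader that never reads the data itself. It parses the
// input split it is handed and emits a single key/value pair telling
// the mapper what to read on its own (e.g. via libhdfs, as above).
class SplitInfoReader : public HadoopPipes::RecordReader {
private:
  bool done;
  std::string splitInfo;
public:
  SplitInfoReader(HadoopPipes::MapContext& context) : done(false) {
    HadoopUtils::StringInStream stream(context.getInputSplit());
    // Assumption: the split starts with a string-serialized filename,
    // as in wordcount-nopipe.cc; offset/length could be carried too.
    HadoopUtils::deserializeString(splitInfo, stream);
  }
  // Emit exactly one "record": the config string for the mapper.
  virtual bool next(std::string& key, std::string& value) {
    if (done) {
      return false;
    }
    key = "split";
    value = splitInfo;
    done = true;
    return true;
  }
  virtual float getProgress() {
    return done ? 1.0f : 0.0f;
  }
};

class BinaryMapper : public HadoopPipes::Mapper {
public:
  BinaryMapper(HadoopPipes::TaskContext& context) {}
  virtual void map(HadoopPipes::MapContext& context) {
    // The value is the config string from SplitInfoReader; this is
    // where I would open and read the file directly instead of
    // receiving the bytes through the pipe.
    std::string toRead = context.getInputValue();
    // ... read and process the binary data, emit results ...
  }
};

class NullReducer : public HadoopPipes::Reducer {
public:
  NullReducer(HadoopPipes::TaskContext& context) {}
  virtual void reduce(HadoopPipes::ReduceContext& context) {}
};

int main(int argc, char* argv[]) {
  // The fifth template argument plugs in the custom RecordReader; the
  // job would also need hadoop.pipes.java.recordreader=false so that
  // the C++ reader is actually used.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<BinaryMapper, NullReducer,
                                   void, void, SplitInfoReader>());
}

The idea is that the reader emits only a few bytes of metadata per split, so the large binary input never has to cross the Java/C++ pipe at all.

Any information is most welcome.

Regards,
GorGo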