Hi everyone.

I am running C++ code through the Hadoop Pipes wrapper, and I am looking for tutorials, examples, or any other kind of help with using binary data. My problem is that I work with large chunks of binary data, and converting them to text is not an option. My first question is therefore: can I pass large chunks (>128 MB) of binary data through the Pipes interface? I have not been able to find any documentation on this.

The way I do things now is to bypass the Hadoop input path entirely: the C++ code opens and reads the data directly using the HDFS C API (libhdfs).
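Concretely, my current reader looks roughly like this (a simplified sketch; the connection settings, path, and buffer size are placeholders):

#include <cstdio>
#include <fcntl.h>   // O_RDONLY
#include <vector>

#include "hdfs.h"    // libhdfs, the HDFS C API

int main() {
  // "default"/port 0 picks up the default filesystem from the client config.
  hdfsFS fs = hdfsConnect("default", 0);
  if (fs == NULL) { std::fprintf(stderr, "hdfsConnect failed\n"); return 1; }

  // Placeholder path to one of my binary input files.
  hdfsFile in = hdfsOpenFile(fs, "/data/chunk-0001.bin", O_RDONLY, 0, 0, 0);
  if (in == NULL) { std::fprintf(stderr, "hdfsOpenFile failed\n"); return 1; }

  std::vector<char> buf(1 << 20);  // 1 MB read buffer
  tSize n;
  while ((n = hdfsRead(fs, in, &buf[0], (tSize) buf.size())) > 0) {
    // ... process n bytes of raw binary data ...
  }

  hdfsCloseFile(fs, in);
  hdfsDisconnect(fs);
  return 0;
}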
However, that bypass loses data locality, and the resulting network overhead is too high to be viable at large scale.

If passing binary data directly is not possible with Pipes, I need to write my own RecordReader: one that preserves data locality but does not actually emit the data. I just need to make sure the C++ mapper reads the same data from a local source when it is spawned. The RecordReader does not need to read the data at all; generating a config string that tells the C++ mapper code what to read would be just fine. My second question is therefore: how do I write such a RecordReader, in C++ or Java? I would also like to know how Hadoop maintains data locality between RecordReaders and the map tasks it spawns.
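To make the second question concrete, here is the kind of RecordReader I have in mind, loosely modeled on the wordcount-nopipe.cc example that ships with Hadoop. This is an untested sketch: in particular, the layout of the serialized InputSplit depends on the InputFormat the job uses, so the deserializeString() call is an assumption on my part that would have to match the actual split format.

#include <string>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/SerialUtils.hh"

// A RecordReader that never reads the data itself. It parses the
// input split it is handed and emits a single key/value pair telling
// the mapper what to read on its own (e.g. via libhdfs, as above).
class SplitInfoReader : public HadoopPipes::RecordReader {
private:
  bool done;
  std::string splitInfo;
public:
  SplitInfoReader(HadoopPipes::MapContext& context) : done(false) {
    HadoopUtils::StringInStream stream(context.getInputSplit());
    // Assumption: the split starts with a string-serialized filename,
    // as in wordcount-nopipe.cc; offset/length could be carried too.
    HadoopUtils::deserializeString(splitInfo, stream);
  }
  // Emit exactly one "record": the config string for the mapper.
  virtual bool next(std::string& key, std::string& value) {
    if (done) {
      return false;
    }
    key = "split";
    value = splitInfo;
    done = true;
    return true;
  }
  virtual float getProgress() {
    return done ? 1.0f : 0.0f;
  }
};

class BinaryMapper : public HadoopPipes::Mapper {
public:
  BinaryMapper(HadoopPipes::TaskContext& context) {}
  virtual void map(HadoopPipes::MapContext& context) {
    // The value is the config string from SplitInfoReader; this is
    // where I would open and read the file directly instead of
    // receiving the bytes through the pipe.
    std::string toRead = context.getInputValue();
    // ... read and process the binary data, emit results ...
  }
};

class NullReducer : public HadoopPipes::Reducer {
public:
  NullReducer(HadoopPipes::TaskContext& context) {}
  virtual void reduce(HadoopPipes::ReduceContext& context) {}
};

int main(int argc, char* argv[]) {
  // The fifth template argument plugs in the custom RecordReader; the
  // job would also need hadoop.pipes.java.recordreader=false so that
  // the C++ reader is actually used.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<BinaryMapper, NullReducer,
                                   void, void, SplitInfoReader>());
}

The idea is that the reader emits only a few bytes of metadata per split, so the large binary input never has to cross the Java/C++ pipe at all.

Any information is most welcome.

Regards,
GorGo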