Pig does serialize some classes out to the jobConf (I believe it is a Writable
with base64 encoding to turn the bytes into chars).  This has been problematic
in the past because there are resource limits placed on the jobConf so that it
does not use up too much memory on the JobTracker.  If it is just a small
amount of data, then the jobConf is probably the simplest place to put it.  If
it starts to get large, then I would suggest that you write it out to HDFS with
a high replication factor and send it through the distributed cache.  The job
conf itself is just a file written to HDFS that is shipped to the tasks through
the distributed cache anyway.
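
Roughly, the jobConf route looks something like the sketch below (the helper
class and the conf key are made-up names here, not anything Pig or Hadoop
ships):

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;

public class ConfWritableUtil {

  // Serialize a Writable to bytes and store it base64-encoded under the given key.
  public static void writeToConf(Configuration conf, String key, Writable value)
      throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    value.write(out);
    out.close();
    conf.set(key, new String(Base64.encodeBase64(bytes.toByteArray()), "UTF-8"));
  }

  // Decode the base64 string from the conf and populate the Writable via readFields().
  public static void readFromConf(Configuration conf, String key, Writable value)
      throws IOException {
    byte[] bytes = Base64.decodeBase64(conf.get(key).getBytes("UTF-8"));
    value.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
  }
}

On the task side you would call readFromConf(context.getConfiguration(), ...)
from the mapper's setup() with a freshly constructed instance of your class.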

--Bobby Evans

On 9/27/11 5:42 PM, "Zhiwei Xiao" <zwx...@gmail.com> wrote:

Hi,

My application needs to send some objects to the map tasks, which specify how to
process the input records. I know I can transfer them as strings via the
configuration, but I would prefer to leverage the Hadoop Writable interface,
since the objects require recursive serialization.
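
To be concrete, by "recursive serialization" I mean something like the sketch
below (Rule is just a made-up example, not my real class): the object delegates
to the write()/readFields() of its nested children.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class Rule implements Writable {
  private Text name = new Text();
  private List<Rule> children = new ArrayList<Rule>();

  public void write(DataOutput out) throws IOException {
    name.write(out);
    out.writeInt(children.size());
    for (Rule child : children) {
      child.write(out);          // recurse into nested objects
    }
  }

  public void readFields(DataInput in) throws IOException {
    name.readFields(in);
    int n = in.readInt();
    children.clear();
    for (int i = 0; i < n; i++) {
      Rule child = new Rule();
      child.readFields(in);      // recurse on the way back in
      children.add(child);
    }
  }
}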

I tried to create a subclass of FileSplit to convey the data, but I found it is
not elegant to implement: the FileSplits are initialized in getSplits() of the
InputFormat, and the only way to initialize the InputFormat is via setConf(). So
I would end up implementing three new subclasses with the same custom fields:
FileSplit, InputFormat, and Configuration.

Another approach may be to write these objects to a file on HDFS and distribute
it through the DistributedCache.
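
Something like this is what I had in mind for that (the path, the replication
factor, and the helper names are just placeholders):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;

public class CacheShipping {

  // Driver side: write the object to HDFS with a high replication factor
  // and register the file with the distributed cache.
  public static void ship(Configuration conf, Writable spec) throws IOException {
    Path specPath = new Path("/tmp/myapp/spec.bin");            // placeholder path
    FSDataOutputStream out = FileSystem.get(conf).create(specPath, (short) 10);
    spec.write(out);
    out.close();
    DistributedCache.addCacheFile(specPath.toUri(), conf);
  }

  // Task side (e.g. Mapper.setup()): the cached file is on the local disk.
  public static void load(Configuration conf, Writable spec) throws IOException {
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    FSDataInputStream in = FileSystem.getLocal(conf).open(cached[0]);
    spec.readFields(in);
    in.close();
  }
}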

I just wonder: is there a better way to do this?

Thank you.
---
Zhiwei Xiao
