Re: Passing data from Client to AM

2014-01-30 Thread Hitesh Shah
Adding values to a Configuration object does not really work unless you serialize the config into a file and send it over to the AM and containers as a local resource. The application code would then need to load in this file using Configuration::addResource(). MapReduce does this by taking in a
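A minimal sketch of the round trip described above, assuming the file has already been shipped to the container as a local resource (that step is not shown). The class name `ConfRoundTrip` and its two helpers are hypothetical; `Configuration.writeXml`, `Configuration.addResource(Path)` and `new Configuration(false)` are the Hadoop calls it relies on.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class, for illustration only.
public class ConfRoundTrip {

    // Client side: serialize the Configuration to a local XML file.
    // The caller is then expected to register this file as a local resource.
    static void writeConf(Configuration conf, File target) throws IOException {
        try (OutputStream out = new FileOutputStream(target)) {
            conf.writeXml(out);
        }
    }

    // AM/container side: load the localized file into a fresh Configuration.
    // Passing false skips the default resources, so only the shipped values are present.
    static Configuration loadConf(File localized) {
        Configuration conf = new Configuration(false);
        conf.addResource(new Path(localized.toURI()));
        return conf;
    }
}
```

On the client this would be called as `ConfRoundTrip.writeConf(conf, new File("app-conf.xml"))` before submission, and in the AM or container as `ConfRoundTrip.loadConf(new File("app-conf.xml"))`; the file name is an assumption of this sketch.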

Re: shifting sequenceFileOutput format to Avro format

2014-01-30 Thread AnilKumar B
Thanks Yong. Thanks & Regards, B Anil Kumar. On Fri, Jan 31, 2014 at 12:44 AM, java8964 wrote: > In avro, you need to think about a schema to match your data. Avro's > schema is very flexible and should be able to store all kinds of data. > > If you have a JSON string, you have 2 options to ge

RE: shifting sequenceFileOutput format to Avro format

2014-01-30 Thread java8964
In avro, you need to think about a schema to match your data. Avro's schema is very flexible and should be able to store all kinds of data. If you have a JSON string, you have 2 options to generate the Avro schema for it: 1) Use "type: string" to store the whole JSON string into Avro. This will b

shifting sequenceFileOutput format to Avro format

2014-01-30 Thread AnilKumar B
Hi, As of now in my jobs, I am using SequenceFileOutputFormat and I am emitting custom Java objects as MR output. Now I am planning to emit them in Avro format. I went through a few blogs but still have the following doubts. 1) My current custom Writable objects have a nested JSON format as toString(), So

Re: Capture Directory Context in Hadoop Mapper

2014-01-30 Thread Felix Chern
MultipleInputs is nice. Most of the time, I use it for reduce-side joins. It's great; however, you'll need to specify a different Mapper class per input directory. In our case, we try to let the Mapper itself capture the directory information, because these directories might contain data across m

Re: DistributedCache deprecated

2014-01-30 Thread Amit Mittal
Hi Prav, You are correct, thanks for the explanation. As per the link below, I can see that Job's method internally calls DistributedCache itself ( http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-mapreduce-client-core/2.0.0-cdh4.4.0/org/apache

Re: DistributedCache deprecated

2014-01-30 Thread praveenesh kumar
Hi Amit, Side data distribution is a different concept altogether. It's when you set custom (key, value) pairs on the Job object, so that you can use them in your mappers/reducers. It is good when you want to pass some small information to your mappers/reducers like extra comm
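A rough sketch of the side-data approach described above: the driver sets a small value on the job's Configuration and the mapper reads it back in `setup()`. The class `SideDataExample`, the key `myapp.min.length` and the filtering logic are all assumptions made for illustration, not something taken from the thread.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical example class; names are chosen for this sketch only.
public class SideDataExample {

    // Hypothetical configuration key carrying the side data.
    static final String MIN_LENGTH_KEY = "myapp.min.length";

    // Driver side: store the value in the job's Configuration before submission.
    static void configure(Configuration conf, int minLength) {
        conf.setInt(MIN_LENGTH_KEY, minLength);
    }

    public static class FilterMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private int minLength;

        // Task side: read the value once per task; default to 0 if it was never set.
        @Override
        protected void setup(Context context) {
            minLength = context.getConfiguration().getInt(MIN_LENGTH_KEY, 0);
        }

        // Emit only lines whose encoded byte length reaches the configured minimum.
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (line.getLength() >= minLength) {
                context.write(line, offset);
            }
        }
    }
}
```

In a driver this would be used as `SideDataExample.configure(job.getConfiguration(), 10)` before the job is submitted. Values set this way travel inside the job configuration, which is why the approach suits small settings rather than files.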

Re: DistributedCache deprecated

2014-01-30 Thread Amit Mittal
Hi Prav, Yes, you are correct that DistributedCache does not upload files into memory. Also, using the job configuration and using DistributedCache are two different approaches. I am referring to "Hadoop: The Definitive Guide", Chapter 8 > Side Data Distribution (pages 288-295). As you are saying that now m

Re: DistributedCache deprecated

2014-01-30 Thread praveenesh kumar
Hi Amit, I am not sure how they are linked with DistributedCache. Job configuration does not upload any data into memory. As far as I am aware of how DistributedCache works, nothing gets loaded into memory. DistributedCache just copies the files to the slave nodes, so that they are accessible to mapp