Hi Jacob,

On May 13, 2012, at 4:48am, Jacob Metcalf wrote:

> 
> I have just spent several frustrating hours getting an example MR job 
> using Avro working with Hadoop, and after finally getting it working I 
> thought I would share my findings with everyone.
> 
> I wrote an example job using Avro MR 1.6.3 to serialize between Map and 
> Reduce, then attempted to deploy and run it. I am setting up a development 
> cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my 
> job and it failed with:
> 
> "org.apache.avro.generic.GenericData$Record cannot be cast to 
> net.jacobmetcalf.avro.Room" 
> 
> Where Room is an Avro-generated class. I found two problems. The first I 
> have partly solved; the second is more to do with Hadoop and is as yet 
> unsolved:
> 
> 1) Why, when I am using Avro Specific, does it end up going Generic?
> 
> When deserializing, SpecificDatumReader.java attempts to instantiate your 
> target class through reflection. If it fails to create your class it 
> defaults to a GenericData.Record. Doug has explained this here: 
> http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3c4d2b6d56.2070...@apache.org%3E
>  
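This fallback is easy to see in isolation. A minimal sketch, assuming the 
Avro 1.6.x SpecificData API and the generated Room class from this thread:

    import org.apache.avro.Schema;
    import org.apache.avro.specific.SpecificData;

    // SpecificDatumReader resolves the target class from the schema's full
    // name via reflection. When the class can't be loaded, getClass()
    // returns null and the reader quietly falls back to GenericData.Record,
    // which is what later surfaces as the ClassCastException.
    Schema schema = Room.SCHEMA$;
    Class<?> target = SpecificData.get().getClass(schema);
    System.out.println(target);  // prints null if Room is not on the classpath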
> 
> But why it was doing this was a little harder to work out. Debugging, I 
> saw that the SpecificDatumReader could not find my class on its classpath. 
> However, in my Job Runner I had done: 
> 
>     job.setJarByClass(HouseAssemblyJob.class);  // This should ensure the JAR is distributed around the cluster
> 
> I expected that with this Hadoop would distribute my Jar around the 
> cluster. It may be doing the distribution, but it definitely did not add 
> it to the Reducer's classpath. To get round this I have now set 
> HADOOP_CLASSPATH to the directory I am running from. This is not going to 
> work in a real cluster, where the Job Runner is on a different machine to 
> the Reducer, so I am keen to figure out whether the problem is Hadoop 
> 0.23, my environment variables or the fact I am running under Cygwin.

If your reducer is running, then Hadoop must have distributed your job jar.

In that case, any class that's actually in your job jar (in the proper 
position) will be distributed and on the classpath.

Sometimes the problem is that you've got a dependent jar, which then needs to 
be in the "lib" subdirectory inside of your job jar. Are you maybe building 
your Avro generated classes into a separate jar, and then adding that to the 
job jar?
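
For reference, a job jar laid out the way Hadoop expects would look roughly 
like this (an illustrative layout, using the class names from this thread):

    myjob.jar
    |-- net/jacobmetcalf/HouseAssemblyJob.class   (your classes at the root)
    |-- net/jacobmetcalf/avro/Room.class
    `-- lib/
        `-- avro-1.6.3.jar                        (dependency jars under lib/)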

Finally, running under Cygwin is…challenging. I teach a Hadoop class, and often 
the hardest part of the lab is getting everybody's Cygwin installation working 
with Hadoop. The fact that you've got pseudo-distributed mode working on Cygwin 
is impressive in itself, but I would suggest trying your job on a real cluster, 
e.g. use Elastic MapReduce.
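
If you do need to ship extra jars without leaning on HADOOP_CLASSPATH, the 
distributed cache is another route. A minimal sketch, assuming the 0.23 
mapreduce Job API (the HDFS path here is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf);
    job.setJarByClass(HouseAssemblyJob.class);
    // The jar must already be in HDFS; Hadoop localizes it to every task
    // node and adds it to the task classpath.
    job.addFileToClassPath(new Path("/libs/avro-1.6.3.jar"));

The same effect is available from the command line with -libjars if your 
driver goes through ToolRunner.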

> 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
> 
> Whilst debugging I realised that Hadoop ships with Avro 1.5.3. However, I 
> want to use 1.6.3 (and 1.7 when it comes out) because of its support for 
> immutability & builders in the generated classes. I probably could just 
> hack the old Avro lib out of my Hadoop distribution and drop the new one 
> in, but I thought it would be cleaner to get Hadoop to distribute my jar 
> to all datanodes and then manipulate my classpath to put the latest 
> version of Avro at the top. So I have packaged Avro 1.6.3 into my job jar 
> using Maven assembly

Did you ensure that it's inside of the /lib subdirectory? What does your job 
jar look like (via "jar tvf <path to job jar>")?
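
A quick runtime check will also tell you which Avro actually won: log where 
the Schema class was loaded from inside a task. A sketch (e.g. in your 
reducer's setup method):

    // Prints the jar that org.apache.avro.Schema was loaded from, which
    // shows whether your 1.6.3 copy beat the 1.5.3 bundled with Hadoop.
    // (getCodeSource() can be null for bootstrap classes, not the case here.)
    java.security.CodeSource src =
        org.apache.avro.Schema.class.getProtectionDomain().getCodeSource();
    System.err.println(src == null ? "bootstrap" : src.getLocation());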

-- Ken

> and tried to do this in my JobRunner:
> 
>     job.setJarByClass( MyJob.class );                                        // This should ensure the JAR is distributed around the cluster
>     config.setBoolean( MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true ); // ensure my version of avro?
> 
> But it continues to use 1.5.3. I suspect it is again to do with my 
> HADOOP_CLASSPATH, which has avro-1.5.3 in it: 
> 
>     export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
> 
> If anyone has done this and has any ideas, please let me know.
> 
> Thanks
> 
> Jacob

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



