Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to

2012-05-13 Thread Jacob Metcalf


I have just spent several frustrating hours getting an example MR job using
Avro working with Hadoop, and after finally getting it working I thought I
would share my findings with everyone.

I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map
and Reduce, then attempted to deploy and run it. I am setting up a development
cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my job
and it failed with:

org.apache.avro.generic.GenericData$Record cannot be cast to
net.jacobmetcalf.avro.Room

where Room is an Avro-generated class. I found two problems. The first I have
partly solved; the second is more to do with Hadoop and is as yet unsolved:
1) Why, when I am using Avro Specific, does it end up going Generic?

When deserializing, SpecificDatumReader attempts to instantiate your target
class through reflection. If it fails to create your class, it defaults to a
GenericData.Record. Doug has explained this here:
http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3c4d2b6d56.2070...@apache.org%3E
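
To see the fallback for yourself, here is a minimal sketch (my own
illustration, not from Doug's post; the schema file name is made up) of how
SpecificData resolves the implementing class by the schema's full name and
returns null when it is not on the classpath, at which point the reader goes
Generic:

    import org.apache.avro.Schema;
    import org.apache.avro.specific.SpecificData;

    public class FallbackCheck {
        public static void main(String[] args) throws Exception {
            // Parse the schema of the generated record (file name illustrative).
            Schema schema = new Schema.Parser().parse(new java.io.File("room.avsc"));
            // SpecificData looks the class up by the schema's full name and
            // returns null if it cannot be loaded from the current classpath.
            Class<?> c = SpecificData.get().getClass(schema);
            System.out.println(c == null
                ? "Class missing -> reader falls back to GenericData.Record"
                : "Reader will instantiate " + c.getName());
        }
    }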
Why it was doing this was a little harder to work out. Debugging, I saw that
the SpecificDatumReader could not find my class on its classpath. However, in
my Job Runner I had done:
    // This should ensure the JAR is distributed around the cluster
    job.setJarByClass(HouseAssemblyJob.class);
I expected that with this, Hadoop would distribute my jar around the cluster.
It may be doing the distribution, but it definitely did not add the jar to the
Reducer's classpath. To get round this I have now set HADOOP_CLASSPATH to the
directory I am running from. This is not going to work in a real cluster,
where the Job Runner is on a different machine from the Reducer, so I am keen
to figure out whether the problem is Hadoop 0.23, my environment variables or
the fact that I am running under Cygwin.
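
One way to confirm this from inside the task JVM (a rough sketch of my own;
the key/value types are placeholders) is to probe for the generated class in
the reducer's setup():

    import java.io.IOException;
    import org.apache.hadoop.mapreduce.Reducer;

    public class DebugReducer extends Reducer<Object, Object, Object, Object> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            try {
                // Succeeds only if the generated class is on the task classpath.
                Class<?> c = Class.forName("net.jacobmetcalf.avro.Room");
                System.err.println("Room loaded from "
                    + c.getProtectionDomain().getCodeSource());
            } catch (ClassNotFoundException e) {
                System.err.println("Room is NOT on the task classpath");
            }
        }
    }

If the second message shows up in the task logs, the job jar is not making it
onto the reducer's classpath at all.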

2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?

Whilst debugging I realised that Hadoop ships with Avro 1.5.3. I, however,
want to use 1.6.3 (and 1.7 when it comes out) because of its support for
immutability & builders in the generated classes. I probably could just hack
the old Avro lib out of my Hadoop distribution and drop the new one in.
However, I thought it would be cleaner to get Hadoop to distribute my jar to
all datanodes and then manipulate my classpath to get the latest version of
Avro to the top. So I have packaged Avro 1.6.3 into my job jar using the Maven
assembly plugin and tried to do this in my JobRunner:
    // This should ensure the JAR is distributed around the cluster
    job.setJarByClass(MyJob.class);
    // Ensure my version of Avro comes first?
    config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
But it continues to use 1.5.3. I suspect it is again to do with my
HADOOP_CLASSPATH, which has avro-1.5.3 in it:

    export HADOOP_CLASSPATH=$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*
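
One alternative I have not yet tried, sketched below, is to push the newer
Avro jar to HDFS and add it to every task's classpath through the distributed
cache rather than relying on HADOOP_CLASSPATH (the HDFS path here is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.MRJobConfig;

    public class JobRunnerSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf);
            job.setJarByClass(JobRunnerSketch.class);
            // Add a jar that has already been uploaded to HDFS to the
            // classpath of every map and reduce task.
            job.addFileToClassPath(new Path("/libs/avro-1.6.3.jar"));
            // Ask the framework to put user classes ahead of its own.
            job.getConfiguration().setBoolean(
                MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
        }
    }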
If anyone has done this and has any ideas, please let me know.
Thanks
Jacob 

Re: Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to

2012-05-13 Thread Russell Jurney
Consider Pig and AvroStorage.

Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com


Re: Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to

2012-05-13 Thread Ken Krugler
Hi Jacob,

On May 13, 2012, at 4:48am, Jacob Metcalf wrote:

 
> I have just spent several frustrating hours getting an example MR job using
> Avro working with Hadoop, and after finally getting it working I thought I
> would share my findings with everyone.
>
> I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map
> and Reduce, then attempted to deploy and run it. I am setting up a
> development cluster with Hadoop 0.23 running pseudo-distributed under
> Cygwin. I ran my job and it failed with:
>
> org.apache.avro.generic.GenericData$Record cannot be cast to
> net.jacobmetcalf.avro.Room
>
> where Room is an Avro-generated class. I found two problems. The first I
> have partly solved; the second is more to do with Hadoop and is as yet
> unsolved:
>
> 1) Why, when I am using Avro Specific, does it end up going Generic?
>
> When deserializing, SpecificDatumReader attempts to instantiate your target
> class through reflection. If it fails to create your class, it defaults to
> a GenericData.Record. Doug has explained this here:
> http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%3c4d2b6d56.2070...@apache.org%3E
>
> Why it was doing this was a little harder to work out. Debugging, I saw
> that the SpecificDatumReader could not find my class on its classpath.
> However, in my Job Runner I had done:
>
>     // This should ensure the JAR is distributed around the cluster
>     job.setJarByClass(HouseAssemblyJob.class);
>
> I expected that with this, Hadoop would distribute my jar around the
> cluster. It may be doing the distribution, but it definitely did not add
> the jar to the Reducer's classpath. To get round this I have now set
> HADOOP_CLASSPATH to the directory I am running from. This is not going to
> work in a real cluster, where the Job Runner is on a different machine from
> the Reducer, so I am keen to figure out whether the problem is Hadoop 0.23,
> my environment variables or the fact that I am running under Cygwin.

If your reducer is running, then Hadoop must have distributed your job jar.

In that case, any class that's actually in your job jar (in the proper 
position) will be distributed and on the classpath.

Sometimes the problem is that you've got a dependent jar, which then needs to 
be in the lib subdirectory inside of your job jar. Are you maybe building 
your Avro generated classes into a separate jar, and then adding that to the 
job jar?

Finally, running under Cygwin is…challenging. I teach a Hadoop class, and often 
the hardest part of the lab is getting everybody's Cygwin installation working 
with Hadoop. The fact that you've got pseudo-distributed mode working on Cygwin 
is impressive in itself, but I would suggest trying your job on a real cluster, 
e.g. use Elastic MapReduce.

> 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
>
> Whilst debugging I realised that Hadoop ships with Avro 1.5.3. I, however,
> want to use 1.6.3 (and 1.7 when it comes out) because of its support for
> immutability & builders in the generated classes. I probably could just
> hack the old Avro lib out of my Hadoop distribution and drop the new one
> in. However, I thought it would be cleaner to get Hadoop to distribute my
> jar to all datanodes and then manipulate my classpath to get the latest
> version of Avro to the top. So I have packaged Avro 1.6.3 into my job jar
> using the Maven assembly plugin

Did you ensure that it's inside of the /lib subdirectory? What does your job
jar look like (via jar tvf <path to job jar>)?

-- Ken

> and tried to do this in my JobRunner:
>
>     // This should ensure the JAR is distributed around the cluster
>     job.setJarByClass(MyJob.class);
>     // Ensure my version of Avro comes first?
>     config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
>
> But it continues to use 1.5.3. I suspect it is again to do with my
> HADOOP_CLASSPATH, which has avro-1.5.3 in it:
>
>     export HADOOP_CLASSPATH=$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*
>
> If anyone has done this and has any ideas, please let me know.
>
> Thanks
>
> Jacob

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr

RE: Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to

2012-05-13 Thread Jacob Metcalf





Ken, thanks for getting back to me.

1) The Avro specific classes are generated and packed in the same JAR as the
mapper and reducer. Attached is my example
(http://markmail.org/download.xqy?id=m6te4atgmyrrqyv5&number=1), which in
parallel I am also getting working with MRUnit, so am discussing it on that
forum too. If you want to build it you will need to build odagio-avro.

I agree, and cannot comprehend how, if the mapper can serialize, the reducer
cannot deserialize. My only guess is that the reducer runs in a separate JVM,
and it is only this JVM which has classpath issues. Logically the mapper
output would be deserialized before my reducer is instantiated. I noticed that
the JAR does get exploded, so my only thought is that something is going wrong
in the Cygwin/Hadoop layer at reduction.

2) Yes, the latest version of Avro is in my job jar. However, I am again not
sure how to manipulate the Hadoop classpath to ensure it comes first. This is
possibly more a topic for the Hadoop list.

Regards

Jacob



Re: Hadoop 0.23, Avro Specific 1.6.3 and org.apache.avro.generic.GenericData$Record cannot be cast to

2012-05-13 Thread Ken Krugler
Hi Jacob,

On May 13, 2012, at 2:03pm, Jacob Metcalf wrote:

> Ken, thanks for getting back to me.
>
> 1) The Avro specific classes are generated and packed in the same JAR as
> the mapper and reducer. Attached is my example
> (http://markmail.org/download.xqy?id=m6te4atgmyrrqyv5&number=1), which in
> parallel I am also getting working with MRUnit, so am discussing it on that
> forum too. If you want to build it you will need to build odagio-avro.
>
> I agree, and cannot comprehend how, if the mapper can serialize, the
> reducer cannot deserialize. My only guess is that the reducer runs in a
> separate JVM, and it is only this JVM which has classpath issues. Logically
> the mapper output would be deserialized before my reducer is instantiated.
> I noticed that the JAR does get exploded, so my only thought is that
> something is going wrong in the Cygwin/Hadoop layer at reduction.
>
> 2) Yes, the latest version of Avro is in my job jar. However, I am again
> not sure how to manipulate the Hadoop classpath to ensure it comes first.
> This is possibly more a topic for the Hadoop list.

Two comments…

1. Your pom.xml doesn't look like it's set up to build a proper Hadoop job jar.

After running mvn assembly:assembly you should have a job jar that has a lib
subdirectory, and inside of that sub-dir you'll have all of the jars (NOT the
classes) for your dependencies such as Avro.

See http://exported.wordpress.com/2010/01/30/building-hadoop-job-jar-with-maven/
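
Roughly, the layout you are aiming for looks like this (jar and package names
here are illustrative):

    hadoop-example-job.jar
        net/jacobmetcalf/...        your job classes and generated Avro classes
        lib/avro-1.6.3.jar          dependency jars, not unpacked classes
        lib/odagio-avro-x.y.jar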

After running mvn assembly:assembly in your example directory, I get a
target/hadoop-example.jar file that's got Hadoop classes (and a bunch of
others) all jammed inside it.

And your job jar shouldn't have Hadoop classes or jars inside it at all -
those should be marked as provided dependencies.

2. I would suggest using Hadoop 0.20.2 if you're on Cygwin.

That version avoids issues with Hadoop not being able to set permissions on 
local file system directories.

Regards,

-- Ken
