Hi, 
Since I am not on the ORC mailing list… and since the ORC Java code is in the 
Hive APIs… this seems like a good place to start. ;-)


So… 

Ran into a little problem… 

One of my developers was writing a map/reduce job to read records from a source 
and, after some filtering, write the result set to an ORC file. 
There’s an example of how to do this at:
http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html

So far, so good. 
But now here’s the problem… Large source data means many mappers, and after 
the filter the output is only a fraction of the input in size. 
So we want to send everything to a single reducer (an identity reducer) so that 
we get only a single file. 

Here’s the snag. 

We were using the OrcSerde class to serialize the data and generate an Orc row 
which we then wrote to the file. 

Looking at the source code for OrcSerde, OrcSerde.serialize() returns an 
OrcSerdeRow.
see: 
http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java

OrcSerdeRow implements Writable and, as we can see in the example code… for a 
map-only example… context.write(Text, Writable) works. 

However… if we attempt to make this into a Map/Reduce job, we run into a 
problem at run time. The context.write() call throws the following exception:
 "Error: java.io.IOException: Type mismatch in value from map: expected 
org.apache.hadoop.io.Writable, received 
org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow"


The goal was to reduce the orc rows and then write out in the reducer. 

I’m curious as to why context.write() fails. 
The error is a bit cryptic since the OrcSerdeRow implements Writable… so the 
error message doesn’t make sense. 
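
One guess, not yet confirmed against the Hadoop source for our version: the message reads as if the map output collector compares the runtime class of the value against the configured map output value class for exact equality, instead of doing an instanceof-style test. The sketch below reproduces that behaviour in plain Java. Note that `Writable`, `OrcSerdeRow`, `collect` and `accepts` here are stand-ins defined only for illustration, not the real Hadoop or Hive classes.

```java
import java.io.IOException;

public class TypeCheckSketch {
    // Stand-ins for illustration only; NOT the real Hadoop/Hive types.
    interface Writable {}
    static final class OrcSerdeRow implements Writable {}

    // Assumed behaviour: exact class comparison, not an instanceof test.
    static void collect(Object value, Class<?> configuredValueClass) throws IOException {
        if (value.getClass() != configuredValueClass) {
            throw new IOException("Type mismatch in value from map: expected "
                    + configuredValueClass.getName() + ", received "
                    + value.getClass().getName());
        }
    }

    // Returns true if collect() accepts the value, false if it throws.
    static boolean accepts(Object value, Class<?> configuredValueClass) {
        try {
            collect(value, configuredValueClass);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        OrcSerdeRow row = new OrcSerdeRow();
        System.out.println(Writable.class.isInstance(row));   // true: the row is a Writable
        System.out.println(accepts(row, Writable.class));     // false: class is not exactly Writable
        System.out.println(accepts(row, OrcSerdeRow.class));  // true: exact class match
    }
}
```

If that guess is right, declaring the Writable interface as the map output value class would always trip the check, and a map-only job would never reach it. Declaring the concrete class might get past the check, but the row would still have to serialize itself for the shuffle, which I have not verified it can do.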


Now the quick fix is to borrow the ArrayListWritable from Giraph, put the list 
of fields into an ArrayListWritable, and pass that to the reducer, which will 
then use it to generate the ORC file. 
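
To make the workaround concrete, here is a minimal sketch of the idea: a container that writes its fields out and reads them back, so the plain field values cross the shuffle and the ORC row is only built in the reducer. This is not Giraph's ArrayListWritable. `FieldListSketch` and `roundTrip` are hypothetical names, the fields are assumed to be non-null strings for simplicity, and the class deliberately has no Hadoop dependency. In a real job it would implement org.apache.hadoop.io.Writable, whose two methods have the same signatures as the ones below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical container for one record's fields (assumed non-null strings).
public class FieldListSketch {
    private final List<String> fields = new ArrayList<>();

    public FieldListSketch() {}

    public FieldListSketch(List<String> values) {
        fields.addAll(values);
    }

    public List<String> getFields() {
        return fields;
    }

    // Same signature as Writable.write: field count first, then each field.
    public void write(DataOutput out) throws IOException {
        out.writeInt(fields.size());
        for (String f : fields) {
            out.writeUTF(f); // writeUTF caps each string at 65535 encoded bytes
        }
    }

    // Same signature as Writable.readFields: clear first, since instances may be reused.
    public void readFields(DataInput in) throws IOException {
        fields.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            fields.add(in.readUTF());
        }
    }

    // Serializes to bytes and back, imitating what a shuffle would do.
    static List<String> roundTrip(List<String> values) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new FieldListSketch(values).write(new DataOutputStream(buf));
            FieldListSketch copy = new FieldListSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy.getFields();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("42", "alice", "2014-07-01")));
    }
}
```

The reducer would then take the field list from each value and run it through OrcSerde.serialize() itself, the same way the mapper does in the map-only example.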

Trying to figure out why the context.write() fails… when sending to a reducer 
while it works if it’s a map-side write.

The documentation on the ORC site is … well… to be polite… lacking. ;-) 

I have some ideas why it doesn’t work, however I would like to confirm my 
suspicions. 

Thx

-Mike

