How to process different types of Avro schemas

2013-03-18 Thread sourabh chaki
Hi All,

In my application I receive a stream of Avro events. The stream mixes events
belonging to different schemas, and I am wondering what the right way is to
process this data and run analytics on top of it. Can I use Hive? I studied
the Avro SerDe, which can decode Avro data, and I think I need to transform
the input stream into (multiple) sets of entries belonging to different
tables. For this I am considering a mapper job that extracts the events type
by type, after which we could run Hive on top of each separate schema. Has
anyone dealt with such a scenario before, and would this approach perform
reasonably well?
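The mapper-side demultiplexing described above can be sketched roughly as follows. This is a minimal plain-Java illustration, not the actual job: the `Event` record is a hypothetical stand-in for an Avro `GenericRecord` carrying its schema's full name, and a real job would write each group to a per-schema output directory (e.g. via Avro's multiple-outputs support) rather than collect them in memory.

```java
import java.util.*;

public class SchemaDemux {
    // Hypothetical stand-in for an Avro GenericRecord: the schema's
    // full name plus an opaque payload.
    record Event(String schemaName, String payload) {}

    // Group a mixed stream of events by schema name, so each group can
    // feed its own output (one Hive table per schema).
    static Map<String, List<Event>> demux(List<Event> stream) {
        Map<String, List<Event>> byType = new HashMap<>();
        for (Event e : stream) {
            byType.computeIfAbsent(e.schemaName(), k -> new ArrayList<>())
                  .add(e);
        }
        return byType;
    }

    public static void main(String[] args) {
        List<Event> stream = List.of(
            new Event("com.example.Click", "c1"),
            new Event("com.example.View", "v1"),
            new Event("com.example.Click", "c2"));
        // Two Click events end up grouped together.
        System.out.println(demux(stream).get("com.example.Click").size()); // prints 2
    }
}
```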

The alternative is to put all the analytics logic we want to run on this data
directly into M-R code. Please advise.

Thanks in advance.

Sourabh


Re: Avro and Oozie Map Reduce action

2013-03-18 Thread Harsh J
The value you're specifying for io.serializations below is incorrect:

<property>
  <name>io.serializations</name>
  <value>org.apache.avro.mapred.AvroSerialization,
avro.serialization.key.reader.schema,
avro.serialization.value.reader.schema,
avro.serialization.key.writer.schema,avro.serialization.value.writer.schema</value>
</property>

If the goal is to include org.apache.avro.mapred.AvroSerialization,
then it should look more like:

<property>
  <name>io.serializations</name>
  <value>org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.avro.AvroSpecificSerialization,org.apache.hadoop.io.serializer.avro.AvroReflectSerialization,org.apache.avro.mapred.AvroSerialization</value>
</property>

That is, it must be an extension of the default values, and not a
replacement of them.
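The reason the defaults must stay is that Hadoop resolves io.serializations by walking the configured class list in order and using the first serialization that accepts the key/value class; dropping WritableSerialization leaves ordinary Writable types with no serializer at all. A minimal sketch of that first-match lookup follows; the `Serialization` record here is a hypothetical stdlib-only stand-in for Hadoop's `Serialization<T>` interface, used only to illustrate the lookup order:

```java
import java.util.*;
import java.util.function.Predicate;

public class SerializationLookup {
    // Hypothetical stand-in for o.a.h.io.serializer.Serialization<T>:
    // a class name plus an accept() predicate over the type to serialize.
    record Serialization(String name, Predicate<Class<?>> accepts) {}

    // First-match-wins lookup over the configured io.serializations list,
    // mirroring how Hadoop's SerializationFactory picks a serializer.
    static Optional<String> find(List<Serialization> configured, Class<?> c) {
        return configured.stream()
                .filter(s -> s.accepts().test(c))
                .map(Serialization::name)
                .findFirst();
    }
}
```

With only an Avro-style entry configured, a lookup for a non-Avro class comes back empty, which is the situation the broken config above creates.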

On Wed, Mar 13, 2013 at 4:05 AM, M, Paul pa...@iqt.org wrote:
 Hello,

 I am trying to run an M/R job with Avro serialization via Oozie.  I've made
 some progress in the workflow.xml, however I am still running into the
 following error.  Any thoughts?  I believe it may have to do with the
 io.serializations property below.   FYI, I am using CDH 4.2.0 mr1.

 2013-03-12 15:24:32,334 INFO org.apache.hadoop.mapred.TaskInProgress: Error
 from attempt_20130318_0080_m_00_3: java.lang.NullPointerException
 at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:356)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:389)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1407)
 at org.apache.hadoop.mapred.Child.main(Child.java:262)


 <action name="mr-node">
     <map-reduce>
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <prepare>
             <delete path="${nameNode}/user/${wf:user()}/${outputDir}" />
         </prepare>
         <configuration>
             <property>
                 <name>mapred.job.queue.name</name>
                 <value>${queueName}</value>
             </property>

             <property>
                 <name>mapreduce.reduce.class</name>
                 <value>org.apache.avro.mapred.HadoopReducer</value>
             </property>
             <property>
                 <name>mapreduce.map.class</name>
                 <value>org.apache.avro.mapred.HadoopMapper</value>
             </property>

             <property>
                 <name>avro.reducer</name>
                 <value>org.my.project.mapreduce.CombineAvroRecordsByHourReducer</value>
             </property>
             <property>
                 <name>avro.mapper</name>
                 <value>org.my.project.mapreduce.ParseMetadataAsTextIntoAvroMapper</value>
             </property>

             <property>
                 <name>mapreduce.inputformat.class</name>
                 <value>org.my.project.mapreduce.NonSplitableInputFormat</value>
             </property>

             <!-- Key Value Mapper -->
             <property>
                 <name>avro.output.schema</name>
                 <value>{"type":"record","name":"Pair","namespace":"org.apache.avro.mapred","fields":...}]}</value>
             </property>
             <property>
                 <name>mapred.mapoutput.key.class</name>
                 <value>org.apache.avro.mapred.AvroKey</value>
             </property>
             <property>
                 <name>mapred.mapoutput.value.class</name>
                 <value>org.apache.avro.mapred.AvroValue</value>
             </property>

             <property>
                 <name>avro.schema.output.key</name>
                 <value>{"type":"record","name":"DataRecord","namespace":...]}]}</value>
             </property>

             <property>
                 <name>mapreduce.outputformat.class</name>
                 <value>org.apache.hadoop.mapreduce.lib.output.TextOutputFormat</value>
             </property>

             <property>
                 <name>mapred.output.key.comparator.class</name>
                 <value>org.apache.avro.mapred.AvroKeyComparator</value>
             </property>

             <property>
                 <name>io.serializations</name>
                 <value>org.apache.avro.mapred.AvroSerialization,avro.serialization.key.reader.schema,avro.serialization.value.reader.schema,avro.serialization.key.writer.schema,avro.serialization.value.writer.schema</value>
             </property>

             <property>
                 <name>mapred.map.tasks</name>
                 <value>1</value>
             </property>

             <!-- Input/Output -->
             <property>
                 <name>mapred.input.dir</name>
                 <value>/user/${wf:user()}/input/</value>
             </property>
             <property>
                 <name>mapred.output.dir</name>
                 <value>/user/${wf:user()}/${outputDir}</value>
             </property>
         </configuration>
     </map-reduce>
 </action>



-- 
Harsh J