From the link:

Setting in the driver config:

    AvroMultipleOutputs.addNamedOutput(job, "avro1", AvroKeyValueOutputFormat.class, keySchema, valueSchema);

Then in the reducer you can write to the named output:

    amos.write("avro1", datum, NullWritable.get());
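To make the two calls above concrete, here is a minimal, hedged driver-and-reducer sketch. It assumes the Avro 1.7.x org.apache.avro.mapreduce API; the output names ("avro1", "avro2"), the generated record classes RecordA/RecordB, and the buildRecordA helper are placeholders for illustration, not part of the thread:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class MultiSchemaExample {

  // Driver side: register one named output per schema instead of a
  // single job-wide output schema. RecordA/RecordB are hypothetical
  // Avro-generated classes; the record is written as the key, so the
  // value schema is NULL.
  public static void configureOutputs(Job job) {
    AvroMultipleOutputs.addNamedOutput(job, "avro1",
        AvroKeyValueOutputFormat.class,
        RecordA.getClassSchema(),          // key schema
        Schema.create(Schema.Type.NULL));  // value schema
    AvroMultipleOutputs.addNamedOutput(job, "avro2",
        AvroKeyValueOutputFormat.class,
        RecordB.getClassSchema(),
        Schema.create(Schema.Type.NULL));
  }

  // Reducer side: open AvroMultipleOutputs in setup(), write each datum
  // to the named output matching its schema, and close in cleanup() so
  // the output files are flushed.
  public static class MultiOutReducer
      extends Reducer<Text, Text, NullWritable, NullWritable> {
    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
      amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      RecordA datum = buildRecordA(key, values); // hypothetical helper
      amos.write("avro1", datum, NullWritable.get());
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      amos.close();
    }
  }
}
```

Note that AvroMultipleOutputs must be closed in cleanup(); otherwise the named-output files may be left empty.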
On Thursday, June 25, 2015 5:23 PM, Nishanth S <chinchu2...@gmail.com> wrote:

The Avro documentation here says it is possible but doesn't say how to configure the AvroJob in the driver: http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html

-Nishanth

On Thu, Jun 25, 2015 at 4:10 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:

Looking at the example (http://avro.apache.org/docs/current/mr.html), I don't think it would be possible to configure multiple output schemas in one job. A JobConf can only set one writer schema with one output path (http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/JobConf.html). I believe all output data from a job is required to have the same schema; I have not seen a use case where a MapReduce job has multiple output schemas.

Sam

On Thursday, June 25, 2015 4:35 PM, Nishanth S <chinchu2...@gmail.com> wrote:

Thank you, Sam. I am trying to read a single binary file in MapReduce and split it into 4 Avro files, each with a different schema. I am trying to do this in one job, but I am still not sure how to specify multiple output schemas on an AvroJob instance. Do we need to create multiple instances of AvroJob in the MapReduce driver to do this?

Thanks,
Nishan

On Thu, Jun 25, 2015 at 2:53 PM, Sam Groth <sgr...@yahoo-inc.com> wrote:

If you process 4 files with schemas A, B, C, and D as the writer schemas, then I would assume you would want to specify the reader schema using the setInput*Schema methods. You can then set the writer schema with the methods you are calling. To be clear: all data processed by the job should have one reader schema, determined when the data is read, and there should also be one writer schema (possibly different from the reader schema) when the data is written back to files. If you need to process the data from each schema independently, you should probably create one job for each schema.
Disclaimer: I have never used the AvroJob interface directly, so this is just me inferring what I think it should do based on my experience with AvroStorage and the other language-specific Avro interfaces. Hope this helps,

Sam

On Thursday, June 25, 2015 12:53 PM, Nishanth S <chinchu2...@gmail.com> wrote:

Hello All,

We are using Avro 1.7.7 and Hadoop 2.5.1 in our project. We need to process a mixed-mode binary file using MapReduce and produce multiple Avro files as output, each with a different Avro schema. I looked at the AvroMultipleOutputs class but did not completely understand what needs to be done in the driver class. This is a map-only job whose output should be 4 different Avro files (with different Avro schemas) written to different HDFS directories. Do we need to set all key and value Avro schemas on AvroJob in the driver class?

    AvroJob.setOutputKeySchema(job, Schema.create(Schema.Type.NULL));
    AvroJob.setOutputValueSchema(job, A.getClassSchema());

Now if I have schemas B, C, and D, how would these be set on AvroJob? Thanks for your help.

Thanks,
Nishan
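For what it's worth, a hedged sketch of how the map-only, four-schema case asked about above might be wired up with AvroMultipleOutputs rather than AvroJob.setOutput*Schema. The output names, paths, the decodeA helper, and the generated classes A..D are placeholders taken from the question's naming; this assumes the Avro 1.7.x mapreduce API, whose four-argument write(namedOutput, key, value, baseOutputPath) allows each schema's files to land under a different directory below the job output path:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroKeyValueOutputFormat;
import org.apache.avro.mapreduce.AvroMultipleOutputs;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyMultiSchemaDriver {

  public static Job buildJob(Configuration conf) throws IOException {
    Job job = Job.getInstance(conf, "split-binary-to-avro");
    job.setNumReduceTasks(0); // map-only, as in the question

    // One named output per schema; A..D are the Avro-generated classes.
    AvroMultipleOutputs.addNamedOutput(job, "outA",
        AvroKeyValueOutputFormat.class, A.getClassSchema(),
        Schema.create(Schema.Type.NULL));
    AvroMultipleOutputs.addNamedOutput(job, "outB",
        AvroKeyValueOutputFormat.class, B.getClassSchema(),
        Schema.create(Schema.Type.NULL));
    // ... likewise for C and D.

    FileOutputFormat.setOutputPath(job, new Path("/output/base"));
    return job;
  }

  public static class SplitMapper
      extends Mapper<LongWritable, BytesWritable, NullWritable, NullWritable> {
    private AvroMultipleOutputs amos;

    @Override
    protected void setup(Context context) {
      amos = new AvroMultipleOutputs(context);
    }

    @Override
    protected void map(LongWritable offset, BytesWritable bytes, Context context)
        throws IOException, InterruptedException {
      // decodeA is a hypothetical helper that parses the binary payload.
      A recordA = decodeA(bytes);
      // The base output path ("dirA/part") is resolved relative to the
      // job output directory, so each schema gets its own subdirectory.
      amos.write("outA", recordA, NullWritable.get(), "dirA/part");
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      amos.close();
    }
  }
}
```

The routing logic (deciding whether a given payload is an A, B, C, or D record) is not shown, since the thread does not describe the mixed-mode binary format.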