Re: Avro mapred: How to avoid schema specification in job.xml?

Scott Carey Mon, 10 Oct 2011 14:52:26 -0700

On 10/10/11 11:41 AM, "Julien Muller" <julien.mul...@ezako.com> wrote:


> Hello,
> 
> Thanks for your answer, let me try to clarify my context a bit:
> 
>> I'm not all that familiar with how Oozie interacts with Avro.
> Let's get oozie out of the picture. I use job.xml files to configure Jobs.
> This means I do not have any JobConf object and I cannot use AvroJob.
> Therefore I directly write the job properties (as what AvroJob outputs).
> 
>> The Job must set its avro.input.schema and avro.output.schema properties;
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples), 
> The solution I have now is basically based on the Avro mapred unit tests. But
> in my context, it is not an option to code (using the $SCHEMA property) at the
> job configuration level.
> where you code:
>     AvroJob.setInputSchema(job, Schema.create(Schema.Type.STRING));
> I have to copy the entire schema into the job.xml file, and I have to update it
> every time my schema gets updated.
> I hope I can find a better solution.

I suppose that in AvroJob we could transmit only the class name in a
property, and use that to look up the schema for generated classes using
reflection.  Could you do something similar?  I don't think it is possible
to avoid configuring at least some sort of pointer to where the schema is.
This could be via a property, or if you already have the job class, an
annotation on that class.
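
A rough sketch of the class-name idea, so it is clear what I mean. It
relies on the fact that Avro-generated specific classes carry a public
static SCHEMA$ field. Everything else here is made up for illustration:
the property name "avro.input.class" does not exist in Avro today, and
FakeUser is only a stand-in for a generated class so the snippet runs
without Avro on the classpath.

```java
import java.lang.reflect.Field;
import java.util.Properties;

class SchemaLookup {

    // Hypothetical property name; NOT an existing Avro setting.
    static final String INPUT_CLASS_PROPERTY = "avro.input.class";

    // Stand-in for an Avro-generated class. Real generated classes expose
    // a public static SCHEMA$ field of type org.apache.avro.Schema; a plain
    // Object is used here only to keep the sketch dependency-free.
    public static class FakeUser {
        public static final Object SCHEMA$ =
            "{\"type\":\"record\",\"name\":\"FakeUser\",\"fields\":[]}";
    }

    // Loads the named class and returns the string form of its SCHEMA$ field.
    static String schemaJsonFor(String className) {
        try {
            Class<?> recordClass = Class.forName(className);
            Field schemaField = recordClass.getField("SCHEMA$");
            // Static field, so no instance is needed.
            return String.valueOf(schemaField.get(null));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "No usable SCHEMA$ field on class " + className, e);
        }
    }

    public static void main(String[] args) {
        // job.xml would then carry only the class name, not the whole schema.
        Properties conf = new Properties();
        conf.setProperty(INPUT_CLASS_PROPERTY, FakeUser.class.getName());
        System.out.println(schemaJsonFor(conf.getProperty(INPUT_CLASS_PROPERTY)));
    }
}
```

With this approach the job file changes only when a class is renamed,
not when a field is added to the schema.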

> 
>> and if you are using SpecificRecords and DataFiles the schema is available to
>> the code where necessary.
> I am not sure what you mean here. I am using SpecificRecords and would like to
> avoid specifying avro.input.schema, since this info is already here in the
> specific record.

Potentially the AvroMapper / AvroReducer could have a fall-back for
obtaining the schema if the property is not set (reflection on a class name
or an annotation).  If this looks like it is an enhancement request for Avro
(or a bug) please file a JIRA ticket.  Thanks!
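
To make the fall-back idea concrete, here is a minimal sketch of the
lookup order: use avro.input.schema when it is set, otherwise follow an
annotation on the mapper class to a record class and read its SCHEMA$
field. This is not how AvroMapper behaves today. The @InputRecord
annotation, FakeUser and FakeUserMapper are all hypothetical names, and
java.util.Properties stands in for the job configuration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.Properties;

class SchemaFallback {

    // Hypothetical annotation; nothing like it ships with Avro.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface InputRecord {
        Class<?> value();
    }

    // Stand-in for an Avro-generated class with its static SCHEMA$ field.
    public static class FakeUser {
        public static final Object SCHEMA$ =
            "{\"type\":\"record\",\"name\":\"FakeUser\",\"fields\":[]}";
    }

    // Stand-in for a user's mapper, pointing at its input record class.
    @InputRecord(FakeUser.class)
    public static class FakeUserMapper { }

    // Prefers the explicit property; falls back to the annotation pointer.
    static String resolveInputSchema(Properties conf, Class<?> mapperClass) {
        String configured = conf.getProperty("avro.input.schema");
        if (configured != null) {
            return configured;
        }
        InputRecord pointer = mapperClass.getAnnotation(InputRecord.class);
        if (pointer == null) {
            throw new IllegalStateException(
                "avro.input.schema is not set and " + mapperClass.getName()
                + " has no @InputRecord annotation");
        }
        try {
            return String.valueOf(pointer.value().getField("SCHEMA$").get(null));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(
                "No usable SCHEMA$ field on " + pointer.value().getName(), e);
        }
    }

    public static void main(String[] args) {
        // Property not set, so the annotation on the mapper is used.
        System.out.println(
            resolveInputSchema(new Properties(), FakeUserMapper.class));
    }
}
```

Existing jobs that already set the property would keep working
unchanged, since the explicit value always wins.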

> 
> Thanks,
> 
> Julien Muller
> 
> 2011/10/10 Scott Carey <scottca...@apache.org>
>> I'm not all that familiar with how Oozie interacts with Avro.
>> 
>> The Job must set its avro.input.schema and avro.output.schema properties;
>> this can be done in code (see the unit tests in the Avro mapred project for
>> examples), and if you are using SpecificRecords and DataFiles the schema is
>> available to the code where necessary.
>> 
>> 
>> 
>> On 10/10/11 5:41 AM, "Julien Muller" <julien.mul...@ezako.com> wrote:
>> 
>>> Hello,
>>> 
>>> I have been using avro with hadoop and oozie for months now and I am very
>>> happy with the results.
>>> 
>>> The only point I see as a limitation now is that we specify Avro schemas in
>>> workflow.xml (job.xml):
>>> - avro.input.schema
>>> - avro.output.schema
>>> Since this info is already provided in Mapper/Reducer signatures, I see this
>>> as redundant. The schema is also present in all my serialized files, which
>>> means that the schema is specified in 3 different places.
>>> 
>>> From a run point of view, this is a pain, since any schema modification
>>> (let's say a simple optional field added) forces me to update many job
>>> files. This task is very error prone and since we have a large amount of
>>> jobs, it generates a lot of work.
>>> 
>>> The only solution I see now would be to find/replace in the build script,
>>> but I hope I could find a better solution by providing some generic schemas
>>> to the job file, or find a way to deactivate schema validation in the job.
>>> Any help will be appreciated!
>>> 
>>> -- 
>>> Julien Muller
>
