Stuart White wrote:
So I guess I'm (1) looking for "hello world" in avro, and (2)
attempting to determine the level of integration between avro and
Hadoop.  Do avro InputFormat/OutputFormat classes exist?

This is not yet a mature area. I wish integration with Hadoop was further along.

In Hadoop 0.21 (the next release) it should be possible to use SequenceFile{Input,Output}Format with Avro specific and reflect data.

This is due to the changes in:

https://issues.apache.org/jira/browse/HADOOP-6120

and

https://issues.apache.org/jira/browse/HADOOP-6165

(Note, however, that the patch did not add tests for end-to-end MapReduce, so there may still be some issues.)

For Avro generic data, perhaps the most useful with MapReduce, you'd need to somehow get the schema to the Serializer and Deserializer that are used by the shuffle, since I think it still uses the deprecated SerializationFactory#getSerialization(Class). This could be done by having the application or InputFormat add the schema to the job's Configuration, then having (a subclass of) AvroGenericDeserializer look for it there. (The Deserializer is Configurable, so it should have a copy of the Configuration available to it.) You'd use the class name passed in (metadata.get(CLASS_KEY)) as the key to help look up the schema in the config. Does that make any sense?
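
A rough sketch of that lookup, in case it helps make the idea concrete. This is not Hadoop or Avro code: plain Maps stand in for the job's Configuration and for the serialization metadata, and the key prefix, the CLASS_KEY value, and the method names are all made up for illustration. A real version would live in a subclass of AvroGenericDeserializer and would parse the JSON into an Avro Schema rather than returning a String.

```java
import java.util.HashMap;
import java.util.Map;

public class SchemaLookupSketch {
    // Hypothetical prefix for the config entries that hold schemas.
    static final String SCHEMA_KEY_PREFIX = "avro.generic.schema.";

    // Placeholder for the metadata key that carries the class name;
    // the real constant's value may differ.
    static final String CLASS_KEY = "Serialized-Class";

    // Job-setup side: the application or InputFormat stores the schema's
    // JSON text under a key derived from the class name.
    static void setSchema(Map<String, String> conf, String className, String schemaJson) {
        conf.put(SCHEMA_KEY_PREFIX + className, schemaJson);
    }

    // Deserializer side: use the class name from the metadata to find
    // the schema that was stored in the config.
    static String getSchema(Map<String, String> conf, Map<String, String> metadata) {
        String className = metadata.get(CLASS_KEY);
        String schemaJson = conf.get(SCHEMA_KEY_PREFIX + className);
        if (schemaJson == null) {
            throw new IllegalStateException("No schema configured for " + className);
        }
        return schemaJson;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setSchema(conf, "example.Pair", "{\"type\": \"string\"}");

        Map<String, String> metadata = new HashMap<>();
        metadata.put(CLASS_KEY, "example.Pair");

        System.out.println(getSchema(conf, metadata));
    }
}
```

Failing loudly when the schema is missing seems preferable to falling back to some default, since a wrong schema in the shuffle would otherwise show up only as corrupt records.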

There's also an open issue to define an InputFormat/OutputFormat for Avro's container file format:

https://issues.apache.org/jira/browse/MAPREDUCE-815

If you're interested in helping push this forward I'll help too.

Doug
