Stuart White wrote:
> So I guess I'm (1) looking for "hello world" in avro, and (2)
> attempting to determine the level of integration between avro and
> Hadoop. Do avro InputFormat/OutputFormat classes exist?
This is not yet a mature area. I wish integration with Hadoop was
further along.
In Hadoop 0.21 (the next release) it should be possible to use
SequenceFile{Input,Output}Format with Avro specific and reflect data.
This is due to the changes in:
https://issues.apache.org/jira/browse/HADOOP-6120
and
https://issues.apache.org/jira/browse/HADOOP-6165
(Note however that patch did not add tests for end-to-end MapReduce, so
there may still be some issues.)
For Avro generic data, perhaps the most useful with MapReduce, you'd
need to somehow get the schema to the Serializer and Deserializer that
are used by the shuffle, since I think it still uses the deprecated
SerializationFactory#getSerialization(Class). This could be done by
having the application or InputFormat add the schema to the job's
Configuration, then having (a subclass of) AvroGenericDeserializer look
for it there. (The Deserializer is Configurable, so it should have a
copy of the Configuration available to it.) You'd use the class name
passed in (metadata.get(CLASS_KEY)) as the key to help look up the schema
in the config. Does that make any sense?
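
To make the lookup idea a bit more concrete, here is a rough sketch of
just the keying scheme. It is not working Hadoop code: a plain
Map<String, String> stands in for the job's Configuration, the key
prefix "avro.generic.schema." is a made-up name rather than an existing
property, and the schema is kept as a JSON string instead of being
parsed by Avro. A real version would do the same put/get against the
Configuration from inside the Deserializer subclass.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: the Map stands in for Hadoop's Configuration, and the
// key prefix below is hypothetical, not a real Hadoop or Avro property.
final class SchemaLookupSketch {
    static final String SCHEMA_KEY_PREFIX = "avro.generic.schema.";

    // Job-setup side (application or InputFormat): store the schema JSON
    // under a key derived from the class name.
    static void putSchema(Map<String, String> conf, String className, String schemaJson) {
        conf.put(SCHEMA_KEY_PREFIX + className, schemaJson);
    }

    // Deserializer side: use the class name handed to the deserializer
    // to find the schema JSON that the job stored earlier.
    static String findSchema(Map<String, String> conf, String className) {
        String json = conf.get(SCHEMA_KEY_PREFIX + className);
        if (json == null) {
            throw new IllegalStateException("No schema configured for " + className);
        }
        return json;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        String className = "example.Event"; // hypothetical record class name
        String schemaJson = "{\"type\":\"record\",\"name\":\"Event\",\"fields\":[]}";
        putSchema(conf, className, schemaJson);
        System.out.println(findSchema(conf, className));
    }
}
```

Failing loudly when the key is missing seems preferable here, since a
silently absent schema would only show up later as a confusing
deserialization error in the shuffle.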
There's also an open issue to define an InputFormat/OutputFormat for
Avro's container file format:
https://issues.apache.org/jira/browse/MAPREDUCE-815
If you're interested in helping push this forward I'll help too.
Doug