The intent is to support an Avro-encoded document type that is roughly like a Lucene document.
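
To give the flavor, here is a hypothetical sketch in Avro's generic API (the schema below is invented for illustration, not the actual AvroDocument.avsc from Drew's patch): a document is essentially an array of named fields, each holding its original text, much as a Lucene Document is a list of named Fields.

{code}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

// Hypothetical schema: a document is an array of (name, originalText)
// fields -- invented for illustration only.
String json =
    "{\"type\": \"record\", \"name\": \"Document\", \"fields\": ["
    + " {\"name\": \"fields\", \"type\": {\"type\": \"array\", \"items\":"
    + "  {\"type\": \"record\", \"name\": \"Field\", \"fields\": ["
    + "   {\"name\": \"name\", \"type\": \"string\"},"
    + "   {\"name\": \"originalText\", \"type\": \"string\"}]}}}]}";

Schema docSchema = Schema.parse(json);
Schema arraySchema = docSchema.getField("fields").schema();

// Build one named field and attach it to a document record.
GenericRecord field = new GenericData.Record(arraySchema.getElementType());
field.put("name", new Utf8("content"));
field.put("originalText", new Utf8("the quick brown fox"));

GenericData.Array<GenericRecord> fields =
    new GenericData.Array<GenericRecord>(1, arraySchema);
fields.add(field);

GenericRecord doc = new GenericData.Record(docSchema);
doc.put("fields", fields);
{code}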
On Tue, Feb 16, 2010 at 10:22 PM, Markus Weimer <mar...@weimo.de> wrote:

> Hi,
>
> that looks like cool stuff! Does it support arbitrary Avro-serializable
> types?
>
> Thanks,
>
> Markus
>
> On Mon, Feb 15, 2010 at 7:54 PM, Drew Farris (JIRA) <j...@apache.org> wrote:
>
> > [ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Drew Farris updated MAHOUT-274:
> > -------------------------------
> >
> > Attachment: mahout-avro-examples.tar.bz
> >
> > Status update with a new tarball, which contains a Maven project (mvn
> > clean install should do the trick).
> >
> > README.txt is included; the relevant portions appear below.
> >
> > Provided are two different versions of AvroInputFormat/AvroOutputFormat
> > that are compatible with the mapred (pre-0.20) and mapreduce (0.20+)
> > APIs. They are based on code provided as part of MAPREDUCE-815 and other
> > patches. Also provided are backports of the
> > SerializationBase/AvroSerialization classes from the current hadoop-core
> > trunk.
> >
> > When writing a job using the pre-0.20 APIs:
> >
> > Add serializations:
> >
> > {code}
> > conf.setStrings("io.serializations",
> >     new String[] {
> >         WritableSerialization.class.getName(),
> >         AvroSpecificSerialization.class.getName(),
> >         AvroReflectSerialization.class.getName(),
> >         AvroGenericSerialization.class.getName()
> >     });
> > {code}
> >
> > Set up input and output formats:
> >
> > {code}
> > conf.setInputFormat(AvroInputFormat.class);
> > conf.setOutputFormat(AvroOutputFormat.class);
> >
> > AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
> > AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
> > {code}
> >
> > AvroInputFormat provides the specified class as the key and a
> > LongWritable file offset as the value. AvroOutputFormat expects the
> > specified class as the key and a NullWritable as the value.
> >
> > If an Avro-serializable class is passed between the map and reduce
> > phases, it is necessary to set the following:
> >
> > {code}
> > AvroComparator.setSchema(AvroDocument._SCHEMA);
> > conf.setClass("mapred.output.key.comparator.class",
> >     AvroComparator.class, RawComparator.class);
> > {code}
> >
> > So far I've been using Avro 'specific' serialization, which compiles an
> > Avro schema into a Java class; see
> > src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc. This is
> > currently compiled into classes o.a.m.avro.document
> > (AvroDocument|AvroField) using o.a.m.avro.util.AvroDocumentCompiler
> > (eventually to be replaced by a Maven plugin; generated sources are
> > currently checked in).
> >
> > Helper classes for AvroDocument and AvroField include
> > o.a.m.avro.document.Avro(Document|Field)Builder and
> > o.a.m.avro(Document|Field)Reader. This seems to work OK here, but I'm
> > not certain that this is the best pattern to use, especially when there
> > are many pre-existing classes (such as there are in the case of Vector).
> >
> > Avro also provides reflection-based serialization and schema-based
> > serialization. Both should be supported by the infrastructure that has
> > been backported here, but that's something else to explore.
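> >
> > For a flavor of the reflect path, here is a minimal sketch against stock
> > Avro's reflect API (the SimpleDoc class and file name are invented for
> > this example; none of this is code from the patch):
> >
> > {code}
> > import java.io.File;
> > import java.io.IOException;
> > import org.apache.avro.Schema;
> > import org.apache.avro.file.DataFileWriter;
> > import org.apache.avro.reflect.ReflectData;
> > import org.apache.avro.reflect.ReflectDatumWriter;
> >
> > public class ReflectExample {
> >   // A plain Java class; Avro derives its schema from the fields by
> >   // reflection -- no generated code or hand-written schema required.
> >   public static class SimpleDoc {
> >     public String name;
> >     public String content;
> >   }
> >
> >   public static void main(String[] args) throws IOException {
> >     // Derive the schema from the class and write one record to a file.
> >     Schema schema = ReflectData.get().getSchema(SimpleDoc.class);
> >     DataFileWriter<SimpleDoc> writer = new DataFileWriter<SimpleDoc>(
> >         new ReflectDatumWriter<SimpleDoc>(schema));
> >     writer.create(schema, new File("simple-docs.avro"));
> >
> >     SimpleDoc doc = new SimpleDoc();
> >     doc.name = "content";
> >     doc.content = "the quick brown fox";
> >     writer.append(doc);
> >     writer.close();
> >   }
> > }
> > {code}
> >
> > Reading back should be symmetric via DataFileReader and a
> > ReflectDatumReader.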
> >
> > Examples:
> >
> > These are quick and dirty and need much cleanup work before they can be
> > taken out to the dance. See o.a.m.avro.text, o.a.m.avro.text.mapred and
> > o.a.m.avro.text.mapreduce:
> >
> > * AvroDocumentsFromDirectory: a quick-and-dirty port of
> >   SequenceFilesFromDirectory that uses AvroDocuments. Writes a file
> >   containing documents in Avro format; each input file's contents are
> >   stored in a single field named 'content', in the originalText portion
> >   of that field.
> > * AvroDocumentsDumper: dumps an Avro documents file to standard output.
> > * AvroDocumentsWordCount: performs a word count on an Avro document
> >   input file.
> > * AvroDocumentProcessor: tokenizes the text found in the input document
> >   file, reading from the originalText of the field named 'content', and
> >   writes the original document plus its tokens to the output file.
> >
> > Running the examples (haven't tested with the hadoop driver yet):
> >
> > {code}
> > mvn exec:java \
> >     -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
> >     -Dexec.args='--parent /home/drew/mahout/20news-18828 \
> >     --outputDir /home/drew/mahout/20news-18828-example \
> >     --charset UTF-8'
> >
> > mvn exec:java \
> >     -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
> >     -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed'
> >
> > mvn exec:java \
> >     -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
> >     -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-00000' > foobar.txt
> > {code}
> >
> > The Wikipedia stuff is in there, but isn't working yet. Many thanks (and
> > apologies) to Robin for the starting point for much of this code, which
> > I have hacked to pieces so badly.
> >
> > > Use avro for serialization of structured documents.
> > > ---------------------------------------------------
> > >
> > >                 Key: MAHOUT-274
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-274
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >            Reporter: Drew Farris
> > >            Priority: Minor
> > >         Attachments: mahout-avro-examples.tar.bz, mahout-avro-examples.tar.gz
> > >
> > > Explore the intersection between Writables and Avro to see how
> > > serialization can be improved within Mahout.
> > > An intermediate goal is to provide a structured document format that
> > > can be serialized using Avro as an Input/OutputFormat and Writable.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.

--
Ted Dunning, CTO
DeepDyve