The intent is to support an Avro-encoded document type that is roughly like a Lucene document.
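
To give the flavor, here is a hypothetical sketch in Avro's generic API (the schema below is invented for illustration, not the actual AvroDocument.avsc from Drew's patch): a document is essentially an array of named fields, each holding its original text, much as a Lucene Document is a list of named Fields.

{code}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.util.Utf8;

// Hypothetical schema: a document is an array of (name, originalText)
// fields -- invented for illustration only.
String json =
    "{\"type\": \"record\", \"name\": \"Document\", \"fields\": ["
    + " {\"name\": \"fields\", \"type\": {\"type\": \"array\", \"items\":"
    + "  {\"type\": \"record\", \"name\": \"Field\", \"fields\": ["
    + "   {\"name\": \"name\", \"type\": \"string\"},"
    + "   {\"name\": \"originalText\", \"type\": \"string\"}]}}}]}";

Schema docSchema = Schema.parse(json);
Schema arraySchema = docSchema.getField("fields").schema();

// Build one named field and attach it to a document record.
GenericRecord field = new GenericData.Record(arraySchema.getElementType());
field.put("name", new Utf8("content"));
field.put("originalText", new Utf8("the quick brown fox"));

GenericData.Array<GenericRecord> fields =
    new GenericData.Array<GenericRecord>(1, arraySchema);
fields.add(field);

GenericRecord doc = new GenericData.Record(docSchema);
doc.put("fields", fields);
{code}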
On Tue, Feb 16, 2010 at 10:22 PM, Markus Weimer <mar...@weimo.de> wrote:

> Hi,
>
> that looks like cool stuff! Does it support arbitrary Avro-serializable
> types?
>
> Thanks,
>
> Markus
>
> On Mon, Feb 15, 2010 at 7:54 PM, Drew Farris (JIRA) <j...@apache.org> wrote:
>
> > [ https://issues.apache.org/jira/browse/MAHOUT-274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >
> > Drew Farris updated MAHOUT-274:
> > -------------------------------
> >
> > Attachment: mahout-avro-examples.tar.bz
> >
> > Status update with a new tarball, which contains a Maven project (mvn
> > clean install should do the trick).
> >
> > README.txt is included; the relevant portions appear below.
> >
> > Provided are two different versions of AvroInputFormat/AvroOutputFormat
> > that are compatible with the mapred (pre-0.20) and mapreduce (0.20+)
> > APIs. They are based on code provided as part of MAPREDUCE-815 and other
> > patches. Also provided are backports of the
> > SerializationBase/AvroSerialization classes from the current hadoop-core
> > trunk.
> >
> > When writing a job using the pre-0.20 APIs:
> >
> > Add serializations:
> >
> > {code}
> > conf.setStrings("io.serializations",
> >     new String[] {
> >         WritableSerialization.class.getName(),
> >         AvroSpecificSerialization.class.getName(),
> >         AvroReflectSerialization.class.getName(),
> >         AvroGenericSerialization.class.getName()
> >     });
> > {code}
> >
> > Set up input and output formats:
> >
> > {code}
> > conf.setInputFormat(AvroInputFormat.class);
> > conf.setOutputFormat(AvroOutputFormat.class);
> >
> > AvroInputFormat.setAvroInputClass(conf, AvroDocument.class);
> > AvroOutputFormat.setAvroOutputClass(conf, AvroDocument.class);
> > {code}
> >
> > AvroInputFormat provides the specified class as the key and a
> > LongWritable file offset as the value. AvroOutputFormat expects the
> > specified class as the key and a NullWritable as the value.
> >
> > If an Avro-serializable class is passed between the map and reduce
> > phases, it is necessary to set the following:
> >
> > {code}
> > AvroComparator.setSchema(AvroDocument._SCHEMA);
> > conf.setClass("mapred.output.key.comparator.class",
> >     AvroComparator.class, RawComparator.class);
> > {code}
> >
> > So far I've been using Avro 'specific' serialization, which compiles an
> > Avro schema into a Java class; see
> > src/main/schemata/org/apache/mahout/avro/AvroDocument.avsc. This is
> > currently compiled into classes o.a.m.avro.document
> > (AvroDocument|AvroField) using o.a.m.avro.util.AvroDocumentCompiler
> > (eventually to be replaced by a Maven plugin; generated sources are
> > currently checked in).
> >
> > Helper classes for AvroDocument and AvroField include
> > o.a.m.avro.document.Avro(Document|Field)Builder and
> > o.a.m.avro(Document|Field)Reader. This seems to work OK here, but I'm
> > not certain that this is the best pattern to use, especially when there
> > are many pre-existing classes (such as there are in the case of Vector).
> >
> > Avro also provides reflection-based serialization and schema-based
> > serialization. Both should be supported by the infrastructure that has
> > been backported here, but that's something else to explore.
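> >
> > For a flavor of the reflect path, here is a minimal sketch against stock
> > Avro's reflect API (the SimpleDoc class and file name are invented for
> > this example; none of this is code from the patch):
> >
> > {code}
> > import java.io.File;
> > import java.io.IOException;
> > import org.apache.avro.Schema;
> > import org.apache.avro.file.DataFileWriter;
> > import org.apache.avro.reflect.ReflectData;
> > import org.apache.avro.reflect.ReflectDatumWriter;
> >
> > public class ReflectExample {
> >   // A plain Java class; Avro derives its schema from the fields by
> >   // reflection -- no generated code or hand-written schema required.
> >   public static class SimpleDoc {
> >     public String name;
> >     public String content;
> >   }
> >
> >   public static void main(String[] args) throws IOException {
> >     // Derive the schema from the class and write one record to a file.
> >     Schema schema = ReflectData.get().getSchema(SimpleDoc.class);
> >     DataFileWriter<SimpleDoc> writer = new DataFileWriter<SimpleDoc>(
> >         new ReflectDatumWriter<SimpleDoc>(schema));
> >     writer.create(schema, new File("simple-docs.avro"));
> >
> >     SimpleDoc doc = new SimpleDoc();
> >     doc.name = "content";
> >     doc.content = "the quick brown fox";
> >     writer.append(doc);
> >     writer.close();
> >   }
> > }
> > {code}
> >
> > Reading back should be symmetric via DataFileReader and a
> > ReflectDatumReader.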
> >
> > Examples:
> >
> > These are quick and dirty and need much cleanup work before they can be
> > taken out to the dance. See o.a.m.avro.text, o.a.m.avro.text.mapred and
> > o.a.m.avro.text.mapreduce:
> >
> > * AvroDocumentsFromDirectory: a quick-and-dirty port of
> >   SequenceFilesFromDirectory that uses AvroDocuments. Writes a file
> >   containing documents in Avro format; each input file's contents are
> >   stored in a single field named 'content', in the originalText portion
> >   of that field.
> > * AvroDocumentsDumper: dumps an Avro documents file to standard output.
> > * AvroDocumentsWordCount: performs a word count on an Avro document
> >   input file.
> > * AvroDocumentProcessor: tokenizes the text found in the input document
> >   file, reading from the originalText of the field named 'content', and
> >   writes the original document plus its tokens to the output file.
> >
> > Running the examples (haven't tested with the hadoop driver yet):
> >
> > {code}
> > mvn exec:java \
> >     -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsFromDirectory \
> >     -Dexec.args='--parent /home/drew/mahout/20news-18828 \
> >     --outputDir /home/drew/mahout/20news-18828-example \
> >     --charset UTF-8'
> >
> > mvn exec:java \
> >     -Dexec.mainClass=org.apache.mahout.avro.text.mapred.AvroDocumentProcessor \
> >     -Dexec.args='/home/drew/mahout/20news-18828-example /home/drew/mahout/20news-18828-processed'
> >
> > mvn exec:java \
> >     -Dexec.mainClass=org.apache.mahout.avro.text.AvroDocumentsDumper \
> >     -Dexec.args='/home/drew/mahout/20news-18828-processed/.avro-r-00000' > foobar.txt
> > {code}
> >
> > The Wikipedia stuff is in there, but isn't working yet. Many thanks (and
> > apologies) to Robin for the starting point for much of this code, which
> > I have hacked to pieces so badly.
> >
> > > Use avro for serialization of structured documents.
> > > ---------------------------------------------------
> > >
> > >                 Key: MAHOUT-274
> > >                 URL: https://issues.apache.org/jira/browse/MAHOUT-274
> > >             Project: Mahout
> > >          Issue Type: Improvement
> > >            Reporter: Drew Farris
> > >            Priority: Minor
> > >         Attachments: mahout-avro-examples.tar.bz, mahout-avro-examples.tar.gz
> > >
> > > Explore the intersection between Writables and Avro to see how
> > > serialization can be improved within Mahout.
> > > An intermediate goal is to provide a structured document format that
> > > can be serialized using Avro as an Input/OutputFormat and Writable.
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.

--
Ted Dunning, CTO
DeepDyve