On Mon, Jun 7, 2010 at 3:02 AM, Ted Dunning <[email protected]> wrote:

>
> Drew, I especially would like to hear what you think about how this would
> relate to the Avro document stuff you did.
>

On a first read of your description, it seems we could implement a CSV -> Avro
structured-document mapping and then modify the vectorization/learning code to
take Avro structured documents (ASDs) as input. Users would develop their own
processors to convert from their format to ASD, or perhaps something in the
vein of Solr's DataImportHandler could be used as a general tool to load from
databases or XML into ASD. At a lower level, this makes your proposed Lucene
IndexWriter interface implementation very attractive.
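For concreteness, a minimal Avro schema for such a structured document might look something like the following. This is only a sketch of the idea; the record and field names here (StructuredDocument, id, fields) are hypothetical, not anything already in Mahout:

```json
{
  "type": "record",
  "name": "StructuredDocument",
  "namespace": "org.example.asd",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "fields", "type": {"type": "map", "values": "string"}}
  ]
}
```

A CSV importer would then just map column headers to keys in the `fields` map, and a DataImportHandler-style tool could populate the same record from database rows or XML elements.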

I am a little skeptical about the utility of Avro for storing the vectors
themselves. Some early tests suggested that using Avro reflection to derive a
schema from an existing class (such as one of Mahout's vector classes) did not
produce a large win in performance or space, but more work in that direction
still needs to happen.

I'll take a look at how the data loading relates to the rest of the code in
your patch and come back with questions.

The approach to vectorization sounds like a pretty neat idea. I'm also
interested in seeing how the code that dumps a vector plus its trace to a
human-readable form works.

Drew
