I did an initial cut at MAHOUT-65 but was blocked by the
serialization/deserialization needed to fully address the requirements.
Now that Gson is in lib it makes sense to use it. The Dirichlet package
already has a JsonVectorAdapter which could be rewritten. It is a big
change to the on-disk format, but most jobs have an initial step that
consumes e.g. CSV files, so changing it should not break much
compatibility. I will take another crack at it.
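For illustration, here is a minimal sketch (plain Python, standard library only) of the kind of JSON round-trip a vector adapter has to provide; the field names and functions are made up for this sketch and do not reflect JsonVectorAdapter's actual schema:

```python
import json

def vector_to_json(v):
    """Serialize a sparse vector (cardinality plus index->value map)
    to a JSON string.  Schema is illustrative only."""
    return json.dumps({
        "cardinality": v["cardinality"],
        # JSON object keys must be strings, so stringify the indices
        "values": {str(i): x for i, x in v["values"].items()},
    })

def vector_from_json(s):
    """Inverse of vector_to_json: rebuild the sparse vector,
    converting the string keys back to integer indices."""
    d = json.loads(s)
    return {
        "cardinality": d["cardinality"],
        "values": {int(i): x for i, x in d["values"].items()},
    }

v = {"cardinality": 10, "values": {0: 1.5, 7: -2.0}}
assert vector_from_json(vector_to_json(v)) == v
```

A real adapter would presumably also need to record the concrete Vector class so deserialization can restore the right implementation.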
Jeff
Grant Ingersoll wrote:
I'm about to write some code to prepare docs for clustering and I know
at least a few others on the list here have done the same. I was
wondering if anyone is in a position to share their code and
contribute to Mahout.
As I see it, we need to be able to take in text and create the matrix
of terms, where each cell is the TF/IDF weight (or some other weight;
it would be nice for this to be pluggable) and then normalize the
vector (and, according to Ted, we should support using different
norms). Seems like we also need the label stuff in place
(https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure on
the state of that patch.
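To make the intended pipeline concrete, here is a small in-memory sketch of TF/IDF weighting with a pluggable Lp norm; it is plain Python illustrating the math, not the distributed Mahout implementation:

```python
import math
from collections import Counter

def tfidf_matrix(docs, norm=2.0):
    """Build a term-document matrix of TF/IDF weights, then normalize
    each document vector by a pluggable Lp norm (p = `norm`).
    A plain in-memory sketch; real corpora need an M/R job."""
    # document frequency: number of docs each term appears in
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vocab = sorted(df)
    rows = []
    for doc in docs:
        tf = Counter(doc)
        # raw tf * log(N / df) weighting
        row = [tf[t] * math.log(n / df[t]) for t in vocab]
        # pluggable Lp normalization (p=2 gives the Euclidean norm)
        length = sum(abs(x) ** norm for x in row) ** (1.0 / norm)
        if length > 0:
            row = [x / length for x in row]
        rows.append(row)
    return vocab, rows

docs = [["apache", "mahout", "mahout"], ["apache", "lucene"]]
vocab, m = tfidf_matrix(docs, norm=2.0)
```

Swapping `norm` (1.0 for Manhattan, 2.0 for Euclidean, etc.) is the "different norms" knob Ted mentioned.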
As for the TF/IDF stuff, we sort of have it via the BayesTfIdfDriver,
but it needs to be more generic. I realize we could use Lucene, but
having a solution that scales w/ Lucene is going to take work, AIUI,
whereas a M/R job seems more straightforward.
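As a sketch of why an M/R formulation is straightforward: the document-frequency step is just the classic word-count pattern. Simulated here in plain Python (the map/reduce function names are invented for illustration, not BayesTfIdfDriver's API):

```python
from collections import defaultdict

def map_df(doc_id, terms):
    """Map phase: emit (term, 1) once per document the term appears in."""
    for term in set(terms):
        yield term, 1

def reduce_sum(pairs):
    """Reduce phase: sum the counts per key, yielding each term's
    document frequency."""
    out = defaultdict(int)
    for k, v in pairs:
        out[k] += v
    return dict(out)

corpus = {"d1": ["apache", "mahout", "mahout"], "d2": ["apache", "lucene"]}
pairs = [kv for doc_id, terms in corpus.items()
         for kv in map_df(doc_id, terms)]
df = reduce_sum(pairs)
# df == {"apache": 2, "mahout": 1, "lucene": 1}
```

A second pass over the corpus can then combine these counts with per-document term frequencies to emit the weighted vectors.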
I'd like to be able to get this stuff committed relatively soon and
have the examples available for other people. My shorter-term goal is
some Wikipedia demos I'm working on.
Thanks,
Grant