No need to worry about backward compatibility at this point, even for on-
disk formats. Such is life in a pre-1.0 world. Sure, we should
strive to minimize breakage and get it right the first time, but
that's how it goes in open source.
On May 28, 2009, at 9:07 AM, Jeff Eastman wrote:
I did an initial cut at MAHOUT-65 but was blocked by the
serialization/deserialization needed to fully address the
requirements. Now that Gson is in lib, it makes sense to use it. The
Dirichlet package already has a JsonVectorAdapter which could be
rewritten. It is a big change to the on-disk format, but most jobs
have an initial step that consumes e.g. CSV files, so changing the
format should not break much. I will take another crack at it.
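
For concreteness, here's a rough sketch of the shape of adapter I
have in mind (the DenseVector below is just a stand-in for
illustration, not our real Vector class):

import java.lang.reflect.Type;
import com.google.gson.JsonDeserializationContext;
import com.google.gson.JsonDeserializer;
import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonSerializationContext;
import com.google.gson.JsonSerializer;

// Stand-in vector: a label plus dense values.
class DenseVector {
  String name;
  double[] values;
  DenseVector(String name, double[] values) {
    this.name = name;
    this.values = values;
  }
}

// Round-trips the vector through JSON in both directions.
class VectorAdapter implements JsonSerializer<DenseVector>,
    JsonDeserializer<DenseVector> {
  public JsonElement serialize(DenseVector src, Type type,
      JsonSerializationContext context) {
    JsonObject obj = new JsonObject();
    obj.addProperty("name", src.name);
    obj.add("values", context.serialize(src.values));
    return obj;
  }

  public DenseVector deserialize(JsonElement json, Type type,
      JsonDeserializationContext context) {
    JsonObject obj = json.getAsJsonObject();
    String name = obj.get("name").getAsString();
    double[] values = context.deserialize(obj.get("values"), double[].class);
    return new DenseVector(name, values);
  }
}

Registering it with new GsonBuilder().registerTypeAdapter(DenseVector.class,
new VectorAdapter()).create() would make the on-disk format plain JSON
text, which any downstream job can parse.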
Jeff
Grant Ingersoll wrote:
I'm about to write some code to prepare docs for clustering, and I
know at least a few others on the list here have done the same. I
was wondering if anyone is in a position to share their code and
contribute it to Mahout.
As I see it, we need to be able to take in text and create the
matrix of terms, where each cell is the TF/IDF weight (or some other
weight; it would be nice for this to be pluggable), and then
normalize the vector (and, according to Ted, we should support using
different norms). Seems like we also need the label stuff in place
(https://issues.apache.org/jira/browse/MAHOUT-65) but I'm not sure
of the state of that patch.
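
To make the pluggable part concrete, here's the rough shape I have
in mind (all of these names are made up for illustration; none of
this exists in Mahout today):

// Pluggable cell weight: given term stats, produce the matrix cell value.
interface Weight {
  double calculate(int tf, int df, int numDocs);
}

// Classic TF-IDF as one implementation: tf * log(N / df).
class TfIdfWeight implements Weight {
  public double calculate(int tf, int df, int numDocs) {
    return tf * Math.log((double) numDocs / df);
  }
}

class Norms {
  // Normalize in place by the L_p norm; p = 2 gives unit Euclidean length.
  static void normalize(double[] vector, double p) {
    double sum = 0.0;
    for (double v : vector) {
      sum += Math.pow(Math.abs(v), p);
    }
    double norm = Math.pow(sum, 1.0 / p);
    if (norm > 0.0) {
      for (int i = 0; i < vector.length; i++) {
        vector[i] /= norm;
      }
    }
  }
}

With something like that, TF/IDF is just one Weight implementation
and the norm is a parameter rather than hard-wired.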
As for the TF/IDF stuff, we sort of have it via the
BayesTfIdfDriver, but it needs to be more generic. I realize we
could use Lucene, but having a solution that scales w/ Lucene is
going to take work, AIUI, whereas an M/R job seems more
straightforward.
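
As a strawman, the document-frequency half of TF/IDF is basically a
word-count variant: the mapper emits each distinct term once per
document and the reducer sums. Sketched against the Hadoop mapreduce
API (tokenization here is naive whitespace splitting; a real job
would make the analyzer pluggable):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// One input record per document; emit each distinct term once.
public class DocFreqMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text term = new Text();

  protected void map(LongWritable key, Text doc, Context context)
      throws IOException, InterruptedException {
    Set<String> seen = new HashSet<String>();
    for (String token : doc.toString().toLowerCase().split("\\s+")) {
      if (token.length() > 0 && seen.add(token)) {
        term.set(token);
        context.write(term, ONE); // once per document, not per occurrence
      }
    }
  }
}

// Sum the per-document flags to get df(term).
public class DocFreqReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int df = 0;
    for (IntWritable c : counts) {
      df += c.get();
    }
    context.write(term, new IntWritable(df));
  }
}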
I'd like to get this stuff committed relatively soon and have
examples available for other people. In the shorter term, I'm
working on some demos using Wikipedia.
Thanks,
Grant