>
> Another thing to consider, and here my ignorance shows through… The U*S 
> (equivalent to A*V) transform to V-space must be reversible so that humans 
> can see results in terms of the original term-space. Weights based on the 
> new basis are not really human-understandable. But setting me straight here 
> may be another conversation.

Yes: that is fold-in and fold-out of new observations.

In my experience, fold-in usually handles a much smaller volume at a time
and requires a good index on V. Since Mahout is essentially a batch
analytics framework, fold-in is not a good fit for it per se, IMO (not
that it is that difficult to build on your own, either).
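
To make the fold-in point concrete, here is a minimal sketch in plain
Python, assuming the usual decomposition A ~ U*Sigma*V', so that a new
row a folds in as u = a*V*Sigma^-1. The names fold_in, v_index and sigma
are mine for illustration and are not Mahout API; v_index stands in for
"a good index on V" as a simple term -> row-of-V lookup.

```python
def fold_in(doc, v_index, sigma):
    """Fold a sparse document {term: weight} into the k-dim U space.

    Computes u = a * V * Sigma^-1, touching only the rows of V for terms
    that actually occur in the document.
    """
    k = len(sigma)
    proj = [0.0] * k  # a * V, i.e. the document in U*Sigma space
    for term, weight in doc.items():
        v_row = v_index.get(term)
        if v_row is None:
            continue  # term never seen in the original corpus: no row in V
        for j in range(k):
            proj[j] += weight * v_row[j]
    return [proj[j] / sigma[j] for j in range(k)]


# Toy data (hypothetical): 3 terms, k = 2, orthonormal columns of V.
v_index = {"cat": [0.6, -0.8], "dog": [0.8, 0.6], "emu": [0.0, 0.0]}
sigma = [2.0, 1.0]
u_new = fold_in({"cat": 1.0, "dog": 2.0, "yak": 5.0}, v_index, sigma)
```

The work per document is proportional to (number of non-zero terms) * k,
which is why this is a lookup problem rather than a batch job.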

Fold-out... I am not particularly sure what it would be useful for
(perhaps for cluster centroids in your case, yes), but it would also be
a small job compared to the initial corpus processing, not a big one.
Therefore, there is not much interest in having that in Mahout either.
Result post-processing, once it is no longer bulk and not especially
science-laden, is probably of little interest for this project.
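
For the centroid case, a fold-out sketch in plain Python, under the same
assumption A ~ U*Sigma*V': a point c in U*Sigma space maps back to
approximate term weights as c*V'. The names fold_out, top_terms and
v_index are hypothetical, not Mahout API. If the centroids were computed
on rows of U rather than U*Sigma, pass sigma so the point is rescaled
first.

```python
def fold_out(point, v_index, sigma=None):
    """Map a k-dim point back to term space: weights = c * V'.

    point   -- coordinates in U*Sigma space (or in U space if sigma is given)
    v_index -- {term: row of V as a list of k floats}
    sigma   -- optional singular values, used to rescale a U-space point
    """
    if sigma is not None:
        point = [c * s for c, s in zip(point, sigma)]
    return {
        term: sum(c * v for c, v in zip(point, v_row))
        for term, v_row in v_index.items()
    }


def top_terms(weights, n):
    """Return the n (term, weight) pairs with the largest weights."""
    return sorted(weights.items(), key=lambda tw: tw[1], reverse=True)[:n]


# Toy data (hypothetical): a centroid in U*Sigma space, 3 terms, k = 2.
v_index = {"cat": [0.6, -0.8], "dog": [0.8, 0.6], "emu": [0.0, 0.0]}
centroid_terms = fold_out([2.2, 0.4], v_index)
readable = top_terms(centroid_terms, 2)
```

This only recovers the rank-k approximation of the term weights, not the
original rows of A, but that is usually enough to label a cluster.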

I have been giving this some thought for a while now and had the idea of
a good Mahout-R integration package for result post-processing and
plotting, but was somewhat short on pragmatic interest and time to do
it. Since neither fold-in nor fold-out generates a lot of flops, doing
them in R is actually more than feasible, and also a whole lot more
customizable.

>
> On Sep 7, 2012, at 4:52 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> I can do a patch to propagate names of named vectors from A to U too
> if that's a requirement for what you do. But we need to make sure it
> solves your problem. I am still not sure what IDs are in your
> definition and what is required for k-means.
>
> Thinking of that, it's probably a worthy patch anyway. I'll write
> something up along with API changes for A*Sigma outputs. I think since
> there are so many output options, they should be redesigned not to be
> mutually exclusive.
>
> On Fri, Sep 7, 2012 at 4:37 PM, Pat Ferrel <[email protected]> wrote:
>> Yes, I would love to use namedvectors. But no matter, doing a key-to-row 
>> lookup is easy enough.
>>
>> I'm not getting any id at all in the cluster data, not even a key for a row.
>>
>> I'm beginning to think this is a clustering problem since rowsimilarity at 
>> least gives me row keys to identify objects associated with an object.
>>
>> On Sep 7, 2012, at 2:59 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>
>> Yeah, seq2sparse seems to have a -nv option now to churn out named
>> vectors too. It doesn't seem to be listed in the MIA book though.
>>
>> On Fri, Sep 7, 2012 at 2:55 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>> On Fri, Sep 7, 2012 at 2:27 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>> Sequence file keys, on the other hand, are
>>>> what is populated by seq2sparse output, so they are useful for mapping
>>>> results to original documents.
>>>
>>> Although honestly I am not so sure about seq2sparse anymore. It has
>>> been some time since I last looked at this.
>>
>
