On Jul 18, 2011, at 6:09 PM, Sean Owen wrote:

> Right! But how do you do that if you only saved co-occurrence counts?
> 
> You can surely pull a very similarly-shaped trick to calculate the
> cosine measure; that's exactly what this paper is doing in fact. But
> it's a different computation.
> 
> Right now the job saves *all* the info it might need to calculate any
> of these things later. And that's heavy.

Yes.  That is the thing I am questioning.  Do we need to do that?  I'm arguing 
that doing so makes for an algorithm that doesn't scale, even if it is correct.

> 
> On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
>> On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
>> 
>>> How do you implement, for instance, the cosine similarity with this output?
>>> That's the intent behind preserving this info, which is surely a lot
>>> to preserve.
>>> 
>> 
>> Sorry to jump in the middle of this, but cosine is not too hard to do with
>> nice combiners: it can be computed by first normalizing the rows and then
>> doing my ubiquitous "outer product of columns" trick on the resultant
>> corpus (this latter job uses combiners easily because the mappers do all
>> the multiplications and the reducers are simply sums, which are
>> commutative and associative).
>> 
>> Not sure about the other fancy similarities.
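
For concreteness, here is a minimal plain-Java sketch of the two-pass scheme
Jake describes. The class and method names are hypothetical (this is not
Mahout's actual RowSimilarityJob, and it uses dense arrays in one process
rather than sparse vectors over Hadoop): pass one unit-normalizes each row;
pass two sums outer products of the *columns* of the normalized matrix,
which yields N * N^T, i.e. the cosine between every pair of rows. Since only
addition happens after the map phase, combiners apply safely.

import java.util.*;

public class CosineSketch {

  // Pass 1 (a map-only job in practice): scale each row to unit L2 norm.
  static double[][] normalizeRows(double[][] a) {
    double[][] n = new double[a.length][];
    for (int r = 0; r < a.length; r++) {
      double norm = 0.0;
      for (double v : a[r]) norm += v * v;
      norm = Math.sqrt(norm);
      n[r] = new double[a[r].length];
      for (int c = 0; c < a[r].length; c++) {
        n[r][c] = norm == 0.0 ? 0.0 : a[r][c] / norm;
      }
    }
    return n;
  }

  // Pass 2: for each column, a "mapper" multiplies every pair of nonzero
  // entries and emits the product keyed by the row pair; "reducers" (and
  // combiners) only add, and addition is commutative and associative, so
  // partial sums can be combined anywhere in the pipeline.
  static Map<String, Double> rowCosines(double[][] n) {
    int rows = n.length, cols = n[0].length;
    Map<String, Double> sums = new HashMap<>();
    for (int c = 0; c < cols; c++) {            // one column per map call
      for (int i = 0; i < rows; i++) {
        if (n[i][c] == 0.0) continue;
        for (int j = i + 1; j < rows; j++) {
          if (n[j][c] == 0.0) continue;
          sums.merge(i + "," + j, n[i][c] * n[j][c], Double::sum);
        }
      }
    }
    return sums;  // key "i,j" -> cosine(row i, row j)
  }

  public static void main(String[] args) {
    double[][] a = { {1, 1, 0}, {0, 1, 1}, {1, 0, 1} };
    // Each pair of rows shares one of two nonzero entries, so every
    // pairwise cosine here comes out to 0.5.
    System.out.println(rowCosines(normalizeRows(a)));
  }
}

Note the point of the structure: nothing downstream of the map phase needs
the original entries, only running sums, which is exactly what makes the
combiner optimization possible.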

--------------------------
Grant Ingersoll


