On Jul 18, 2011, at 6:09 PM, Sean Owen wrote:

> Right! But how do you do that if you only saved co-occurrence counts?
>
> You can surely pull a very similarly-shaped trick to calculate the
> cosine measure; that's exactly what this paper is doing, in fact. But
> it's a different computation.
>
> Right now the job saves *all* the info it might need to calculate any
> of these things later. And that's heavy.
Yes. That is the thing I am questioning. Do we need to do that? I'm
arguing that doing so makes for an algorithm that doesn't scale, even
if it is correct.

> On Mon, Jul 18, 2011 at 11:06 PM, Jake Mannix <[email protected]> wrote:
>> On Mon, Jul 18, 2011 at 2:53 PM, Sean Owen <[email protected]> wrote:
>>
>>> How do you implement, for instance, the cosine similarity with this
>>> output? That's the intent behind preserving this info, which is
>>> surely a lot to preserve.
>>>
>>
>> Sorry to jump in the middle of this, but cosine is not too hard to
>> compute with nice combiners: it can be done by first normalizing the
>> rows and then doing my ubiquitous "outer product of columns" trick on
>> the resultant corpus (this latter job uses combiners easily because
>> the mappers do all the multiplications and the reducers are simply
>> sums, which are commutative and associative).
>>
>> Not sure about the other fancy similarities.

--------------------------
Grant Ingersoll
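
For concreteness, here is a rough sketch of the combiner-friendly job Jake
describes. It is an illustration under stated assumptions, not Mahout's
actual implementation: it assumes the matrix has already been row-normalized
to unit length and transposed, so that each input record is one column
encoded as hypothetical whitespace-separated "rowId:value" pairs. With
unit-norm rows, the sum over all columns of the per-column products for a
row pair is exactly that pair's cosine similarity. It also makes Sean's
point concrete: for binary preference data, cosine(i, j) =
cooc(i, j) / sqrt(count(i) * count(j)), so raw co-occurrence counts alone
are not enough to recover cosine; at minimum the per-row counts would have
to be kept as well.

// Illustrative sketch (not the actual Mahout job) of the "outer product
// of columns" trick for row-wise cosine similarity over a matrix whose
// rows have already been normalized to unit length. Input format and
// class names here are hypothetical.
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ColumnOuterProductSketch {

  // Mapper: all of the multiplication happens here. For each pair of
  // non-zero entries in one column, emit their product keyed by the
  // row pair.
  public static class PairProductMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text column, Context ctx)
        throws IOException, InterruptedException {
      String line = column.toString().trim();
      if (line.isEmpty()) {
        return; // skip blank records
      }
      String[] entries = line.split("\\s+");
      for (int i = 0; i < entries.length; i++) {
        String[] a = entries[i].split(":");
        double va = Double.parseDouble(a[1]);
        for (int j = i + 1; j < entries.length; j++) {
          String[] b = entries[j].split(":");
          double vb = Double.parseDouble(b[1]);
          // Canonicalize the key so the same row pair always shuffles
          // to the same reducer, regardless of entry order.
          String pair = a[0].compareTo(b[0]) < 0
              ? a[0] + "," + b[0] : b[0] + "," + a[0];
          ctx.write(new Text(pair), new DoubleWritable(va * vb));
        }
      }
    }
  }

  // Reducer: a plain sum. Because addition is commutative and
  // associative, the same class can be registered as the combiner via
  // job.setCombinerClass(SumReducer.class). Given unit-norm rows, the
  // final sum for a row pair is its cosine similarity.
  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text rowPair, Iterable<DoubleWritable> products,
        Context ctx) throws IOException, InterruptedException {
      double sum = 0.0;
      for (DoubleWritable p : products) {
        sum += p.get();
      }
      ctx.write(rowPair, new DoubleWritable(sum));
    }
  }
}

The design point is that the mapper does all the multiplication and
everything downstream is pure addition, so the shuffle carries partial
sums rather than raw products. Similarities whose per-pair statistic does
not reduce to a single commutative, associative aggregate (the "other
fancy similarities") don't get this benefit for free.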
