This is not only super sparse, it is also super skewed. Having 71% of the votes come from 0.05% of the users is (a) suspicious, since it suggests the data set may have been spammed, and (b) more skewed than any other real data set I have heard of. I would be tempted to plot user rank against the number of articles each user voted on (log-log axes, so a Zipf curve shows up as a straight line) and look for the point where users start to sit above the Zipf curve. I would then just eliminate those users from the data and see what happens.
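One way to sketch that check in Python, using only the standard library. Everything named here is an assumption made for illustration: the `votes_per_user` mapping (user id to number of articles voted on), the `skip_top` cutoff that keeps the heaviest users out of the fit, and the `factor` threshold for "above the curve". None of it comes from the data set or from Mahout.

```python
import math

def fit_zipf(counts_desc, skip_top=10):
    """Least-squares fit of log(count) = a - s * log(rank).

    The top `skip_top` ranks are left out of the fit so that suspected
    heavy voters do not pull the curve up toward themselves.
    """
    xs, ys = [], []
    for rank, count in enumerate(counts_desc, start=1):
        if rank > skip_top and count > 0:
            xs.append(math.log(rank))
            ys.append(math.log(count))
    if len(xs) < 2:
        raise ValueError("need at least two users beyond skip_top to fit")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, -slope  # (a, s)

def rank_curve(votes_per_user, skip_top=10):
    """Return (user, rank, count, predicted) rows, ready for a log-log plot."""
    ranked = sorted(votes_per_user.items(), key=lambda kv: kv[1], reverse=True)
    a, s = fit_zipf([count for _, count in ranked], skip_top)
    return [
        (user, rank, count, math.exp(a - s * math.log(rank)))
        for rank, (user, count) in enumerate(ranked, start=1)
    ]

def users_above_zipf(votes_per_user, skip_top=10, factor=3.0):
    """Users whose vote count exceeds the fitted curve by more than `factor`."""
    return [
        user
        for user, _rank, count, predicted in rank_curve(votes_per_user, skip_top)
        if count > factor * predicted
    ]
```

The rows from `rank_curve` can be handed to any plotting tool to eyeball where the head of the distribution lifts off the fitted line. Dropping the flagged users is then a one-liner such as `{u: c for u, c in votes_per_user.items() if u not in set(users_above_zipf(votes_per_user))}`. Both `skip_top` and `factor` are arbitrary starting values and would need tuning against the actual plot.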
LDA should definitely be on your horizon for testing with such sparse data as Jake points out.

On Wed, Feb 24, 2010 at 10:17 PM, Jake Mannix <[email protected]> wrote:

>
> > 470,640 users
> > 1,606,789 articles
> > 13,281,941 votes (0.00175% nonzero)
> > 43% of users voted on 3 or fewer articles (one vote per month)
> > 23% of users voted on more than 10 articles (87% of the data)
> > 0.05% of users voted on more than 100 articles (71% of votes)
> >
>
> Interesting data set - far more "items" than users, and *really*
> sparse. SVD could definitely give you super crappy results for
> this data set, if my intuition is right.

--
Ted Dunning, CTO
DeepDyve
