This is not only super sparse, it is also super skewed.  Having 71% of the
votes come from 0.05% of the users is (a) suspicious, since it suggests the
data set may have been spammed, and (b) more skewed than any other real data
set I have heard of.  I would be tempted to plot user rank vs number of
articles voted on and look for the point where users sit above the Zipf
curve.  I would then just eliminate those users from the data and see what
happens.
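
A rough sketch of that filtering step, in case it helps.  It is only one way
to do it: it fits a straight line in log-log space to the body of the
rank/count curve and flags the top-ranked users whose counts sit well above
that line.  The function name and the `head` and `factor` parameters are my
own assumptions for illustration, not anything from the data set or from
Mahout, and the thresholds would need tuning by eye against the actual plot.

```python
import math


def zipf_outliers(votes_per_user, head=0.01, factor=3.0):
    """Return ids of top-ranked users whose vote count is far above a Zipf fit.

    votes_per_user: dict mapping user id -> number of articles voted on (>= 1).
    head:   fraction of top-ranked users excluded from the fit and then
            checked against it (assumed value, tune by eye).
    factor: how many times above the fitted curve counts as suspicious
            (assumed value, tune by eye).
    """
    # Rank users by vote count, most active first.
    ranked = sorted(votes_per_user.items(), key=lambda kv: kv[1], reverse=True)
    n = len(ranked)
    skip = max(1, int(n * head))
    if n - skip < 2:
        raise ValueError("not enough users outside the head to fit a line")

    # Least-squares fit of log(count) = a + b * log(rank) on the body only,
    # so that suspected spammers in the head do not pull the line upward.
    xs = [math.log(rank) for rank in range(skip + 1, n + 1)]
    ys = [math.log(ranked[rank - 1][1]) for rank in range(skip + 1, n + 1)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    # Flag head users whose actual count exceeds the extrapolated curve.
    outliers = []
    for rank, (user, count) in enumerate(ranked[:skip], start=1):
        predicted = math.exp(a + b * math.log(rank))
        if count > factor * predicted:
            outliers.append(user)
    return outliers
```

Dropping the returned users from the vote matrix and rerunning the
recommender is then the "see what happens" part.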

LDA should definitely be on your horizon for testing with such sparse data,
as Jake points out.

On Wed, Feb 24, 2010 at 10:17 PM, Jake Mannix <[email protected]> wrote:

>
> > 470,640 users
> > 1,606,789 articles
> > 13,281,941 votes (0.00175% nonzero)
> > 43% of users voted on 3 or fewer articles (one vote per month)
> > 23% of users voted on more than 10 articles (87% of the data)
> > 0.05% of users voted on more than 100 articles (71% of votes)
> >
>
> Interesting data set - far more "items" than users, and *really*
> sparse.  SVD could definitely give you super crappy results for
> this data set, if my intuition is right.




-- 
Ted Dunning, CTO
DeepDyve
