Having such sparse data is going to make it very difficult to do anything
at all.  For instance, if you have only one non-zero in a row, there is no
cooccurrence to analyze and that row should be deleted.  With only two
non-zeros, you have to be very careful about drawing any inferences.
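
If it helps, here is a rough sketch of what I mean by deleting those rows.
It assumes your input DRM is the usual <IntWritable, VectorWritable>
sequence file; adjust the key class if yours differs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.VectorWritable;

    // Copy the input matrix, dropping rows with fewer than two non-zeros.
    public class DropSingletonRows {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
            new Path(args[1]), IntWritable.class, VectorWritable.class);
        IntWritable row = new IntWritable();
        VectorWritable vec = new VectorWritable();
        while (reader.next(row, vec)) {
          if (vec.get().getNumNondefaultElements() >= 2) {
            writer.append(row, vec);
          }
        }
        reader.close();
        writer.close();
      }
    }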

The other aspect of sparsity is that you only have 600 books.  That may
mean that you would be better served by using a matrix decomposition
technique.
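
If you go the decomposition route, it is easy to try with the in-memory
Taste API before committing to anything on Hadoop.  A minimal sketch using
ALS, with the caveat that the feature count, lambda, and iteration count
below are guesses to tune rather than recommendations, and "prefs.csv" is a
placeholder for your userID,itemID,preference file:

    import java.io.File;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
    import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;

    public class DecompositionSketch {
      public static void main(String[] args) throws Exception {
        // CSV of userID,itemID,preference triples
        DataModel model = new FileDataModel(new File("prefs.csv"));
        // 20 features, lambda 0.065, 10 ALS iterations -- starting points
        ALSWRFactorizer factorizer =
            new ALSWRFactorizer(model, 20, 0.065, 10);
        Recommender recommender = new SVDRecommender(model, factorizer);
        // top 5 recommendations for user 1, just to sanity-check the model
        System.out.println(recommender.recommend(1L, 5));
      }
    }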

One question I have is whether you have other actions besides purchase that
indicate engagement with the books.  Can you record which users browse a
certain book?  How about whether they have read the reviews?
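
If you do have those signals, one simple way to fold them into the input
matrix is to weight each action type when you accumulate preferences.  The
action names and weights below are hypothetical stand-ins for whatever your
logs actually record; keeping each action in its own indicator matrix is
another option worth considering.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: weights for turning mixed engagement events into
    // a single preference value per (user, book) cell.
    public class ActionWeights {
      private static final Map<String, Double> WEIGHTS =
          new HashMap<String, Double>();
      static {
        WEIGHTS.put("purchase", 1.0);     // strongest signal
        WEIGHTS.put("browse", 0.3);       // weak but plentiful
        WEIGHTS.put("read_review", 0.5);  // somewhere in between
      }

      public static double weightFor(String action) {
        Double w = WEIGHTS.get(action);
        return w == null ? 0.0 : w;
      }
    }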



On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <edith...@gmail.com> wrote:

> Hi
>
> My RowSimilarityJob returns a DRM with some rows missing.  The input file
> is very sparse: there are about 600 columns, but only 1-6 of them have a
> value in each row.  The missing rows in the output are the ones with only
> 1-2 values filled.  Not all rows with 1 or 2 values are missing, just some
> of them, and the missing rows are not always the same from one
> RowSimilarityJob execution to the next.
>
> What I would like to achieve is to find the relative strength of the
> association between rows.  For example, if there are 600 books and user1
> and user2 each like only one book (the same one), then there should be a
> correlation between these two users.
>
> But my RowSimilarityJob output file seems to skip some of the users with
> sparse preferences.  I am running the job locally with 4 options: input,
> output, SIMILARITY_LOGLIKELIHOOD, and temp dir.  What would be the right
> approach to pick up similarity between users with sparse preferences?
>
> Thanks!
>
> Edith
>
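
For reference, driving the job from Java with those same four options looks
roughly like this; the paths are placeholders, and the class name and
option names are as of Mahout 0.9:

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

    public class RunRowSimilarity {
      public static void main(String[] args) throws Exception {
        ToolRunner.run(new RowSimilarityJob(), new String[] {
            "--input", "/path/to/input",    // placeholder path
            "--output", "/path/to/output",  // placeholder path
            "--similarityClassname", "SIMILARITY_LOGLIKELIHOOD",
            "--tempDir", "/tmp/rowsim"      // placeholder path
        });
      }
    }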
