You mentioned a matrix decomposition technique.  Should I run the SVD job
instead of RowSimilarityJob?  I found a page that describes the SVD job,
and it seems like that's what I should try.  However, I notice the SVD job
does not take a similarity class as input.  Would the SVD job return a DRM
with similarity vectors?  Also, I am not sure how to determine the
decomposition rank.  In the book example above, would the rank be 600?

https://mahout.apache.org/users/dim-reduction/dimensional-reduction.html
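To check my own understanding, I put together a toy numpy sketch (my code, not Mahout's) of what I think the pipeline would look like.  If I have it right, the SVD itself only returns factor matrices, so similarity would be a separate pass over the reduced-rank rows, and the rank k has to be much smaller than 600 (the smaller matrix dimension), not equal to it:

```python
import numpy as np

# Toy user x book matrix: 5 users, 600 books, very sparse (1-6 values per row),
# mimicking my data.  The sizes here are made up for illustration.
rng = np.random.default_rng(42)
A = np.zeros((5, 600))
for i in range(5):
    cols = rng.choice(600, size=rng.integers(1, 7), replace=False)
    A[i, cols] = 1.0

# Truncated SVD at rank k << 600.  The decomposition itself returns
# factor matrices (U, s, Vt), not similarity vectors.
k = 3
U, s, Vt = np.linalg.svd(A, full_matrices=False)
users_reduced = U[:, :k] * s[:k]      # each user as a dense k-dim vector

# Row similarities come from a separate step, e.g. cosine over the
# reduced rows (guarding against all-zero rows).
norms = np.linalg.norm(users_reduced, axis=1, keepdims=True)
norms[norms == 0] = 1.0
unit = users_reduced / norms
sim = unit @ unit.T                   # 5x5 user-user similarity matrix
```

Even users with a single non-zero column get a dense k-dimensional vector this way, which is what I am hoping solves the missing-rows problem.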


I see your point on using other information (e.g., browsing history) to
"boost" correlation.   This is something I will try after my demo deadline
(or sooner, if I cannot find another way to solve the DRM sparsity
problem).   By the way, I took the Solr/Mahout combo approach you described
in your book.  It works very well for the cases where a Mahout similarity
vector is present.
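Incidentally, to convince myself why rows with only one or two non-zeros are fragile, I worked through the log-likelihood ratio test that I believe SIMILARITY_LOGLIKELIHOOD is based on.  A rough sketch (my own code with made-up counts, not Mahout's implementation):

```python
import math

def xlogx(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return 0.0 if x == 0 else x * math.log(x)

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 cooccurrence table:
    k11 = both events together, k12/k21 = one event alone, k22 = neither."""
    total = k11 + k12 + k21 + k22
    return 2.0 * (
        xlogx(k11) + xlogx(k12) + xlogx(k21) + xlogx(k22)
        - xlogx(k11 + k12) - xlogx(k21 + k22)
        - xlogx(k11 + k21) - xlogx(k12 + k22)
        + xlogx(total)
    )

# Perfectly independent events score 0; repeated cooccurrence scores high.
independent = llr(10, 10, 10, 10)   # ~0.0, no association
associated = llr(10, 2, 2, 1000)    # large, strong association
```

With only one or two cooccurrences the score swings a lot with small count changes, which matches your warning about drawing inferences from rows with one or two non-zeros.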

Thanks for your help.  Much appreciated.
Edith


On Tue, Jul 22, 2014 at 9:12 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Having such sparse data is going to make it very difficult to do anything
> at all.  For instance, if you have only one non-zero in a row, there is no
> cooccurrence to analyze and that row should be deleted.  With only two
> non-zeros, you have to be very careful about drawing any inferences.
>
> The other aspect of sparsity is that you only have 600 books.  That may
> mean that you would be better served by using a matrix decomposition
> technique.
>
> One question I have is whether you have other actions besides purchase that
> indicate engagement with the books.  Can you record which users browse a
> certain book?  How about whether they have read the reviews?
>
>
>
> On Tue, Jul 22, 2014 at 8:59 AM, Edith Au <edith...@gmail.com> wrote:
>
> > Hi
> >
> > My RowSimilarityJob returns a DRM with some rows missing.   The input
> > file is very sparse: there are about 600 columns, but only 1 to 6 have
> > a value in each row.   The output file has some rows missing.  The
> > missing rows are the ones with only 1 or 2 values filled.  Not all rows
> > with 1 or 2 values are missing, just some of them, and the missing rows
> > are not always the same for each RowSimilarityJob execution.
> >
> > What I would like to achieve is to find the relative strength of the
> > relationship between rows.  For example, if there are 600 books and
> > user1 and user2 each like only one book (the same book), then there
> > should be a correlation between these two users.
> >
> > But my RowSimilarityJob output file seems to skip some of the users with
> > sparse preferences.  I am running the job locally with 4 options: input,
> > output, SIMILARITY_LOGLIKELIHOOD, and temp dir.   What would be the right
> > approach to pick up similarity between users with sparse preferences?
> >
> > Thanks!
> >
> > Edith
> >
>
