Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel
facepalm, missed that. Thanks.

On Jun 10, 2014, at 4:29 PM, Ted Dunning (JIRA)  wrote:


   [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027208#comment-14027208
 ] 

Ted Dunning commented on MAHOUT-1464:
-

Matrix and Vector already have something that can be used:

{code}
Vector counts = x.aggregateColumns(new VectorFunction() {
  @Override
  public double apply(Vector f) {
    return f.aggregate(Functions.PLUS, Functions.greater(0));
  }
});
{code}

> Cooccurrence Analysis on Spark
> --
> 
>Key: MAHOUT-1464
>URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>Project: Mahout
> Issue Type: Improvement
> Components: Collaborative Filtering
>Environment: hadoop, spark
>   Reporter: Pat Ferrel
>   Assignee: Pat Ferrel
>Fix For: 1.0
> 
>Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
> 
> 
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027208#comment-14027208
 ] 

Ted Dunning commented on MAHOUT-1464:
-

Matrix and Vector already have something that can be used:

{code}
Vector counts = x.aggregateColumns(new VectorFunction() {
  @Override
  public double apply(Vector f) {
return f.aggregate(Functions.PLUS, Functions.greater(0));
  }
});
{code}
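
A per-row count presumably works the same way via Matrix.aggregateRows (a
sketch, untested, assuming aggregateRows carries the same contract; written in
Scala, since that is what the Spark side uses):

{code}
import org.apache.mahout.math.Vector
import org.apache.mahout.math.function.{Functions, VectorFunction}

// sketch (untested): the row mirror of the column version above
val rowCounts: Vector = x.aggregateRows(new VectorFunction {
  override def apply(f: Vector): Double =
    f.aggregate(Functions.PLUS, Functions.greater(0))
})
{code}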

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027202#comment-14027202
 ] 

Pat Ferrel commented on MAHOUT-1464:


OK, good to know. So the fix above for rows is no good either; oh bother.

If I have to write specific code, might it be better put in the Drm and/or 
Vector?

> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027159#comment-14027159
 ] 

Pat Ferrel edited comment on MAHOUT-1464 at 6/10/14 11:09 PM:
--

I think the same thing is happening with the number of item interactions:

// Broadcast vector containing the number of interactions with each thing
val bcastNumInteractions = drmBroadcast(drmI.colSums) // sums?

This broadcasts a vector of sums. We need a getNumNonZeroElements() for column 
vectors, or rather a way to get a Vector of non-zero counts per column. We could 
get them from the rows of the transposed matrix before doing the multiply of 
A.t %*% A or B.t %*% A, in which case we’d get non-zero counts from the rows. 
Either way I don’t see a way to get a vector of these values without doing a 
mapBlock on the transposed matrix. Am I missing something?

Currently the IndexedDataset is a very thin wrapper, but I could add two 
vectors containing the number of non-zero elements for rows and columns. In 
that case I would perhaps have it extend CheckpointedDrm. Since CheckpointedDrm 
extends DrmLike it could be used in the DSL algebra directly, in which case it 
would be simple to do the right thing with these vectors as well as the two id 
dictionaries for transpose and multiply, but it’s a slippery slope.

Before I go off in the wrong direction: is there an existing way to get a vector 
of non-zero counts for rows or columns?
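
For what it’s worth, one route that seems possible within the DSL (a minimal 
sketch, untested; it assumes mapBlock may mutate its block in place, and drmA 
stands for whatever DRM is at hand): binarize each block, then colSums of the 
resulting 0/1 matrix is exactly the per-column non-zero count.

{code}
import scala.collection.JavaConverters._

// sketch (untested): set every non-zero entry to 1, then colSums of the
// 0/1 matrix yields the per-column non-zero counts
val drmBinary = drmA.mapBlock() {
  case (keys, block) =>
    for (r <- 0 until block.nrow; el <- block(r, ::).nonZeroes().asScala)
      el.set(1.0)
    keys -> block
}
val nnzPerColumn = drmBinary.colSums
{code}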





> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Ted Dunning (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027182#comment-14027182
 ] 

Ted Dunning commented on MAHOUT-1464:
-

I don't think that numNonZero can be trusted here.  The contract it provides is 
to return an upper bound on the number of non-zeros, not a precise value.

Better to write specific code.
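
For a single vector, an exact count can be had by aggregating a 0/1 predicate 
over the entries instead of trusting the bound; a minimal sketch against the 
existing Vector API:

{code}
// sketch: counts entries strictly greater than zero, so it is exact for
// non-negative interaction data (unlike the getNumNonZeroElements() bound)
val exactCount = v.aggregate(Functions.PLUS, Functions.greater(0))
{code}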



> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: TreeBasedRecommenders(Deprecated?)

2014-06-10 Thread Pat Ferrel
There are simple ways to do this without maintaining a separate recommender.

First you can simply cluster the input matrix of users by items. Then recommend 
items closest to the centroid of the cluster the user’s couple of items were 
in. But this seems dubious for several reasons.

Better yet (maybe controversial, since I don’t know the mathematical 
justification for this), you could cluster the indicator matrix of items by 
similar items. This at least clusters “important” similar items.

But it is even easier than clustering: if you know a couple of items the user 
has preferred, just get the items most similar to those directly from the 
indicator matrix. The indicator matrix is organized with an item per row, and 
each row holds similar items by strength of similarity. Add all the rows the 
user has interacted with (using the strength values), sort, and recommend the 
top n. The in-memory item-based recommender will give you the similar items for 
each item the user preferred; all you need to do is add and sort (a sketch 
follows below).
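
A minimal sketch of that sum-and-sort (untested; indicators, userItems, 
numItems, and n are placeholder names, not existing Mahout API):

{code}
import scala.collection.JavaConverters._
import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
import org.apache.mahout.math.function.Functions

// sketch: sum the indicator rows for the user's preferred items, then
// take the top n item indices by combined strength
val scores: Vector = new RandomAccessSparseVector(numItems)
for (item <- userItems)
  scores.assign(indicators.viewRow(item), Functions.PLUS)
val topN = scores.nonZeroes().asScala.toSeq
  .sortBy(-_.get)
  .take(n)
  .map(e => (e.index, e.get))
{code}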

In the true cold start problem you have items and/or users with no interactions 
at all. This calls for a metadata recommender and some context. If a user is on 
the page of a product with no interactions, the metadata must tell you which 
items are similar. In the case where you have a user with no interactions and 
no context, you have to fall back on things like the time-worn popular and 
trending items.

You are certainly welcome here, but questions like this usually go to the 
u...@mahout.apache.org list.

On Jun 10, 2014, at 4:50 AM, Sahil Sharma  wrote:

Hi,

One place where tree-based recommenders (that is, using hierarchical
clustering) might be useful is the cold start problem. That is, suppose a
user has only bought a few items (say 2 or 3). It's kind of hard to
capture that user's interests using a user-based collaborative filtering
recommender.
Also, the use of an item-based collaborative filtering recommender turns out
to be time consuming.
In such a setting it makes sense to cluster the items together (using some
clustering algorithm) and then use the user's purchased items to
recommend (based on which cluster those purchased items belong to).
On Jun 10, 2014 4:41 PM, "Sebastian Schelter"  wrote:

> Hi Sahil,
> 
> don't worry, you're not breaking any rules. We removed the tree-based
> recommenders because we have never heard of anyone using them over the
> years.
> 
> --sebastian
> 
> On 06/10/2014 09:01 AM, Sahil Sharma wrote:
> 
>> Hi,
>> 
>> Firstly I apologize if I'm breaking certain rules by mailing this way, I'm
>> new to this and would appreciate any help I could get.
>> 
>> I was just playing around with the tree-based Recommender ( which seems to
>> be deprecated in the current version "for the lack of use" ) .
>> 
>> Why was it deprecated?
>> 
>> Also, I just looked at the code, and it seems to be doing a lot of
>> redundant computations, for example we could store a matrix of
>> cluster-cluster distances ( and hence avoid recomputing the closest
>> clusters every time by updating the matrix whenever we merge two clusters)
>> and also , when trying to determine the farthest distance based similarity
>> between two clusters again the pair which realizes this could be stored ,
>> and updated upon merging so that this computation need not to repeated
>> again and again.
>> 
>> Just wondering if this repeated computation was not a reason for
>> deprecating the class ( since people might have found a slow recommender
>> "lacking use" ) .
>> 
>> Would be glad to hear the thoughts of others on this, and also implement
>> an
>> efficient version if the community agrees.
>> 
>> 
> 



[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027159#comment-14027159
 ] 

Pat Ferrel commented on MAHOUT-1464:


I think the same thing is happening with the number of item interactions:

// Broadcast vector containing the number of interactions with each thing
val bcastNumInteractions = drmBroadcast(drmI.colSums) // sums?

This broadcasts a vector of sums. We need a getNumNonZeroElements() for column 
vectors, or rather a way to get a Vector of non-zero counts per column. We could 
get them from the rows of the transposed matrix before doing the multiply of 
A.t %*% A or B.t %*% A, in which case we’d get non-zero counts from the rows. 
Either way I don’t see a way to get a vector of these values without doing a 
mapBlock on the transposed matrix. Am I missing something?

Currently the IndexedDataset is a very thin wrapper, but I could add two 
vectors containing the number of non-zero elements for rows and columns. In 
that case I would perhaps have it extend CheckpointedDrm. Since CheckpointedDrm 
extends DrmLike it could be used in the DSL algebra directly, in which case it 
would be simple to do the right thing with these vectors as well as the two id 
dictionaries for transpose and multiply, but it’s a slippery slope.

Before I go off in the wrong direction: is there an existing way to get a vector 
of non-zero counts for rows or columns?


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Time series anomaly detection MAHOUT-1423

2014-06-10 Thread Ted Dunning
Have you looked at the code?

This might also help:

http://info.mapr.com/resources_ebook_anewlook_anomalydetection.html?cid=blog

http://berlinbuzzwords.de/session/deep-learning-high-performance-time-series-databases




On Tue, Jun 10, 2014 at 2:28 AM, matteo poletti  wrote:

> Hi everybody,
>
> We are three students at TU Berlin currently enrolled in a class given by
> Sebastian Schelter on scalable data processing. In the next weeks we'll
> work on a project related to Mahout. We would like to work on time series
> anomaly detection referring to this issue:
> https://issues.apache.org/jira/browse/MAHOUT-1423.
>
> Do you have any suggestion to approach this issue? Can someone of you
> provide additional material related to this issue?
>
> Thank you!
> Andrea
> Daniel
> Matteo


Re: TreeBasedRecommenders(Deprecated?)

2014-06-10 Thread Ted Dunning
Sahil,

You say:

Also the use of item-based collaborative filtering recommender turns out to 
be time consuming.


In my experience, item-based systems tend to be the fastest ones.

Perhaps we mean different things.

What I mean is similar to the approach where indicator behaviors are
computed and searched using something like a traditional search engine.
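
To make that concrete, a minimal sketch of the idea (not Mahout code; it 
assumes Lucene 4.x on the classpath, with one document per item whose 
"indicators" field holds the ids of that item's indicator items):

{code}
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.document.{Document, Field, TextField}
import org.apache.lucene.index.{DirectoryReader, IndexWriter, IndexWriterConfig}
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.RAMDirectory
import org.apache.lucene.util.Version

// index: one document per item, listing its indicator item ids
val dir = new RAMDirectory()
val analyzer = new StandardAnalyzer(Version.LUCENE_46)
val writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_46, analyzer))
val doc = new Document()
doc.add(new TextField("id", "item4", Field.Store.YES))
doc.add(new TextField("indicators", "item17 item32 item89", Field.Store.NO))
writer.addDocument(doc)
writer.close()

// recommend: query with the user's recent history; top-scoring items win
val searcher = new IndexSearcher(DirectoryReader.open(dir))
val parser = new QueryParser(Version.LUCENE_46, "indicators", analyzer)
val hits = searcher.search(parser.parse("item17 item89"), 10).scoreDocs
{code}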





On Tue, Jun 10, 2014 at 4:50 AM, Sahil Sharma  wrote:

> Hi,
>
> One place where tree based recommenders(that is using hierarchical
> clustering) might be useful is a cold start problem.  That is suppose a
> user has only bought a few items ( say 2 or 3)  It's kind of hard to
> capture that user's interests using a user-based collaborative filtering
> recommender.
> Also the use of item-based collaborative filtering recommender turns out to
> be time consuming.
> In such a setting it makes sense to cluster the items together ( using some
> clustering algorithm)  and then use the user's purchased item to
> recommend(based on which cluster those purchased items belong to).
> On Jun 10, 2014 4:41 PM, "Sebastian Schelter"  wrote:
>
> > Hi Sahil,
> >
> > don't worry, you're not breaking any rules. We removed the tree-based
> > recommenders because we have never heard of anyone using them over the
> > years.
> >
> > --sebastian
> >
> > On 06/10/2014 09:01 AM, Sahil Sharma wrote:
> >
> >> Hi,
> >>
> >> Firstly I apologize if I'm breaking certain rules by mailing this way,
> I'm
> >> new to this and would appreciate any help I could get.
> >>
> >> I was just playing around with the tree-based Recommender ( which seems
> to
> >> be deprecated in the current version "for the lack of use" ) .
> >>
> >> Why was it deprecated?
> >>
> >> Also, I just looked at the code, and it seems to be doing a lot of
> >> redundant computations, for example we could store a matrix of
> >> cluster-cluster distances ( and hence avoid recomputing the closest
> >> clusters every time by updating the matrix whenever we merge two
> clusters)
> >> and also , when trying to determine the farthest distance based
> similarity
> >> between two clusters again the pair which realizes this could be stored
> ,
> >> and updated upon merging so that this computation need not to repeated
> >> again and again.
> >>
> >> Just wondering if this repeated computation was not a reason for
> >> deprecating the class ( since people might have found a slow recommender
> >> "lacking use" ) .
> >>
> >> Would be glad to hear the thoughts of others on this, and also implement
> >> an
> >> efficient version if the community agrees.
> >>
> >>
> >
>


[jira] [Commented] (MAHOUT-1572) blockify() to detect (naively) the data sparsity in the loaded data

2014-06-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026865#comment-14026865
 ] 

Hudson commented on MAHOUT-1572:


SUCCESS: Integrated in Mahout-Quality #2649 (See 
[https://builds.apache.org/job/Mahout-Quality/2649/])
MAHOUT-1572: blockify() to detect (naively) the data sparsity in the loaded 
data (dlyubimov: rev 8c529ccff23d419c4cb5191b0435de40d6a9831c)
* spark/src/main/scala/org/apache/mahout/sparkbindings/drm/package.scala
* CHANGELOG
* spark/src/test/scala/org/apache/mahout/sparkbindings/drm/DrmLikeSuite.scala


> blockify() to detect (naively) the data sparsity in the loaded data 
> 
>
> Key: MAHOUT-1572
> URL: https://issues.apache.org/jira/browse/MAHOUT-1572
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> per [~ssc]:
> bq. a dense matrix is converted into a SparseRowMatrix with dense row vectors 
> by blockify(), after serialization this becomes a dense matrix in sparse 
> format (triggering OOMs)! 
> i guess we can look at the first row vector and go on to either DenseMatrix or 
> SparseRowMatrix
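> (A minimal sketch of that naive probe, untested; firstRow, nrow, ncol are 
> illustrative names:)
> {code}
> // sketch: choose the block's backing matrix by probing the first row vector
> val block =
>   if (firstRow.isDense) new DenseMatrix(nrow, ncol)
>   else new SparseRowMatrix(nrow, ncol)
> {code}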



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026831#comment-14026831
 ] 

ASF GitHub Bot commented on MAHOUT-1529:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/15#issuecomment-45654741
  
1529 is closed now. besides, it doesn't have anything to do with shell.

it's fine this is a small change, i'll merge it without issue


On Tue, Jun 10, 2014 at 11:38 AM, Anand Avati 
wrote:

> I assumed this is part of MAHOUT-1529 itself (which renamed @sc to @sdc).
> Let me resubmit with MAHOUT-1529 in the commit message?
>
> —
> Reply to this email directly or view it on GitHub
> .
>


> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1571) Functional Views are not serialized as dense/sparse correctly

2014-06-10 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026832#comment-14026832
 ] 

Hudson commented on MAHOUT-1571:


SUCCESS: Integrated in Mahout-Quality #2648 (See 
[https://builds.apache.org/job/Mahout-Quality/2648/])
MAHOUT-1571: Functional Views are not serialized as dense/sparse correctly 
(dlyubimov: rev 907781bb856b47cb7b180484c6d4b9f55a6df038)
* math/src/main/java/org/apache/mahout/math/FunctionalMatrixView.java
* math/src/test/java/org/apache/mahout/math/MatricesTest.java
* CHANGELOG


> Functional Views are not serialized as dense/sparse correctly
> -
>
> Key: MAHOUT-1571
> URL: https://issues.apache.org/jira/browse/MAHOUT-1571
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> per [~ssc] 
>  - all entries of a TransposeView (and possibly other views) of a sparse 
> matrix are serialized, resulting in OOM



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1529) Finalize abstraction of distributed logical plans from backend operations

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026818#comment-14026818
 ] 

ASF GitHub Bot commented on MAHOUT-1529:


Github user avati commented on the pull request:

https://github.com/apache/mahout/pull/15#issuecomment-45653931
  
I assumed this is part of MAHOUT-1529 itself (which renamed @sc to @sdc). 
Let me resubmit with MAHOUT-1529 in the commit message?


> Finalize abstraction of distributed logical plans from backend operations
> -
>
> Key: MAHOUT-1529
> URL: https://issues.apache.org/jira/browse/MAHOUT-1529
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> We have a few situations when algorithm-facing API has Spark dependencies 
> creeping in. 
> In particular, we know of the following cases:
> -(1) checkpoint() accepts Spark constant StorageLevel directly;-
> -(2) certain things in CheckpointedDRM;-
> -(3) drmParallelize etc. routines in the "drm" and "sparkbindings" package.-
> -(5) drmBroadcast returns a Spark-specific Broadcast object-
> (6) Stratosphere/Flink conceptual api changes.
> *Current tracker:* PR #1 https://github.com/apache/mahout/pull/1 - closed, 
> need new PR for remaining things once ready.
> *Pull requests are welcome*.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1572) blockify() to detect (naively) the data sparsity in the loaded data

2014-06-10 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1572:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> blockify() to detect (naively) the data sparsity in the loaded data 
> 
>
> Key: MAHOUT-1572
> URL: https://issues.apache.org/jira/browse/MAHOUT-1572
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> per [~ssc]:
> bq. a dense matrix is converted into a SparseRowMatrix with dense row vectors 
> by blockify(), after serialization this becomes a dense matrix in sparse 
> format (triggering OOMs)! 
> i guess we can look at the first row vector and go on to either DenseMatrix or 
> SparseRowMatrix



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1572) blockify() to detect (naively) the data sparsity in the loaded data

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026809#comment-14026809
 ] 

ASF GitHub Bot commented on MAHOUT-1572:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/10


> blockify() to detect (naively) the data sparsity in the loaded data 
> 
>
> Key: MAHOUT-1572
> URL: https://issues.apache.org/jira/browse/MAHOUT-1572
> Project: Mahout
>  Issue Type: Bug
>Reporter: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> per [~ssc]:
> bq. a dense matrix is converted into a SparseRowMatrix with dense row vectors 
> by blockify(), after serialization this becomes a dense matrix in sparse 
> format (triggering OOMs)! 
> i guess we can look at the first row vector and go on to either DenseMatrix or 
> SparseRowMatrix



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAHOUT-1571) Functional Views are not serialized as dense/sparse correctly

2014-06-10 Thread Dmitriy Lyubimov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Lyubimov updated MAHOUT-1571:
-

Resolution: Fixed
Status: Resolved  (was: Patch Available)

> Functional Views are not serialized as dense/sparse correctly
> -
>
> Key: MAHOUT-1571
> URL: https://issues.apache.org/jira/browse/MAHOUT-1571
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> per [~ssc] 
>  - all entries of a TransposeView (and possibly other views) of a sparse 
> matrix are serialized, resulting in OOM



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1571) Functional Views are not serialized as dense/sparse correctly

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026766#comment-14026766
 ] 

ASF GitHub Bot commented on MAHOUT-1571:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/9


> Functional Views are not serialized as dense/sparse correctly
> -
>
> Key: MAHOUT-1571
> URL: https://issues.apache.org/jira/browse/MAHOUT-1571
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> per [~ssc] 
>  - all entries of a TransposeView (and possibly other views) of a sparse 
> matrix are serialized, resulting in OOM



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026742#comment-14026742
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45646307
  
i assume this is current PR for MAHOUT-1464?


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026738#comment-14026738
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45646072
  
the cooccurrence analysis code should go into the math-scala module, not the 
spark module, as it is independent of the underlying engine.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026719#comment-14026719
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov closed the pull request at:

https://github.com/apache/mahout/pull/8


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter

Hi Pat,

We truncate the indicators to the top-k and you don't want the 
self-comparison in there. So I don't see a reason to not exclude it as 
early as possible.


--sebastian

On 06/10/2014 05:28 PM, Pat Ferrel wrote:

Still getting the wrong values with non-boolean input so I’ll continue to look 
at it.

Another question: computeIndicators seems to exclude self-comparison during 
A’A and, of course, not for B’A. Since this returns the indicator matrix for 
the general case, shouldn’t it include those values? It seems like they should 
be filtered out in the output phase, if anywhere, and then only by option. If 
we were actually returning a multiply we’d include those.

 // exclude co-occurrences of the item with itself
 if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


It seems like the downsampleAndBinarize method is returning the wrong values: 
it is actually summing the values where it should be counting the non-zero 
elements.

// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {

  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean the sum will not be the number of
  // interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
  // (note: getNumNonZeroElements() returns an int, so a toDouble may be
  // needed to keep the sample-rate division below fractional)

  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }



Cooccurrence Analysis on Spark
--

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
 Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,

MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh



Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)

that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.

Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence

has several applications including cross-action recommendations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)










Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Pat Ferrel
Still getting the wrong values with non-boolean input so I’ll continue to look 
at it.

Another question: computeIndicators seems to exclude self-comparison during 
A’A and, of course, not for B’A. Since this returns the indicator matrix for 
the general case, shouldn’t it include those values? It seems like they should 
be filtered out in the output phase, if anywhere, and then only by option. If 
we were actually returning a multiply we’d include those.

// exclude co-occurrences of the item with itself
if (crossCooccurrence || thingB != thingA) {

On Jun 10, 2014, at 1:49 AM, Sebastian Schelter  wrote:

Oh good catch! I had an extra binarize method before, so that the data was 
already binary. I merged that into the downsample code and must have overlooked 
that thing. You are right, numNonZeros is the way to go!


On 06/10/2014 01:11 AM, Ted Dunning wrote:
> Sounds like a very plausible root cause.
> 
> 
> 
> 
> 
> On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:
> 
>> 
>> [
>> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
>> ]
>> 
>> Pat Ferrel commented on MAHOUT-1464:
>> 
>> 
>> seems like the downsampleAndBinarize method is returning the wrong values.
>> It is actually summing the values where it should be counting the non-zero
>> elements?
>> 
>> // Downsample the interaction vector of each user
>> for (userIndex <- 0 until keys.size) {
>> 
>>   val interactionsOfUser = block(userIndex, ::) // this is a Vector
>>   // if the values are non-boolean the sum will not be the number of
>>   // interactions, it will be a sum of strength-of-interaction, right?
>>   // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
>>   val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think
>> 
>>   val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser
>> 
>>   interactionsOfUser.nonZeroes().foreach { elem =>
>>     val numInteractionsWithThing = numInteractions(elem.index)
>>     val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing
>> 
>>     if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
>>       // We ignore the original interaction value and create a binary 0-1 matrix
>>       // as we only consider whether interactions happened or did not happen
>>       downsampledBlock(userIndex, elem.index) = 1
>>     }
>>   }
>> 
>> 
>>> Cooccurrence Analysis on Spark
>>> --
>>> 
>>> Key: MAHOUT-1464
>>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>> Project: Mahout
>>>  Issue Type: Improvement
>>>  Components: Collaborative Filtering
>>> Environment: hadoop, spark
>>>Reporter: Pat Ferrel
>>>Assignee: Pat Ferrel
>>> Fix For: 1.0
>>> 
>>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>> run-spark-xrsj.sh
>>> 
>>> 
>>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
>> a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>> has several applications including cross-action recommendations.
>> 
>> 
>> 
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
>> 
> 




[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14026549#comment-14026549
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/8#issuecomment-45626614
  
Go ahead and hit the button. Still have a bit more to do here.


On Jun 9, 2014, at 6:47 PM, Dmitriy Lyubimov  
wrote:

you can close -- but since i originated the PR, it is easier for me (I have 
access to the "close" button on it while everyone else would have to use a 
"close apache/mahout#8" commit to do the same.) 


On Mon, Jun 9, 2014 at 5:20 PM, Pat Ferrel  
wrote: 

> According to the instructions I merge from my branch anyway. I can close 
> it, right? Is there an instruction for closing without merging? 
> 
> I assume you got my mail about finding the blocker; now there are some 
> questions about the cooccurrence algo itself. 
> 
> — 
> Reply to this email directly or view it on GitHub 
> . 
>
—
Reply to this email directly or view it on GitHub.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: [mahout] MAHOUT-1464 Cooccurrence Analysis on Spark (#8)

2014-06-10 Thread Pat Ferrel
Go ahead and hit the button. Still have a bit more to do here.


On Jun 9, 2014, at 6:47 PM, Dmitriy Lyubimov  wrote:

you can close -- but since i originated the PR, it is easier for me (I have 
access to the "close" button on it while everyone else would have to use a 
"close apache/mahout#8" commit to do the same.) 


On Mon, Jun 9, 2014 at 5:20 PM, Pat Ferrel  wrote: 

> According to the instructions I merge from my branch anyway. I can close 
> it, right? Is there an instruction for closing without merging? 
> 
> I assume you got my mail about finding the blocker; now there are some 
> questions about the cooccurrence algo itself. 
> 
> — 
> Reply to this email directly or view it on GitHub 
> . 
>
—
Reply to this email directly or view it on GitHub.




Re: TreeBasedRecommenders(Deprecated?)

2014-06-10 Thread Sahil Sharma
Hi,

One place where tree-based recommenders (that is, using hierarchical
clustering) might be useful is the cold start problem. That is, suppose a
user has only bought a few items (say 2 or 3). It's kind of hard to
capture that user's interests using a user-based collaborative filtering
recommender.
Also, the use of an item-based collaborative filtering recommender turns out
to be time consuming.
In such a setting it makes sense to cluster the items together (using some
clustering algorithm) and then use the user's purchased items to
recommend (based on which cluster those purchased items belong to).
On Jun 10, 2014 4:41 PM, "Sebastian Schelter"  wrote:

> Hi Sahil,
>
> don't worry, you're not breaking any rules. We removed the tree-based
> recommenders because we have never heard of anyone using them over the
> years.
>
> --sebastian
>
> On 06/10/2014 09:01 AM, Sahil Sharma wrote:
>
>> Hi,
>>
>> Firstly I apologize if I'm breaking certain rules by mailing this way, I'm
>> new to this and would appreciate any help I could get.
>>
>> I was just playing around with the tree-based Recommender ( which seems to
>> be deprecated in the current version "for the lack of use" ) .
>>
>> Why was it deprecated?
>>
>> Also, I just looked at the code, and it seems to be doing a lot of
>> redundant computations, for example we could store a matrix of
>> cluster-cluster distances ( and hence avoid recomputing the closest
>> clusters every time by updating the matrix whenever we merge two clusters)
>> and also , when trying to determine the farthest distance based similarity
>> between two clusters again the pair which realizes this could be stored ,
>> and updated upon merging so that this computation need not to repeated
>> again and again.
>>
>> Just wondering if this repeated computation was not a reason for
>> deprecating the class ( since people might have found a slow recommender
>> "lacking use" ) .
>>
>> Would be glad to hear the thoughts of others on this, and also implement
>> an
>> efficient version if the community agrees.
>>
>>
>


Re: TreeBasedRecommenders(Deprecated?)

2014-06-10 Thread Sebastian Schelter

Hi Sahil,

don't worry, you're not breaking any rules. We removed the tree-based 
recommenders because we have never heard of anyone using them over the 
years.


--sebastian

On 06/10/2014 09:01 AM, Sahil Sharma wrote:

Hi,

Firstly, I apologize if I'm breaking certain rules by mailing this way; I'm
new to this and would appreciate any help I could get.

I was just playing around with the tree-based Recommender (which seems to
be deprecated in the current version "for the lack of use").

Why was it deprecated?

Also, I just looked at the code, and it seems to be doing a lot of
redundant computations. For example, we could store a matrix of
cluster-cluster distances (and hence avoid recomputing the closest
clusters every time by updating the matrix whenever we merge two clusters);
and also, when trying to determine the farthest-distance-based similarity
between two clusters, the pair which realizes this could be stored
and updated upon merging, so that this computation need not be repeated
again and again.

Just wondering if this repeated computation was not a reason for
deprecating the class (since people might have found a slow recommender
"lacking use").

Would be glad to hear the thoughts of others on this, and also implement an
efficient version if the community agrees.





Time series anomaly detection MAHOUT-1423

2014-06-10 Thread matteo poletti
Hi everybody,

We are three students at TU Berlin currently enrolled in a class given by 
Sebastian Schelter on scalable data processing. In the next weeks we'll work on 
a project related to Mahout. We would like to work on time series anomaly 
detection referring to this issue: 
https://issues.apache.org/jira/browse/MAHOUT-1423.

Do you have any suggestions on how to approach this issue? Could any of you 
provide additional material related to it?

Thank you!
Andrea
Daniel
Matteo

Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-10 Thread Sebastian Schelter
Oh good catch! I had an extra binarize method before, so that the data 
was already binary. I merged that into the downsample code and must have 
overlooked that thing. You are right, numNonZeros is the way to go!



On 06/10/2014 01:11 AM, Ted Dunning wrote:

Sounds like a very plausible root cause.





On Mon, Jun 9, 2014 at 4:03 PM, Pat Ferrel (JIRA)  wrote:



 [
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025893#comment-14025893
]

Pat Ferrel commented on MAHOUT-1464:


It seems like the downsampleAndBinarize method is returning the wrong values: 
it is actually summing the values where it should be counting the non-zero 
elements.

// Downsample the interaction vector of each user
for (userIndex <- 0 until keys.size) {

  val interactionsOfUser = block(userIndex, ::) // this is a Vector
  // if the values are non-boolean the sum will not be the number of
  // interactions, it will be a sum of strength-of-interaction, right?
  // val numInteractionsOfUser = interactionsOfUser.sum // doesn't this sum strength of interactions?
  val numInteractionsOfUser = interactionsOfUser.getNumNonZeroElements() // should do this I think

  val perUserSampleRate = math.min(maxNumInteractions, numInteractionsOfUser) / numInteractionsOfUser

  interactionsOfUser.nonZeroes().foreach { elem =>
    val numInteractionsWithThing = numInteractions(elem.index)
    val perThingSampleRate = math.min(maxNumInteractions, numInteractionsWithThing) / numInteractionsWithThing

    if (random.nextDouble() <= math.min(perUserSampleRate, perThingSampleRate)) {
      // We ignore the original interaction value and create a binary 0-1 matrix
      // as we only consider whether interactions happened or did not happen
      downsampledBlock(userIndex, elem.index) = 1
    }
  }



Cooccurrence Analysis on Spark
--

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
 Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,

MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh



Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)

that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.

Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence

has several applications including cross-action recommendations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)







TreeBasedRecommenders(Deprecated?)

2014-06-10 Thread Sahil Sharma
Hi,

Firstly, I apologize if I'm breaking certain rules by mailing this way; I'm
new to this and would appreciate any help I could get.

I was just playing around with the tree-based Recommender (which seems to
be deprecated in the current version "for the lack of use").

Why was it deprecated?

Also, I just looked at the code, and it seems to be doing a lot of
redundant computations. For example, we could store a matrix of
cluster-cluster distances (and hence avoid recomputing the closest
clusters every time by updating the matrix whenever we merge two clusters);
and also, when trying to determine the farthest-distance-based similarity
between two clusters, the pair which realizes this could be stored
and updated upon merging, so that this computation need not be repeated
again and again.

Just wondering if this repeated computation was not a reason for
deprecating the class (since people might have found a slow recommender
"lacking use").

Would be glad to hear the thoughts of others on this, and also implement an
efficient version if the community agrees.

-- 
Best,
Sahil

Sophomore, IIT Madras, India