[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Description: 
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job.
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly.

  was:
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job,
and the distributed algorithm can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
It's necessary to implement a data model which can load data from a Hadoop filesystem directly.
If the data is not very large, we can use this data model and process data of the form [userID,itemID,preference].


> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with data in HDFS, we must run a MapReduce job.
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.
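The idea in a minimal sketch (the class name HdfsFileDataModel is hypothetical and this is not the attached patch; the Hadoop FileSystem and Taste FileDataModel APIs are real): pull the file out of HDFS to a local temp copy and delegate to the existing FileDataModel.

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;

public class HdfsFileDataModel extends FileDataModel {

  public HdfsFileDataModel(String hdfsPath) throws Exception {
    super(copyToLocal(hdfsPath));
  }

  // Materialize the HDFS file locally so FileDataModel can parse it as usual.
  private static File copyToLocal(String hdfsPath) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    File local = File.createTempFile("taste-data", ".csv");
    local.deleteOnExit();
    fs.copyToLocalFile(new Path(hdfsPath), new Path(local.getAbsolutePath()));
    return local;
  }
}

This only helps while the data still fits on one machine, which is the "not very large" [userID,itemID,preference] case the description revisions above talk about.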





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Description: 
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job,
and the distributed algorithm can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
It's necessary to implement a data model which can load data from a Hadoop filesystem directly.
If the data is not very large, we can use this data model and process data of the form [userID,itemID,preference].

  was:
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job,
and the distributed algorithm can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
It's necessary to implement a data model which can load data from a Hadoop filesystem directly,
so that we can process data of the form [userID,itemID,preference].


> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with data in HDFS, we must run a MapReduce job,
> and the distributed algorithm can only process data of the form
> [userID: ItemID1, ItemID2, ItemID3...].
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.
> If the data is not very large, we can use this data model and process data
> of the form [userID,itemID,preference].





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Description: 
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job,
and the distributed algorithm can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
It's necessary to implement a data model which can load data from a Hadoop filesystem directly,
so that we can process data of the form [userID,itemID,preference].

  was:
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job, and the
distributed work can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly, so that we can process data of the form [userID,itemID,preference].


> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with data in HDFS, we must run a MapReduce job,
> and the distributed algorithm can only process data of the form
> [userID: ItemID1, ItemID2, ItemID3...].
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly,
> so that we can process data of the form [userID,itemID,preference].





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Description: 
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job, and the
distributed work can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly, so that we can process data of the form [userID,itemID,preference].

  was:
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job.
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly.


> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with data in HDFS, we must run a MapReduce job, and the
> distributed work can only process data of the form [userID: ItemID1, ItemID2, ItemID3...].
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly, so that we can process data of the form [userID,itemID,preference].





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Status: Patch Available  (was: Open)

> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with data in HDFS, we must run a MapReduce job.
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Description: 
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
If we want to deal with data in HDFS, we must run a MapReduce job.
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly.

  was:
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly.


> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> If we want to deal with data in HDFS, we must run a MapReduce job.
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Description: 
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly.

  was:
As we all know, FileDataModel can only load data from the local filesystem,
but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
It's necessary to implement a data model which can load data from a Hadoop
filesystem directly.
I have an improvement in https://github.com/HuangXiaomeng/mahout


> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.





[jira] [Updated] (MAHOUT-1579) Implement a datamodel which can load data from hadoop filesystem directly

2014-06-12 Thread Xiaomeng Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaomeng Huang updated MAHOUT-1579:
---

Attachment: Mahout-1579.patch

> Implement a datamodel which can load data from hadoop filesystem directly
> -
>
> Key: MAHOUT-1579
> URL: https://issues.apache.org/jira/browse/MAHOUT-1579
> Project: Mahout
>  Issue Type: Improvement
>Reporter: Xiaomeng Huang
>Priority: Minor
> Attachments: Mahout-1579.patch
>
>
> As we all know, FileDataModel can only load data from the local filesystem,
> but big data is usually stored in a Hadoop filesystem (e.g. HDFS).
> It's necessary to implement a data model which can load data from a Hadoop
> filesystem directly.
> I have an improvement in https://github.com/HuangXiaomeng/mahout





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030125#comment-14030125
 ] 

Hudson commented on MAHOUT-1464:


SUCCESS: Integrated in Mahout-Quality #2653 (See 
[https://builds.apache.org/job/Mahout-Quality/2653/])
MAHOUT-1464 Cooccurrence Analysis on Spark (pat) closes apache/mahout#12 (pat: 
rev c1ca30872c622e513e49fc1bb111bc4b8a527d3b)
* spark/src/test/scala/org/apache/mahout/cf/CooccurrenceAnalysisSuite.scala
* math-scala/src/main/scala/org/apache/mahout/math/drm/DistributedEngine.scala
* spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala
* math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala
* math/src/main/java/org/apache/mahout/math/MurmurHash.java
* math-scala/src/main/scala/org/apache/mahout/math/drm/CheckpointedOps.scala
* 
spark/src/test/scala/org/apache/mahout/sparkbindings/drm/RLikeDrmOpsSuite.scala
* spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala
* 
math-scala/src/test/scala/org/apache/mahout/math/scalabindings/MatrixOpsSuite.scala
* CHANGELOG


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 
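A hedged usage sketch of the merged API (the tiny matrix and the context setup are illustrative and assume the post-refactor Scala/Spark bindings; the cooccurrences signature is taken from the patch quoted later in this digest):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.drm._
import org.apache.mahout.sparkbindings._
import org.apache.mahout.cf.CooccurrenceAnalysis

implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "cooccurrence-sketch")

// rows = users, columns = items; 1.0 marks an observed interaction
val interactions = dense((1, 1, 0), (0, 1, 1), (1, 0, 1))
val drmA = drmParallelize(interactions)

// one indicator matrix per input; here just the LLR-sparsified A'A
val indicators = CooccurrenceAnalysis.cooccurrences(drmA)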





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030109#comment-14030109
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/12


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





Taking a dependency on commons.math

2014-06-12 Thread Ravi Mummulla
Is someone looking at refactoring several of the Mahout algorithms to take
a dependency on commons.math where there are overlaps? For example, matrix
decomposition from the commons.math linear algebra library, and some of the
clustering algorithms (k-means, fuzzy k-means, etc.) from the commons.math
machine learning library.

Thanks.


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029631#comment-14029631
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user tdunning commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45934491
  
I hate abbreviations.  If you are asking about naming, use the long name.

If you can assure the input is binary, then going with what we already have
would be nice.



On Thu, Jun 12, 2014 at 10:34 AM, Pat Ferrel 
wrote:

> numNonZeroElementsPerColumn? vs colSums?
>
> OK
>
> —
> Reply to this email directly or view it on GitHub.
>


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029474#comment-14029474
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45923020
  
numNonZeroElementsPerColumn? vs colSums?

OK


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029469#comment-14029469
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13714354
  
--- Diff: spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
+                    maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+    implicit val distributedContext = drmARaw.context
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
+
+    // Compute co-occurrence matrix A'A
+    val drmAtA = drmA.t %*% drmA
+
+    // Compute loglikelihood scores and sparsify the resulting matrix to get the indicator matrix
+    val drmIndicatorsAtA = computeIndicators(drmAtA, numUsers, maxInterestingItemsPerThing, bcastInteractionsPerItemA,
+      bcastInteractionsPerItemA, crossCooccurrence = false)
+
+    var indicatorMatrices = List(drmIndicatorsAtA)
+
+    // Now look at cross-co-occurrences
+    for (drmBRaw <- drmBs) {
+      // Down-sample and pin other interaction matrix
+      val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, maxNumInteractions).checkpoint()
+
+      // Compute & broadcast the number of interactions per thing in B
+      val bcastInteractionsPerThingB = drmBroadcast(drmB.colCounts)
+
+      // Compute cross-co-occurrence matrix B'A
+      val drmBtA = drmB.t %*% drmA
+
+      val drmIndicatorsBtA = computeIndicators(drmBtA, numUsers, maxInterestingItemsPerThing,
+        bcastInteractionsPerThingB, bcastInteractionsPerItemA)
+
+      indicatorMatrices = indicatorMatrices :+ drmIndicatorsBtA
+
+      drmB.uncache()
+    }
+
+    // Unpin downsampled interaction matrix
+    drmA.uncache()
+
+    // Return list of indicator matrices
+    indicatorMatrices
+  }
+
+  /**
+   * Compute loglikelihood ratio
+   * see http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html for details
+   **/
+  def loglikelihoodRatio(numInteractionsWithA: Long, numInteractionsWithB: Long,
+                         numInteractionsWithAandB: Long, numInteractions: Long) = {
+
 
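The quoted diff is cut off here. Conceptually, the four counts map onto the 2x2 contingency table from the blog post linked above; a hedged sketch of that mapping, using Mahout's existing LogLikelihood helper (llr is a stand-in name, not necessarily the patch's exact body):

import org.apache.mahout.math.stats.LogLikelihood

def llr(numInteractionsWithA: Long, numInteractionsWithB: Long,
        numInteractionsWithAandB: Long, numInteractions: Long): Double = {
  val k11 = numInteractionsWithAandB                        // both A and B
  val k12 = numInteractionsWithA - numInteractionsWithAandB // A without B
  val k21 = numInteractionsWithB - numInteractionsWithAandB // B without A
  val k22 = numInteractions - numInteractionsWithA -
    numInteractionsWithB + numInteractionsWithAandB         // neither
  LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
}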

[jira] [Commented] (MAHOUT-1578) Optimizations in matrix serialization

2014-06-12 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029470#comment-14029470
 ] 

Hudson commented on MAHOUT-1578:


SUCCESS: Integrated in Mahout-Quality #2651 (See 
[https://builds.apache.org/job/Mahout-Quality/2651/])
MAHOUT-1578 Optimizations in matrix serialization (ssc) closes apache/mahout#16 
(ssc: rev a8e09cd3aa1c9d6fdda2eaf84f86a06a33963658)
* mrlegacy/src/main/java/org/apache/mahout/math/MatrixWritable.java
* mrlegacy/src/main/java/org/apache/mahout/math/VectorWritable.java
* CHANGELOG


> Optimizations in matrix serialization
> -
>
> Key: MAHOUT-1578
> URL: https://issues.apache.org/jira/browse/MAHOUT-1578
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> MatrixWritable contains inefficient code in a few places:
>
>  * type and size are stored with every vector, although they are the same for
>    every vector
>  * in some places vectors are added to the matrix via assign() where we could
>    directly use the instance
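Schematically, the first point amounts to hoisting the shared header out of the per-row loop. An illustrative sketch in plain DataOutput terms (not MatrixWritable's actual wire format):

import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

final class HomogeneousRowsWriter {

  static void write(DataOutput out, double[][] rows) throws IOException {
    out.writeInt(rows.length);     // number of row vectors
    out.writeInt(rows[0].length);  // shared size, written once for the whole matrix
    out.writeByte(0);              // shared vector-type tag, also written once
    for (double[] row : rows) {
      for (double value : row) {
        out.writeDouble(value);    // per-row payload only; no repeated type/size header
      }
    }
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    write(new DataOutputStream(buffer), new double[][] {{1, 2}, {3, 4}});
    System.out.println("bytes written: " + buffer.size());
  }
}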





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029464#comment-14029464
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13714291
  
--- Diff: spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
+                    maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+    implicit val distributedContext = drmARaw.context
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
--- End diff --

colCounts, or whatever we call it, is just as efficient, is distributed, and
tells the reader what the important value is.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029446#comment-14029446
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13713968
  
--- Diff: spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
+                    maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+    implicit val distributedContext = drmARaw.context
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
+
+    // Compute co-occurrence matrix A'A
+    val drmAtA = drmA.t %*% drmA
+
+    // Compute loglikelihood scores and sparsify the resulting matrix to get the indicator matrix
+    val drmIndicatorsAtA = computeIndicators(drmAtA, numUsers, maxInterestingItemsPerThing, bcastInteractionsPerItemA,
+      bcastInteractionsPerItemA, crossCooccurrence = false)
+
+    var indicatorMatrices = List(drmIndicatorsAtA)
+
+    // Now look at cross-co-occurrences
+    for (drmBRaw <- drmBs) {
+      // Down-sample and pin other interaction matrix
+      val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, maxNumInteractions).checkpoint()
+
+      // Compute & broadcast the number of interactions per thing in B
+      val bcastInteractionsPerThingB = drmBroadcast(drmB.colCounts)
--- End diff --

drmB is already binary here, so we could use colSums
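For a binary matrix the two coincide: summing a 0/1 column counts its non-zeros. A small in-core sketch with the scalabindings ops (the matrix values are illustrative):

import org.apache.mahout.math.scalabindings._
import RLikeOps._

val b = dense((1, 0, 1), (1, 1, 0)) // already binarized interactions
val counts = b.colSums()            // (2.0, 1.0, 1.0): per-column non-zero counts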


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422.

[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029445#comment-14029445
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13713951
  
--- Diff: spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
+                    maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+    implicit val distributedContext = drmARaw.context
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
--- End diff --

drmA is already binary here, so we could use colSums


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029449#comment-14029449
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13714024
  
--- Diff: spark/src/main/scala/org/apache/mahout/cf/CooccurrenceAnalysis.scala ---
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.mahout.cf
+
+import org.apache.mahout.math._
+import scalabindings._
+import RLikeOps._
+import drm._
+import RLikeDrmOps._
+import org.apache.mahout.sparkbindings._
+import scala.collection.JavaConversions._
+import org.apache.mahout.math.stats.LogLikelihood
+import collection._
+import org.apache.mahout.common.RandomUtils
+import org.apache.mahout.math.function.{VectorFunction, Functions}
+
+
+/**
+ * based on "Ted Dunning & Ellen Friedman: Practical Machine Learning, Innovations in Recommendation",
+ * available at http://www.mapr.com/practical-machine-learning
+ *
+ * see also "Sebastian Schelter, Christoph Boden, Volker Markl:
+ * Scalable Similarity-Based Neighborhood Methods with MapReduce
+ * ACM Conference on Recommender Systems 2012"
+ */
+object CooccurrenceAnalysis extends Serializable {
+
+  /** Compares (Int,Double) pairs by the second value */
+  private val orderByScore = Ordering.fromLessThan[(Int, Double)] { case ((_, score1), (_, score2)) => score1 > score2}
+
+  def cooccurrences(drmARaw: DrmLike[Int], randomSeed: Int = 0xdeadbeef, maxInterestingItemsPerThing: Int = 50,
+                    maxNumInteractions: Int = 500, drmBs: Array[DrmLike[Int]] = Array()): List[DrmLike[Int]] = {
+
+    implicit val distributedContext = drmARaw.context
+
+    // Apply selective downsampling, pin resulting matrix
+    val drmA = sampleDownAndBinarize(drmARaw, randomSeed, maxNumInteractions)
+
+    // num users, which equals the maximum number of interactions per item
+    val numUsers = drmA.nrow.toInt
+
+    // Compute & broadcast the number of interactions per thing in A
+    val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)
+
+    // Compute co-occurrence matrix A'A
+    val drmAtA = drmA.t %*% drmA
+
+    // Compute loglikelihood scores and sparsify the resulting matrix to get the indicator matrix
+    val drmIndicatorsAtA = computeIndicators(drmAtA, numUsers, maxInterestingItemsPerThing, bcastInteractionsPerItemA,
+      bcastInteractionsPerItemA, crossCooccurrence = false)
+
+    var indicatorMatrices = List(drmIndicatorsAtA)
+
+    // Now look at cross-co-occurrences
+    for (drmBRaw <- drmBs) {
+      // Down-sample and pin other interaction matrix
+      val drmB = sampleDownAndBinarize(drmBRaw, randomSeed, maxNumInteractions).checkpoint()
+
+      // Compute & broadcast the number of interactions per thing in B
+      val bcastInteractionsPerThingB = drmBroadcast(drmB.colCounts)
+
+      // Compute cross-co-occurrence matrix B'A
+      val drmBtA = drmB.t %*% drmA
+
+      val drmIndicatorsBtA = computeIndicators(drmBtA, numUsers, maxInterestingItemsPerThing,
+        bcastInteractionsPerThingB, bcastInteractionsPerItemA)
+
+      indicatorMatrices = indicatorMatrices :+ drmIndicatorsBtA
+
+      drmB.uncache()
+    }
+
+    // Unpin downsampled interaction matrix
+    drmA.uncache()
+
+    // Return list of indicator matrices
+    indicatorMatrices
+  }
+
+  /**
+   * Compute loglikelihood ratio
+   * see http://tdunning.blogspot.de/2008/03/surprise-and-coincidence.html for details
+   **/
+  def loglikelihoodRatio(numInteractionsWithA: Long, numInteractionsWithB: Long,
+                         numInteractionsWithAandB: Long, numInteractions: Long) = {

[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029441#comment-14029441
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user sscdotopen commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45921381
  
I think the name _colCounts_ is misleading; we should stick to something like
numNonZeroElementsPerColumn or so, though I'm not sure here.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Pat Ferrel
Not sure how this relates to the PR.

If you look here you can see all the PR files and diffs from master. Comments 
can be attached to the files in question.
https://github.com/apache/mahout/pull/12/files

iterateNonZero is not in question afaik, and is used in a couple places. If 
someone wants to write an alternative I’ll be happy to change things.

On Jun 12, 2014, at 10:06 AM, Sebastian Schelter  wrote:

Ok, but the current implementation still gives the correct number, as it checks 
for accidental zeros.

I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.

--sebastian

On 06/12/2014 07:00 PM, Ted Dunning wrote:
> The reason is that sparse implementations may have recorded a non-zero that
> later got assigned a zero, but they didn't bother to remove the memory cell.
> 
> 
> 
> 
> On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:
> 
>> I'm a bit lost in this discussion. Why do we assume that
>> getNumNonZeroElements() on a Vector only returns an upper bound? The code
>> in AbstractVector clearly returns the non-zeros only:
>> 
>> int count = 0;
>> Iterator<Element> it = iterateNonZero();
>> while (it.hasNext()) {
>>   if (it.next().get() != 0.0) {
>>     count++;
>>   }
>> }
>> return count;
>> 
>> On the other hand, the internal code seems broken here: why does
>> iterateNonZero potentially return 0's?
>> 
>> --sebastian
>> 
>> 
>> 
>> 
>> 
>> 
>> On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
>> 
>>> 
>>>  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=
>>> com.atlassian.jira.plugin.system.issuetabpanels:comment-
>>> tabpanel&focusedCommentId=14029345#comment-14029345 ]
>>> 
>>> ASF GitHub Bot commented on MAHOUT-1464:
>>> 
>>> 
>>> Github user dlyubimov commented on the pull request:
>>> 
>>>  https://github.com/apache/mahout/pull/12#issuecomment-45915940
>>> 
>>>  fix header to say MAHOUT-1464, then hit close and reopen, it will
>>> restart the echo.
>>> 
>>> 
>>>  Cooccurrence Analysis on Spark
 --
 
  Key: MAHOUT-1464
  URL: https://issues.apache.org/jira/browse/MAHOUT-1464
  Project: Mahout
   Issue Type: Improvement
   Components: Collaborative Filtering
  Environment: hadoop, spark
 Reporter: Pat Ferrel
 Assignee: Pat Ferrel
  Fix For: 1.0
 
  Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
 MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
 run-spark-xrsj.sh
 
 
 Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
 that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
 a DRM can be used as input.
 Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
 has several applications including cross-action recommendations.
 
>>> 
>>> 
>>> 
>>> --
>>> This message was sent by Atlassian JIRA
>>> (v6.2#6252)
>>> 
>>> 
>> 
> 




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Ted Dunning
The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.
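A small probe of that behavior (hedged: whether a stored cell actually survives the re-assignment depends on the Vector implementation; the probe only reports what it finds):

import java.util.Iterator;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class StoredZeroProbe {
  public static void main(String[] args) {
    Vector v = new RandomAccessSparseVector(10);
    v.setQuick(2, 5.0);
    v.setQuick(2, 0.0); // the memory cell for index 2 may survive as a stored zero
    int storedZeros = 0;
    for (Iterator<Vector.Element> it = v.iterateNonZero(); it.hasNext();) {
      if (it.next().get() == 0.0) {
        storedZeros++;
      }
    }
    System.out.println("zeros seen by iterateNonZero(): " + storedZeros);
    System.out.println("getNumNonZeroElements(): " + v.getNumNonZeroElements());
  }
}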




On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:

> I'm a bit lost in this discussion. Why do we assume that
> getNumNonZeroElements() on a Vector only returns an upper bound? The code
> in AbstractVector clearly returns the non-zeros only:
>
> int count = 0;
> Iterator<Element> it = iterateNonZero();
> while (it.hasNext()) {
>   if (it.next().get() != 0.0) {
>     count++;
>   }
> }
> return count;
>
> On the other hand, the internal code seems broken here: why does
> iterateNonZero potentially return 0's?
>
> --sebastian
>
>
>
>
>
>
> On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
>
>>
>>  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=
>> com.atlassian.jira.plugin.system.issuetabpanels:comment-
>> tabpanel&focusedCommentId=14029345#comment-14029345 ]
>>
>> ASF GitHub Bot commented on MAHOUT-1464:
>> 
>>
>> Github user dlyubimov commented on the pull request:
>>
>>  https://github.com/apache/mahout/pull/12#issuecomment-45915940
>>
>>  fix header to say MAHOUT-1464, then hit close and reopen, it will
>> restart the echo.
>>
>>
>>  Cooccurrence Analysis on Spark
>>> --
>>>
>>>  Key: MAHOUT-1464
>>>  URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>>>  Project: Mahout
>>>   Issue Type: Improvement
>>>   Components: Collaborative Filtering
>>>  Environment: hadoop, spark
>>> Reporter: Pat Ferrel
>>> Assignee: Pat Ferrel
>>>  Fix For: 1.0
>>>
>>>  Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
>>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
>>> run-spark-xrsj.sh
>>>
>>>
>>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
>>> that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
>>> a DRM can be used as input.
>>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
>>> has several applications including cross-action recommendations.
>>>
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.2#6252)
>>
>>
>


Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
Ok, but the current implementation still gives the correct number, as it 
checks for accidental zeros.


I think we should add some custom implementations here to not have to go 
through the non-zeroes iterator.


--sebastian

On 06/12/2014 07:00 PM, Ted Dunning wrote:

The reason is that sparse implementations may have recorded a non-zero that
later got assigned a zero, but they didn't bother to remove the memory cell.




On Thu, Jun 12, 2014 at 9:50 AM, Sebastian Schelter  wrote:


I'm a bit lost in this discussion. Why do we assume that
getNumNonZeroElements() on a Vector only returns an upper bound? The code
in AbstractVector clearly returns the non-zeros only:

int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here: why does
iterateNonZero potentially return 0's?

--sebastian






On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1464?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanel&focusedCommentId=14029345#comment-14029345 ]

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

  https://github.com/apache/mahout/pull/12#issuecomment-45915940

  fix header to say MAHOUT-1464, then hit close and reopen, it will
restart the echo.


  Cooccurrence Analysis on Spark

--

  Key: MAHOUT-1464
  URL: https://issues.apache.org/jira/browse/MAHOUT-1464
  Project: Mahout
   Issue Type: Improvement
   Components: Collaborative Filtering
  Environment: hadoop, spark
 Reporter: Pat Ferrel
 Assignee: Pat Ferrel
  Fix For: 1.0

  Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch,
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch,
run-spark-xrsj.sh


Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR)
that runs on Spark. This should be compatible with Mahout Spark DRM DSL so
a DRM can be used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence
has several applications including cross-action recommendations.















[jira] [Commented] (MAHOUT-1578) Optimizations in matrix serialization

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029414#comment-14029414
 ] 

ASF GitHub Bot commented on MAHOUT-1578:


Github user asfgit closed the pull request at:

https://github.com/apache/mahout/pull/16


> Optimizations in matrix serialization
> -
>
> Key: MAHOUT-1578
> URL: https://issues.apache.org/jira/browse/MAHOUT-1578
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Reporter: Sebastian Schelter
> Fix For: 1.0
>
>
> MatrixWritable contains inefficient code in a few places:
>
>  * type and size are stored with every vector, although they are the same for
>    every vector
>  * in some places vectors are added to the matrix via assign() where we could
>    directly use the instance





Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Pat Ferrel
The SparkEngine colCounts function was checking for >= 0, but because it was
iterating over non-zeros it never saw an == 0, so the bug never surfaced. It's
already been fixed.

The primary question at present is: what should we call colCounts? Currently it 
is used in cooccurrence:

val bcastInteractionsPerItemA = drmBroadcast(drmA.colCounts)

Dmitriy wanted you to see if this fits R-like semantics and suggest an
alternative, if possible. I was commenting on the possible Java-related naming,
so ignore any misstatements.

On Jun 12, 2014, at 9:50 AM, Sebastian Schelter  wrote:

I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The code in 
AbstractVector clearly returns the non-zeros only:

int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here: why does
iterateNonZero potentially return 0's?

--sebastian





On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:
> 
> [ 
> https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
>  ]
> 
> ASF GitHub Bot commented on MAHOUT-1464:
> 
> 
> Github user dlyubimov commented on the pull request:
> 
> https://github.com/apache/mahout/pull/12#issuecomment-45915940
> 
> fix header to say MAHOUT-1464, then hit close and reopen, it will restart 
> the echo.
> 
> 
>> Cooccurrence Analysis on Spark
>> --
>> 
>> Key: MAHOUT-1464
>> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>> Project: Mahout
>>  Issue Type: Improvement
>>  Components: Collaborative Filtering
>> Environment: hadoop, spark
>>Reporter: Pat Ferrel
>>Assignee: Pat Ferrel
>> Fix For: 1.0
>> 
>> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, 
>> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
>> run-spark-xrsj.sh
>> 
>> 
>> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
>> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
>> can be used as input.
>> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
>> several applications including cross-action recommendations.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.2#6252)
> 




Re: [jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread Sebastian Schelter
I'm a bit lost in this discussion. Why do we assume that 
getNumNonZeroElements() on a Vector only returns an upper bound? The 
code in AbstractVector clearly returns the non-zeros only:


int count = 0;
Iterator<Element> it = iterateNonZero();
while (it.hasNext()) {
  if (it.next().get() != 0.0) {
    count++;
  }
}
return count;

On the other hand, the internal code seems broken here: why does
iterateNonZero potentially return 0's?


--sebastian





On 06/12/2014 06:38 PM, ASF GitHub Bot (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
 ]

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

 https://github.com/apache/mahout/pull/12#issuecomment-45915940

 fix header to say MAHOUT-1464, then hit close and reopen, it will restart 
the echo.



Cooccurrence Analysis on Spark
--

 Key: MAHOUT-1464
 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
 Environment: hadoop, spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0

 Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh


Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs 
on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM can be 
used as input.
Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
several applications including cross-action recommendations.









[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029374#comment-14029374
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45917234
  
I already fixed the header.

I agree with Ted; that's kinda what functional programming is for. The reason I 
didn't use the Java aggregate is that it isn't distributed. Still, that's probably 
beyond this ticket. I'll refactor if a Scala journeyman wants to provide a 
general mechanism. I'm still on training wheels.

This still needs to be tested in a distributed Spark+HDFS environment, and 
MAHOUT-1561 will make that testing easy. I'd be happy to merge this and move on, 
which will have the side effect of testing larger datasets and clusters.

If someone wants to test this now on a Spark+HDFS cluster, please do!


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029345#comment-14029345
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45915940
  
Fix the header to say MAHOUT-1464, then hit close and reopen; it will restart 
the echo.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029339#comment-14029339
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13711381
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,4 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
+  private def vectorCountFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  }
+
 }
--- End diff --

It looks like it, to me. I don't have time to look in depth, but the distributed 
code definitely counts positives, with an explicit inline conditional > 0.


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029344#comment-14029344
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user dlyubimov commented on a diff in the pull request:

https://github.com/apache/mahout/pull/12#discussion_r13711414
  
--- Diff: math-scala/src/main/scala/org/apache/mahout/math/scalabindings/MatrixOps.scala ---
@@ -188,4 +188,8 @@ object MatrixOps {
     def apply(f: Vector): Double = f.sum
   }
 
+  private def vectorCountFunc = new VectorFunction {
+    def apply(f: Vector): Double = f.aggregate(Functions.PLUS, Functions.greater(0))
+  }
+
 }
--- End diff --

It is very easy to tweak the tests to check, though, if in doubt.
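
For example, a tweaked check along these lines would settle it (a sketch; a 
matrix with a negative entry separates what Functions.greater(0) counts from 
a true non-zero count):

import org.apache.mahout.math.function.Functions
import org.apache.mahout.math.scalabindings._

val m = dense((1.0, 0.0, -2.0), (0.0, 3.0, 4.0))

val viaAggregate = (0 until m.numCols()).map(
  c => m.viewColumn(c).aggregate(Functions.PLUS, Functions.greater(0)))
val trueCounts = (0 until m.numCols()).map(
  c => m.viewColumn(c).getNumNonZeroElements.toDouble)

println(viaAggregate)  // Vector(1.0, 1.0, 1.0) -- the -2.0 maps to 0
println(trueCounts)    // Vector(1.0, 1.0, 2.0)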


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029315#comment-14029315
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user tdunning commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45913357
  
This discussion isn't getting echoed to the mailing list.  I didn't even 
know it was happening.

I think that a non-zero counter is nice, but it would be better to have a 
more general aggregator. We already have two instances of this pattern, and 
there will be more (sum of the absolute values is common).

Why not implement a general aggregator? This is different from our current 
aggregateColumns, because that function is not parallelizable.

Something like def columnAggregator(combiner, mapper) is what I am aiming 
for. A positive counter would be m.columnAggregator(_ + _, _ > 0), as 
sketched below.
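
As a sketch of that shape (illustrative only: columnAggregator is not existing 
API, and the mapper has to return Double, so `_ > 0` becomes an explicit 
1.0/0.0; a real version would be expressed over a DRM so each partition can 
pre-aggregate before combining):

import scala.collection.JavaConverters._
import org.apache.mahout.math.{DenseVector, Matrix, Vector}

def columnAggregator(m: Matrix)(combine: (Double, Double) => Double,
                                mapper: Double => Double): Vector = {
  val out = new DenseVector(m.numCols())
  for (c <- 0 until m.numCols()) {
    // map every cell of the column, then fold the results with the combiner
    val mapped = m.viewColumn(c).all().asScala.map(e => mapper(e.get()))
    out.setQuick(c, mapped.reduceOption(combine).getOrElse(0.0))
  }
  out
}

// positive counter per column:
// val counts = columnAggregator(m)((a, b) => a + b, x => if (x > 0) 1.0 else 0.0)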

 


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1464) Cooccurrence Analysis on Spark

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029310#comment-14029310
 ] 

ASF GitHub Bot commented on MAHOUT-1464:


Github user pferrel commented on the pull request:

https://github.com/apache/mahout/pull/12#issuecomment-45912084
  
Awaiting Sebastian's take on the naming of 'colCounts', to better fit the 
R-like semantics.
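
For the R-like fit, a quick sketch (assuming the existing colSums/colMeans 
helpers in the scalabindings DSL; the last line is the proposed addition, 
under whatever name it settles on):

import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._

val m = dense((1.0, 0.0), (0.0, 3.0))
m.colSums()   // existing R-like helpers
m.colMeans()
// proposed: per-column non-zero counts, cf. R's colSums(m != 0)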


> Cooccurrence Analysis on Spark
> --
>
> Key: MAHOUT-1464
> URL: https://issues.apache.org/jira/browse/MAHOUT-1464
> Project: Mahout
>  Issue Type: Improvement
>  Components: Collaborative Filtering
> Environment: hadoop, spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0
>
> Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAHOUT-1573) More explicit parallelism adjustments in math-scala DRM apis; elements of automatic re-adjustments

2014-06-12 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029299#comment-14029299
 ] 

ASF GitHub Bot commented on MAHOUT-1573:


Github user tdunning commented on the pull request:

https://github.com/apache/mahout/pull/13#issuecomment-45910978
  
Could we have some examples?


> More explicit parallelism adjustments in math-scala DRM apis; elements of 
> automatic re-adjustments
> --
>
> Key: MAHOUT-1573
> URL: https://issues.apache.org/jira/browse/MAHOUT-1573
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.9
>Reporter: Dmitriy Lyubimov
>Assignee: Dmitriy Lyubimov
> Fix For: 1.0
>
>
> (1) add a minSplit parameter pass-through to drmFromHDFS, to be able to 
> explicitly increase parallelism. 
> (2) add a parallelism readjustment parameter to the checkpoint() call. This 
> implies a shuffle-less coalesce() translation applied to the data set before 
> it is requested to be cached (if specified).
> Going forward, we probably should try to figure out how we can automate it, at 
> least a little bit. For example, the simplest automatic adjustment might 
> include re-adjusting parallelism on load to simply fit the cluster size (95% or 
> 180% of cluster size, for example), with some rule-of-thumb safeguards here, 
> e.g. we cannot exceed a factor of, say, 8 (or whatever we configure) in 
> splitting each original hdfs split. We should be able to get reasonable 
> parallelism performance out of the box on simple heuristics like that.
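
For illustration, a sketch of how (1) and (2) might read in the DSL (the 
minSplits and parallelism parameters below are this ticket's proposals, not 
committed signatures; assumes an implicit DistributedContext in scope):

import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// (1) hypothetical minSplit pass-through at load time, raising parallelism up front
val drmA = drmFromHDFS("hdfs://namenode:8020/data/A.drm", minSplits = 96)

// (2) hypothetical parallelism readjustment at checkpoint time; a lower degree
// would translate to a shuffle-less coalesce() before the result is cached
val drmAtA = (drmA.t %*% drmA).checkpoint(parallelism = 48)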



--
This message was sent by Atlassian JIRA
(v6.2#6252)