[jira] [Created] (MAHOUT-2048) There are duplicate content pages which need redirects instead

2018-06-27 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2048:
--

 Summary: There are duplicate content pages which need redirects 
instead
 Key: MAHOUT-2048
 URL: https://issues.apache.org/jira/browse/MAHOUT-2048
 Project: Mahout
  Issue Type: Planned Work
  Components: website
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Andrew Musselman
 Fix For: 0.13.0


I have duplicated content in 3 places in the `website/` directory. We need to 
have one place for the real content and replace the dups with redirects to the 
actual content. This looks like it may be true for several other pages, and 
honestly I'm not sure if they are all needed, but there are many links out in 
the wild that point to the old path for the CCO recommender pages, so we should 
do this for the ones below at least. Better yet, we may want to clean out any 
other dups unless someone knows a reason not to.



TLDR;

Actual content:

mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/docs/latest/algorithms/recommenders/cco.md

 

Dups to be replaced with redirects to the above content. I vaguely remember all 
these different site structures so there may be links to them in the wild.


mahout/website/recommender-overview.md => 
mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/users/algorithms/intro-cooccurrence-spark.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md

mahout/website/users/recommender/quickstart.md => 
mahout/website/docs/latest/algorithms/recommenders/index.md

mahout/website/users/recommender/intro-cooccurrence-spark.md => 
mahout/website/docs/latest/algorithms/recommenders/cco.md



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MAHOUT-2023) Drivers broken, scopt classes not found

2017-10-05 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2023:
--

 Summary: Drivers broken, scopt classes not found
 Key: MAHOUT-2023
 URL: https://issues.apache.org/jira/browse/MAHOUT-2023
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.13.1
 Environment: any
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Priority: Blocker
 Fix For: 0.13.1


Type `mahout spark-itemsimilarity` after Mahout is installed properly and you 
get a fatal exception due to missing scopt classes.

Probably a build issue related to an incorrect version of scopt being looked for.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MAHOUT-2020) Maven repo structure compatibility with SBT

2017-10-03 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2020:
--

 Summary: Maven repo structure compatibility with SBT
 Key: MAHOUT-2020
 URL: https://issues.apache.org/jira/browse/MAHOUT-2020
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.13.1
 Environment: Creating a project from Maven-built Mahout using sbt. 
Marked critical since it seems to block using Mahout with sbt; at least I have 
found no way to do it.
Reporter: Pat Ferrel
Assignee: Trevor Grant
Priority: Critical
 Fix For: 0.13.1


The maven repo should build:
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar

Substitute the Spark version for -2.1, so -1.6, etc.

The build.sbt `libraryDependencies` line then will be:
`"org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT"`

This is parsed by sbt to yield the path:
org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
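
For illustration, a minimal build.sbt sketch of a consumer project (the Scala 
version and project settings here are assumptions, not part of the Mahout 
build) showing how sbt's %% operator appends the Scala binary version before 
resolving the artifact:

{code}
// Hypothetical consumer project's build.sbt; only the Mahout coordinates
// come from this issue, everything else is an illustrative assumption.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // %% appends "_2.11", so sbt resolves mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar
  // under org/apache/mahout in the repo
  "org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT",
  // equivalent explicit form with the Scala suffix written out
  "org.apache.mahout" % "mahout-spark-2.1_2.11" % "0.13.1-SNAPSHOT"
)
{code}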

The outcome of `mvn clean install` currently is something like:
org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar

This has no effect on the package structure, only artifact naming and maven 
repo structure.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (MAHOUT-2019) SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized

2017-10-02 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-2019:
--

 Summary: SparseRowMatrix assign ops use for loops instead of 
iterateNonZero and so can be optimized
 Key: MAHOUT-2019
 URL: https://issues.apache.org/jira/browse/MAHOUT-2019
 Project: Mahout
  Issue Type: Bug
  Components: Math
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 0.13.1


DRMs get blockified into SparseRowMatrix instances if the density is low. But 
SRM inherits the implementation of methods like "assign" from AbstractMatrix, 
which uses nested for loops to traverse rows. For multiplying two matrices that 
are extremely sparse, the kind of data you see in collaborative filtering, this 
is extremely wasteful of execution time. It would be better to use a sparse 
vector's iterateNonZero iterator for some function types.
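
As a rough sketch of the optimization (assuming the org.apache.mahout.math API: 
Matrix.viewRow, Vector.iterateNonZero, Vector.Element; the helper name is 
hypothetical), an assign-like operation can visit only the stored cells, which 
is valid whenever the function maps 0 to 0:

{code}
import org.apache.mahout.math.Matrix

// Scale every stored (non-zero) cell in place; zeros stay zero, so the
// structural zeros never need to be touched.
def scaleNonZeros(m: Matrix, factor: Double): Matrix = {
  var row = 0
  while (row < m.numRows()) {
    val it = m.viewRow(row).iterateNonZero() // iterator over Vector.Element
    while (it.hasNext) {
      val e = it.next()
      e.set(e.get() * factor)
    }
    row += 1
  }
  m
}
{code}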



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark

2017-06-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1951:
---

The jar isn't supposed to have all deps, only the ones not provided by the 
environment. In fact it is supposed to have the minimum. 

So it appears some of the provided classes for previous platforms (Spark etc.) 
have changed in newer versions? We then need to add to the dependency-reduced 
jar, but first check whether a newer version of some provided dep will fill the 
bill; otherwise the dependency-reduced jar will bloat needlessly.

What specifically is the error? What is missing?

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MAHOUT-1988) scala 2.10 is hardcoded somewhere

2017-06-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037567#comment-16037567
 ] 

Pat Ferrel commented on MAHOUT-1988:


Don't have time to look now, but I believe scopt may hardcode 2.10. I know and 
use the 2.11 version and it is very little changed, so putting one in the Scala 
2.10 profile and another in the 2.11 profile should be simple, no?

>  scala 2.10 is hardcoded somewhere
> --
>
> Key: MAHOUT-1988
> URL: https://issues.apache.org/jira/browse/MAHOUT-1988
> Project: Mahout
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Andrew Palumbo
>Priority: Blocker
> Fix For: 0.13.1
>
>
> After building mahout against scala 2.11: 
> {code}
> mvn clean install -Dscala.version=2.11.4 -Dscala.compat.version=2.11 
> -Phadoop2  -DskipTests
> {code}
> ViennaCL jars are built hard-coded to scala 2.10.  This is currently blocking 
> the 0.13.1 release. 
> {code}
> mahout-h2o_2.11-0.13.1-SNAPSHOT.jar
> mahout-hdfs-0.13.1-SNAPSHOT.jar
> mahout-math-0.13.1-SNAPSHOT.jar
> mahout-math-scala_2.11-0.13.1-SNAPSHOT.jar
> mahout-mr-0.13.1-SNAPSHOT.jar
> mahout-native-cuda_2.10-0.13.0-SNAPSHOT.jar
> mahout-native-cuda_2.10-0.13.1-SNAPSHOT.jar
> mahout-native-viennacl_2.10-0.13.1-SNAPSHOT.jar
> mahout-native-viennacl-omp_2.10-0.13.1-SNAPSHOT.jar
> mahout-spark_2.11-0.13.1-SNAPSHOT-dependency-reduced.jar
> mahout-spark_2.11-0.13.1-SNAPSHOT.jar
> {code} 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903739#comment-15903739
 ] 

Pat Ferrel commented on MAHOUT-1951:


Oops, misnamed the commit message as MAHOUT-1950. The fix is in master, unit 
tested, and driver integration tested on remote Spark and HDFS.

Just removed line 83 in 
mahout/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala

Needs to be tested thoroughly since I have no idea of the ramifications of 
removing the line. See [~Andrew_Palumbo] who added it but can't recall the 
reason either.

Cross your fingers.

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-09 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1951.

Resolution: Fixed

Test thoroughly; not sure of the side effects of the fix.

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903327#comment-15903327
 ] 

Pat Ferrel commented on MAHOUT-1951:


[~Andrew_Palumbo] [~smarthi] There seems to be some question about who made the 
commit, but I'm sure it wasn't me. I have no idea what is causing this, as I 
said, and the only thing suspicious in it (the one before works, BTW) is the 
Mahout jars line change in the Spark module. The rest of the changes are in 
Flink, AFAICT.

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903320#comment-15903320
 ] 

Pat Ferrel commented on MAHOUT-1951:


A quick way to test this is:

1) Get Spark and HDFS running locally in pseudo-cluster mode.
2) Build the version of Mahout under test; I use simply "mvn clean install 
-DskipTests".
3) Run "hdfs dfs -rm -r test-results" to remove any old results.
4) Run the script below and look for exceptions in the output; they will look 
like the above errors.

#!/usr/bin/env bash
#begin script
mahout spark-itemsimilarity \
--input test.csv \
--output test-result \
--master spark://Maclaurin.local:7077 \
--filter1 purchase \
--filter2 view \
--itemIDColumn 2 \
--rowIDColumn 0 \
--filterColumn 1
#end-script

test.csv file for the script

u1,purchase,iphone
u1,purchase,ipad
u2,purchase,nexus
u2,purchase,galaxy
u3,purchase,surface
u4,purchase,iphone
u4,purchase,galaxy
u1,view,iphone
u1,view,ipad
u1,view,nexus
u1,view,galaxy
u2,view,iphone
u2,view,ipad
u2,view,nexus
u2,view,galaxy
u3,view,surface
u3,view,nexus
u4,view,iphone
u4,view,ipad


> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903315#comment-15903315
 ] 

Pat Ferrel commented on MAHOUT-1951:


Scratch that PR. We do not have a fix for this, but I have narrowed down the 
commit where it first starts to occur.

In Mahout 0.12.2 the drivers work with remote Spark.

They work in all commits until 
https://github.com/apache/mahout/commit/8e0e8b5572e0d24c1930ed60fec6d02693b41575 
which would say that something in this commit broke things. This is mainly 
Flink, but there is a change to how Mahout jars are packaged, and the error is 
shown below. The error wording is a bit mysterious; it seems to be missing 
MahoutKryoRegistrator but could also be from a class that cannot be serialized, 
really not sure.

Exception in thread "main" 17/03/06 18:15:04 INFO TaskSchedulerImpl: Removed 
TaskSet 0.0, whose tasks have all completed, from pool 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 
6, 192.168.0.6): java.io.IOException: org.apache.spark.SparkException: Failed 
to register classes with Kryo
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1212)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
at 
org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
at 
org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:258)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:215)
at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1205)
... 11 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:123)
... 17 more
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.

[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-06 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897636#comment-15897636
 ] 

Pat Ferrel commented on MAHOUT-1951:


fix being tested in https://github.com/apache/mahout/pull/292

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MAHOUT-1952) Allow pass-through of params for driver's CLI to spark-submit

2017-03-06 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1952:
--

 Summary: Allow pass-through of params for driver's CLI to 
spark-submit
 Key: MAHOUT-1952
 URL: https://issues.apache.org/jira/browse/MAHOUT-1952
 Project: Mahout
  Issue Type: New Feature
  Components: Classification, CLI, Collaborative Filtering
Affects Versions: 0.13.0
 Environment: CLI drivers launched from mahout script
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Priority: Minor
 Fix For: 0.13.1


Remove driver CLI args that are dups of what spark-submit can do, and allow 
pass-through of arbitrary extra CLI args to spark-submit using spark-submit's parsing.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-06 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1951:
---
Component/s: Collaborative Filtering
 Classification

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: Classification, CLI, Collaborative Filtering
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-06 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897625#comment-15897625
 ] 

Pat Ferrel commented on MAHOUT-1951:


[~rawkintrevo] added the use of spark-submit to the Mahout script for launching 
the drivers. This potentially has some side effects since much of the work of 
spark-submit was done in the drivers, and I am not sure if there is a way to 
pass params through to spark-submit. In other words the drivers may not permit 
unrecognized params on the command line. Therefore we will leave the drivers as 
they are, doing more work than they should, but mark this as deprecated and 
remove it in a future release.

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-06 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1951:
---

User found the following error running the spark-itemsimilarity driver (affects 
the NB driver too) on a remote Spark master:

17/03/03 10:08:40 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
reco-master): java.io.IOException: org.apache.spark.SparkException: Failed to 
register classes with Kryo
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1212)
at 
org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
...
Caused by: java.lang.ClassNotFoundException: 
org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at 
org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:123)

When I run the exact same command on the 0.12.2 release distribution against 
the same Spark cluster, the command completes successfully.

My Environment is:
* Ubuntu 14.04
* Oracle-JDK 1.8.0_121
* Spark standalone cluster using this distribution: 
http://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
* Mahout 0.13.0-RC: 
https://repository.apache.org/content/repositories/orgapachemahout-1034/org/apache/mahout/apache-mahout-distribution/0.13.0/apache-mahout-distribution-0.13.0.tar.gz


> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-06 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1951:
--

 Summary: Drivers don't run with remote Spark
 Key: MAHOUT-1951
 URL: https://issues.apache.org/jira/browse/MAHOUT-1951
 Project: Mahout
  Issue Type: Bug
  Components: CLI
Affects Versions: 0.13.0
 Environment: The command line drivers spark-itemsimilarity and 
spark-naivebayes using a remote or pseudo-clustered Spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Priority: Blocker
 Fix For: 0.13.0


Missing classes when running these jobs because the dependencies-reduced jar, 
passed to Spark for serialization purposes, does not contain all needed classes.

Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark

2017-03-06 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1951:
---
Sprint: Jan/Feb-2017

> Drivers don't run with remote Spark
> ---
>
> Key: MAHOUT-1951
> URL: https://issues.apache.org/jira/browse/MAHOUT-1951
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.13.0
> Environment: The command line drivers spark-itemsimilarity and 
> spark-naivebayes using a remote or pseudo-clustered Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Blocker
> Fix For: 0.13.0
>
>
> Missing classes when running these jobs because the dependencies-reduced jar, 
> passed to Spark for serialization purposes, does not contain all needed 
> classes.
> Found by a user. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MAHOUT-1940) Provide a Java API to SimilarityAnalysis and any other needed APIs

2017-02-13 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1940:
---
Description: We want to port the functionality from 
org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy integration 
with a java project we will be creating that derives a similarity measure from 
the co-occurrence and cross-occurrence matrix.   (was: We want to port the 
functionality from org.apache.mahout.math.cf.SimilarityAnalysis.scala to java 
for easy integration with a java project we will be creating that derives a 
similarity measure from the co-occurence matrix. )

> Provide a Java API to  SimilarityAnalysis and any other needed APIs
> ---
>
> Key: MAHOUT-1940
> URL: https://issues.apache.org/jira/browse/MAHOUT-1940
> Project: Mahout
>  Issue Type: New Feature
>  Components: Algorithms, cooccurrence
>Reporter: James Mackey
>
> We want to port the functionality from 
> org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy 
> integration with a java project we will be creating that derives a similarity 
> measure from the co-occurrence and cross-occurrence matrix. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MAHOUT-1940) Provide a Java API to SimilarityAnalysis and any other needed APIs

2017-02-13 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1940:
---
Summary: Provide a Java API to  SimilarityAnalysis and any other needed 
APIs  (was: Implementing similarity analysis using co-occurence matrix in java)

> Provide a Java API to  SimilarityAnalysis and any other needed APIs
> ---
>
> Key: MAHOUT-1940
> URL: https://issues.apache.org/jira/browse/MAHOUT-1940
> Project: Mahout
>  Issue Type: New Feature
>  Components: Algorithms, cooccurrence
>Reporter: James Mackey
>
> We want to port the functionality from 
> org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy 
> integration with a java project we will be creating that derives a similarity 
> measure from the co-occurence matrix. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1940) Implementing similarity analysis using co-occurence matrix in java

2017-02-12 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862862#comment-15862862
 ] 

Pat Ferrel commented on MAHOUT-1940:


This would be awesome! Let me know if you need help. There are some things that 
are no longer required. I just duplicated some methods to maintain backward 
compatibility, while adding new features.

I also implemented some new helper object `apply` functions, which are 
alternative constructors, outside of Mahout in the PredictionIO Universal 
Recommender Template; these will land when 0.5.1 of the Template is released, 
concurrent with PIO 0.11.0 and Mahout 0.13.0. The ones in the Template code are 
all you will need for porting the Template to Java.

To make SimilarityAnalysis complete and accepted into Mahout you'd probably 
need to port all of the SimilarityAnalysis class and IndexedDatasetSpark.
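
For readers less familiar with the Scala idiom mentioned above, here is a 
generic sketch of a companion-object `apply` used as an alternative constructor 
(the class and parsing are purely illustrative, not Mahout or Template code); a 
Java port would typically reproduce this with static factory methods:

{code}
// Illustrative only: `apply` on the companion object acts as an
// alternative constructor, so callers write Interactions(lines)
// instead of new Interactions(...).
class Interactions private (val pairs: Seq[(String, String)])

object Interactions {
  // Parse "user,action,item" lines into (user, item) pairs.
  def apply(lines: Seq[String]): Interactions =
    new Interactions(lines.map { line =>
      val Array(user, _, item) = line.split(",")
      (user, item)
    })
}
{code}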

> Implementing similarity analysis using co-occurence matrix in java
> --
>
> Key: MAHOUT-1940
> URL: https://issues.apache.org/jira/browse/MAHOUT-1940
> Project: Mahout
>  Issue Type: New Feature
>  Components: Algorithms, cooccurrence
>Reporter: James Mackey
>
> We want to port the functionality from 
> org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy 
> integration with a java project we will be creating that derives a similarity 
> measure from the co-occurence matrix. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MAHOUT-1904) Create a test harness to test mahout across different hardware configurations

2017-01-14 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822907#comment-15822907
 ] 

Pat Ferrel commented on MAHOUT-1904:


Did you have in mind a CLI tool or unit test? I assume the former since this 
should be runnable on various clusters and configs? Is this meant to be a 
benchmark?

Seems like maybe an example rather than a CLI in Mahout itself.

Not sure I have enough time for 0.13.0 and no, I don't have a good test harness. 
If we move this out to the next release I'd be interested in doing it, since 
recently I've become more interested in performance.

[~Andrew_Palumbo] could you supply some examples? We have at least one 
medium-sized dataset of 2 matrices (the dating site, can't recall the name) for 
the larger tests, but still they are small compared to real-world data.

> Create a test harness to test mahout across different hardware configurations
> -
>
> Key: MAHOUT-1904
> URL: https://issues.apache.org/jira/browse/MAHOUT-1904
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.14.0
>Reporter: Andrew Palumbo
>  Labels: test
> Fix For: 0.13.0
>
>
> Create a set of simple Scala programs to be run as a test harness for Linux 
> amd/intel, mac, and avx2(default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1904) Create a test harness to test mahout across different hardware configurations

2017-01-14 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1904:
---
Affects Version/s: (was: 0.12.2)
   0.14.0

> Create a test harness to test mahout across different hardware configurations
> -
>
> Key: MAHOUT-1904
> URL: https://issues.apache.org/jira/browse/MAHOUT-1904
> Project: Mahout
>  Issue Type: Task
>Affects Versions: 0.14.0
>Reporter: Andrew Palumbo
>  Labels: test
> Fix For: 0.13.0
>
>
> Create a set of simple Scala programs to be run as a test harness for Linux 
> amd/intel, mac, and avx2(default).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1786) Make classes implements Serializable for Spark 1.5+

2017-01-14 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1786:
---
Assignee: Andrew Palumbo  (was: Pat Ferrel)

Hmm, removing Kryo altogether is probably a good idea. I have never touched 
this code and do not maintain classes that need this. All my classes either use 
data that is in the above types or base Scala types that are serializable. 

I'm sending this back to [~Andrew_Palumbo] for reassignment or further 
discussion. 

If the new serializer is better than Kryo, by all means let's move there ASAP.

> Make classes implements Serializable for Spark 1.5+
> ---
>
> Key: MAHOUT-1786
> URL: https://issues.apache.org/jira/browse/MAHOUT-1786
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Michel Lemay
>Assignee: Andrew Palumbo
>Priority: Blocker
>  Labels: performance
> Fix For: 0.13.0
>
>
> Spark 1.5 comes with a new very efficient serializer that uses code 
> generation.  It is twice as fast as kryo.  When using mahout, we have to set 
> KryoSerializer because some classes aren't serializable otherwise.  
> I suggest to declare Math classes as "implements Serializable" where needed.  
> For instance, to use coocurence package in spark 1.5, we had to modify 
> AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it 
> work without Kryo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1882) SequentialAccessSparseVector iterateNonZero is incorrect.

2017-01-09 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812561#comment-15812561
 ] 

Pat Ferrel commented on MAHOUT-1882:


Can't see that I use this, at least not obviously unless it is hidden in 
another call. Can you try removing the method and see who complains?

> SequentialAccessSparseVector iterateNonZero is incorrect.
> --
>
> Key: MAHOUT-1882
> URL: https://issues.apache.org/jira/browse/MAHOUT-1882
> Project: Mahout
>  Issue Type: Bug
>  Components: Math
>Affects Versions: 0.12.2
>Reporter: Andrew Palumbo
>Assignee: Suneel Marthi
>Priority: Critical
> Fix For: 0.13.0
>
>
> In {{SequentialAccessSparseVector}} a bug is noted.  When counting non-zero 
> elements, {{NonDefaultIterator}} can, under certain circumstances, give an 
> incorrect iterator of size different from the actual non-zeroCounts.
> {code}
>  @Override
>   public Iterator iterateNonZero() {
> // TODO: this is a bug, since nonDefaultIterator doesn't hold to non-zero 
> contract.
> return new NonDefaultIterator();
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1786) Make classes implements Serializable for Spark 1.5+

2016-12-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761631#comment-15761631
 ] 

Pat Ferrel commented on MAHOUT-1786:


It sounds like we could remove Kryo altogether and improve performance by using 
the new Spark serializer. It also sounds like this uses the more standard 
approach of extending Serializable, which is built into many Scala classes IIRC.

Removing Kryo with a performance gain seems like a big win. Kryo causes many 
config problems for new users.
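
For context, this is roughly the per-application Kryo wiring Mahout on Spark 
needs today (a minimal sketch; the property keys are standard Spark settings 
and the registrator class is the one referenced in MAHOUT-1951, while the app 
name and master are placeholders). It is this boilerplate that a built-in 
serializer would let us drop:

{code}
import org.apache.spark.SparkConf

// Minimal sketch of the Kryo configuration a Mahout Spark application
// currently has to carry.
val conf = new SparkConf()
  .setAppName("mahout-example")            // placeholder
  .setMaster("local[*]")                   // placeholder
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator",
    "org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator")
{code}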

> Make classes implements Serializable for Spark 1.5+
> ---
>
> Key: MAHOUT-1786
> URL: https://issues.apache.org/jira/browse/MAHOUT-1786
> Project: Mahout
>  Issue Type: Improvement
>  Components: Math
>Affects Versions: 0.11.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: performance
>
> Spark 1.5 comes with a new very efficient serializer that uses code 
> generation.  It is twice as fast as kryo.  When using mahout, we have to set 
> KryoSerializer because some classes aren't serializable otherwise.  
> I suggest to declare Math classes as "implements Serializable" where needed.  
> For instance, to use coocurence package in spark 1.5, we had to modify 
> AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it 
> work without Kryo.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-10-16 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1853.

Resolution: Fixed

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO

2016-10-16 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1883.

Resolution: Fixed

Hmm, I thought these were auto-resolved with a commit that contains the issue 
name? Maybe I had a senior moment there :-)

> Create a type of IndexedDataset that filters unneeded data for CCO
> --
>
> Key: MAHOUT-1883
> URL: https://issues.apache.org/jira/browse/MAHOUT-1883
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> The collaborative filtering CCO algo uses drms for each "indicator" type. The 
> input must have the same set of user-id and so the row rank for all input 
> matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows only in 
> secondary matrices. This can lead to very large amounts of data processed in 
> the CCO pipeline that does not affect the results. Put another way if the row 
> doesn't exist in the primary matrix, there will be no cross-occurrence in the 
> other calculated cooccurrences matrix.
> if we are calculating P'P and P'S, S will not need rows that don't exist in P 
> so this Jira is to create an IndexedDataset companion object that takes an 
> RDD[(String, String)] of interactions but that uses the dictionary from P for 
> row-ids and filters out all data that doesn't correspond to P. The companion 
> object will create the row-ids dictionary if it is not passed in, and use it 
> to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this 
> technique. This could be handled outside of Mahout but always produces better 
> performance and so this version of data-prep seems worth including.
> It does not affect the CLI version yet but could be included there in a 
> future Jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO

2016-10-01 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1883:
---
Issue Type: New Feature  (was: Bug)

> Create a type of IndexedDataset that filters unneeded data for CCO
> --
>
> Key: MAHOUT-1883
> URL: https://issues.apache.org/jira/browse/MAHOUT-1883
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> The collaborative filtering CCO algo uses drms for each "indicator" type. The 
> input must have the same set of user-id and so the row rank for all input 
> matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows only in 
> secondary matrices. This can lead to very large amounts of data processed in 
> the CCO pipeline that does not affect the results. Put another way if the row 
> doesn't exist in the primary matrix, there will be no cross-occurrence in the 
> other calculated cooccurrences matrix.
> if we are calculating P'P and P'S, S will not need rows that don't exist in P 
> so this Jira is to create an IndexedDataset companion object that takes an 
> RDD[(String, String)] of interactions but that uses the dictionary from P for 
> row-ids and filters out all data that doesn't correspond to P. The companion 
> object will create the row-ids dictionary if it is not passed in, and use it 
> to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this 
> technique. This could be handled outside of Mahout but always produces better 
> performance and so this version of data-prep seems worth including.
> It does not affect the CLI version yet but could be included there in a 
> future Jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO

2016-10-01 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1883:
---
Sprint: Jan/Feb-2016

> Create a type of IndexedDataset that filters unneeded data for CCO
> --
>
> Key: MAHOUT-1883
> URL: https://issues.apache.org/jira/browse/MAHOUT-1883
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> The collaborative filtering CCO algo uses drms for each "indicator" type. The 
> input must have the same set of user-id and so the row rank for all input 
> matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows only in 
> secondary matrices. This can lead to very large amounts of data processed in 
> the CCO pipeline that does not affect the results. Put another way if the row 
> doesn't exist in the primary matrix, there will be no cross-occurrence in the 
> other calculated cooccurrences matrix
> if we are calculating P'P and P'S, S will not need rows that don't exist in P 
> so this Jira is to create an IndexedDataset companion object that takes an 
> RDD[(String, String)] of interactions but that uses the dictionary from P for 
> row-ids and filters out all data that doesn't correspond to P. The companion 
> object will create the row-ids dictionary if it is not passed in, and use it 
> to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this 
> technique. This could be handled outside of Mahout but always produces better 
> performance and so this version of data-prep seems worth including.
> It does not effect the CLI version yet but could be included there in a 
> future Jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO

2016-10-01 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1883:
---
Description: 
The collaborative filtering CCO algo uses drms for each "indicator" type. The 
input must have the same set of user-id and so the row rank for all input 
matrices must be the same.

In the past we have padded the row-id dictionary to include new rows only in 
secondary matrices. This can lead to very large amounts of data processed in 
the CCO pipeline that does not affect the results. Put another way if the row 
doesn't exist in the primary matrix, there will be no cross-occurrence in the 
other calculated cooccurrences matrix.

if we are calculating P'P and P'S, S will not need rows that don't exist in P 
so this Jira is to create an IndexedDataset companion object that takes an 
RDD[(String, String)] of interactions but that uses the dictionary from P for 
row-ids and filters out all data that doesn't correspond to P. The companion 
object will create the row-ids dictionary if it is not passed in, and use it to 
filter if it is passed in.

We have seen data that can be reduced by many orders of magnitude using this 
technique. This could be handled outside of Mahout but always produces better 
performance and so this version of data-prep seems worth including.

It does not affect the CLI version yet but could be included there in a future 
Jira.


  was:
The collaborative filtering CCO algo uses drms for each "indicator" type. The 
input must have the same set of user-id and so the row rank for all input 
matrices must be the same.

In the past we have padded the row-id dictionary to include new rows only in 
secondary matrices. This can lead to very large amounts of data processed in 
the CCO pipeline that does not affect the results. Put another way if the row 
doesn't exist in the primary matrix, there will be no cross-occurrence in the 
other calculated cooccurrences matrix

if we are calculating P'P and P'S, S will not need rows that don't exist in P 
so this Jira is to create an IndexedDataset companion object that takes an 
RDD[(String, String)] of interactions but that uses the dictionary from P for 
row-ids and filters out all data that doesn't correspond to P. The companion 
object will create the row-ids dictionary if it is not passed in, and use it to 
filter if it is passed in.

We have seen data that can be reduced by many orders of magnitude using this 
technique. This could be handled outside of Mahout but always produces better 
performance and so this version of data-prep seems worth including.

It does not effect the CLI version yet but could be included there in a future 
Jira.



> Create a type of IndexedDataset that filters unneeded data for CCO
> --
>
> Key: MAHOUT-1883
> URL: https://issues.apache.org/jira/browse/MAHOUT-1883
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering
>Affects Versions: 0.13.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> The collaborative filtering CCO algo uses drms for each "indicator" type. The 
> input must have the same set of user-id and so the row rank for all input 
> matrices must be the same.
> In the past we have padded the row-id dictionary to include new rows only in 
> secondary matrices. This can lead to very large amounts of data processed in 
> the CCO pipeline that does not affect the results. Put another way if the row 
> doesn't exist in the primary matrix, there will be no cross-occurrence in the 
> other calculated cooccurrences matrix.
> if we are calculating P'P and P'S, S will not need rows that don't exist in P 
> so this Jira is to create an IndexedDataset companion object that takes an 
> RDD[(String, String)] of interactions but that uses the dictionary from P for 
> row-ids and filters out all data that doesn't correspond to P. The companion 
> object will create the row-ids dictionary if it is not passed in, and use it 
> to filter if it is passed in.
> We have seen data that can be reduced by many orders of magnitude using this 
> technique. This could be handled outside of Mahout but always produces better 
> performance and so this version of data-prep seems worth including.
> It does not affect the CLI version yet but could be included there in a 
> future Jira.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1883) Create a type of IndexedDataset that filters unneeded data for CCO

2016-10-01 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1883:
--

 Summary: Create a type of IndexedDataset that filters unneeded 
data for CCO
 Key: MAHOUT-1883
 URL: https://issues.apache.org/jira/browse/MAHOUT-1883
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering
Affects Versions: 0.13.0
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 0.13.0


The collaborative filtering CCO algo uses drms for each "indicator" type. The 
input must have the same set of user-id and so the row rank for all input 
matrices must be the same.

In the past we have padded the row-id dictionary to include new rows only in 
secondary matrices. This can lead to very large amounts of data processed in 
the CCO pipeline that does not affect the results. Put another way if the row 
doesn't exist in the primary matrix, there will be no cross-occurrence in the 
other calculated cooccurrences matrix

if we are calculating P'P and P'S, S will not need rows that don't exist in P 
so this Jira is to create an IndexedDataset companion object that takes an 
RDD[(String, String)] of interactions but that uses the dictionary from P for 
row-ids and filters out all data that doesn't correspond to P. The companion 
object will create the row-ids dictionary if it is not passed in, and use it to 
filter if it is passed in.

We have seen data that can be reduced by many orders of magnitude using this 
technique. This could be handled outside of Mahout but always produces better 
performance and so this version of data-prep seems worth including.

It does not effect the CLI version yet but could be included there in a future 
Jira.
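
A minimal sketch of the filtering idea (function and parameter names are 
illustrative assumptions, not the proposed Mahout API): drop every secondary 
interaction whose row id is absent from P's row-id dictionary, since such rows 
cannot produce cross-occurrences in P'S.

{code}
import org.apache.spark.rdd.RDD

// Illustrative helper; the real logic would live in the IndexedDataset
// companion object described above.
def filterToPrimaryRows(
    secondary: RDD[(String, String)],   // (rowID, itemID) pairs for S
    primaryRowIDs: Set[String]          // row-id dictionary built from P
): RDD[(String, String)] = {
  val rows = secondary.sparkContext.broadcast(primaryRowIDs)
  secondary.filter { case (rowID, _) => rows.value.contains(rowID) }
}
{code}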




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1878) implement quartile type thresholds for indicator matrix downsampling

2016-08-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429553#comment-15429553
 ] 

Pat Ferrel commented on MAHOUT-1878:


see discussion here
https://issues.apache.org/jira/browse/MAHOUT-1853

> implement quartile type thresholds for indicator matrix downsampling
> 
>
> Key: MAHOUT-1878
> URL: https://issues.apache.org/jira/browse/MAHOUT-1878
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering, cooccurrence
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> https://issues.apache.org/jira/browse/MAHOUT-1853
> second half of the above, see discussion of downsampling by fraction of 
> matrix retained, perhaps using t-digest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local

2016-08-20 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1679:
---
Comment: was deleted

(was: see discussion https://issues.apache.org/jira/browse/MAHOUT-1853)

> example script run-item-sim should work on hdfs as well as local
> 
>
> Key: MAHOUT-1679
> URL: https://issues.apache.org/jira/browse/MAHOUT-1679
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 0.10.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 1.0.0
>
>
> mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster 
> Spark + HDFS
> It prints a warning and how to run in cluster but should just work in either 
> mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local

2016-08-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429552#comment-15429552
 ] 

Pat Ferrel commented on MAHOUT-1679:


see discussion https://issues.apache.org/jira/browse/MAHOUT-1853

> example script run-item-sim should work on hdfs as well as local
> 
>
> Key: MAHOUT-1679
> URL: https://issues.apache.org/jira/browse/MAHOUT-1679
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 0.10.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 1.0.0
>
>
> mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster 
> Spark + HDFS
> It prints a warning and how to run in cluster but should just work in either 
> mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1878) implement quartile type thresholds for indicator matrix downsampling

2016-08-20 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1878:
--

 Summary: implement quartile type thresholds for indicator matrix 
downsampling
 Key: MAHOUT-1878
 URL: https://issues.apache.org/jira/browse/MAHOUT-1878
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering, cooccurrence
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 1.0.0


https://issues.apache.org/jira/browse/MAHOUT-1853

second half of the above, see discussion of downsampling by fraction of matrix 
retained, perhaps using t-digest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-08-20 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429550#comment-15429550
 ] 

Pat Ferrel commented on MAHOUT-1853:


OK, first part implemented. Not sure Ted's suggestion will get into this 
release, so I'm moving this Jira so as not to lose his comments. 

Finished the fixed threshold and number of indicators per item for every pair 
of matrices. So A'A can have an LLR threshold as well as a # per row that is 
different than A'B, and so forth.

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-08-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409595#comment-15409595
 ] 

Pat Ferrel commented on MAHOUT-1853:


Great, that's what I wanted to hear. Normal in principle, but something more 
tolerant of wonky distributions is worth trying, and in this case we'll avoid 
doing it every time by saving the threshold for future runs. 

Thanks

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-08-04 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408326#comment-15408326
 ] 

Pat Ferrel commented on MAHOUT-1853:


If t-digest is more tolerant of "not having enough data" than fitting the
params of a normal dist, then I'll do #1 and #2 now for 0.13. Then, for #3, I'll
integrate t-digest as a way to calculate the threshold for #2 in the next
phase. #3 would come in the release after that, which would give us time to
upgrade t-digest or cut it loose and treat it as a dependency; it's in the Maven
repos.

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-08-04 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408256#comment-15408256
 ] 

Pat Ferrel commented on MAHOUT-1853:


Is rootLLR normally distributed (the positive half)? If so we'd have to
calculate all rootLLR scores and fit the normal params to get the 10% or other
adaptive threshold, right?

I understand that O(n^2) never occurs in practice. Even for cases where O(k
k_max n) is high, intuition would say that this threshold could be calculated
once and applied for some time, since it will tend to stay the same for any
specific type of indicator. Calculating it may be a once-in-a-great-while
operation, and the threshold would usually be used in #2 above.

I'm somewhat ignorant of t-digest other than having read your anomaly detection
book. I think it's in Mahout but the docs are here:
https://github.com/tdunning/t-digest. I assume that using t-digest would remove
the need to do any separate distribution param fitting (as long as we use
rootLLR) and could even be applied as online learning, producing an adaptive
threshold to feed into #2 above? I imagine it can also be applied periodically
to P'X in batch.

No need to respond if I'm on the right track.
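
A rough sketch of that t-digest idea: stream the rootLLR scores through a digest
and read off the quantile that keeps a chosen fraction. The factory method name
below (TDigest.createMergingDigest) is an assumption about the t-digest version in
use; older releases expose TDigest.createDigest instead.

{code}
import com.tdunning.math.stats.TDigest

object AdaptiveLlrThreshold {
  // Derive an adaptive absolute threshold that keeps roughly the top
  // `keepFraction` of the observed rootLLR scores.
  def threshold(rootLlrScores: Iterator[Double], keepFraction: Double): Double = {
    val digest = TDigest.createMergingDigest(100.0) // compression ~100 is a common default
    rootLlrScores.foreach(s => digest.add(s))
    digest.quantile(1.0 - keepFraction) // keepFraction = 0.1 -> the 90th-percentile cutoff
  }
}
{code}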



> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-08-04 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126
 ] 

Pat Ferrel edited comment on MAHOUT-1853 at 8/4/16 4:15 PM:


To reword this issue...

The CCO analysis code currently employs only a single # of values per row of
the P'X matrices. This has proven to be an insufficient thresholding scheme for
many of the possible cross-occurrence types. For a user * item input matrix,
which becomes an item * item output, a fixed # per row is fine, but that cap is
fairly meaningless when the X matrix has only 20 columns. For instance, if X = C,
category preferences, there may be only 20 possible categories; with a threshold
of 100, and given that users often have enough usage to trigger preference events
on all categories (though resulting in small LLR values), the P'C matrix is
almost completely full. This reduces any value in P'C.

There are several ways to address this:
1) have a # of indicators-per-row threshold for every P'X matrix, not one for
all (the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence-of-correlation value (a % maybe) that is calculated from
the data by looking at the distribution in P'C or other. This is potentially
O(n^2) where n = number of items in the matrix. This may be practical to
calculate for some types of data since n may be very small.

#1 and #2 are extremely easy; #3 can actually be calculated after the fact
and used in #2 even if it is not included in Mahout.

I've started work on #1 and #2

[~ssc][~tdunning] I'm especially looking for comments on #3 above, calculating 
a % confidence of correlation. The function we use for LLR scoring is 
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L210
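
For context, the linked function scores a candidate pair by building a 2x2
contingency table from the co-occurrence counts and handing it to LogLikelihood in
mahout-math. A rough sketch of that step; the argument names here are illustrative,
not the actual parameter names:

{code}
import org.apache.mahout.math.stats.LogLikelihood

object LlrSketch {
  // k11 = co-occurrences of A and B, k12 = A without B,
  // k21 = B without A,               k22 = neither.
  def rootLlr(numAB: Long, numA: Long, numB: Long, numInteractions: Long): Double = {
    val k11 = numAB
    val k12 = numA - numAB
    val k21 = numB - numAB
    val k22 = numInteractions - numA - numB + numAB
    LogLikelihood.rootLogLikelihoodRatio(k11, k12, k21, k22)
  }
}
{code}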


was (Author: pferrel):
To reword this issue...

The CCO analysis code currently only employs a single # of values per row of 
the P’? matrices. This has proven an insufficient threshold for many of the 
possible cross-occurrence types. The problem is that for a user * item input 
matrix, which becomes an item * item output a fixed # per row is fine but the 
implementation is a bit meaningless when there are only 20 columns of the ? 
matrix. For instance if ? = C category preferences, there may be only 20 
possible categories and with a threshold of 100 and the fact that users often 
have enough usage to trigger preference events on all categories (though 
resulting in a small LLR value), the P’C matrix is almost completely full. This 
reduces any value in P’C.

There are several ways to address:
1) have a # of indicators per row threshold for every matrix, not one for all 
(the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from 
the data by looking at the distribution in P’C or other. This is potentially 
O(n^2) where n = number of items in the matrix. This may be practical to 
calculate for some types of data since n may be very small.

1 and 2 are easy in the extreme, #3 can actually be calculated after the fact 
and used in #2 even if it is not included in Mahout.

starting work on #1 and #2

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-08-04 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1853:
---
Sprint: Jan/Feb-2016

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-07-24 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126
 ] 

Pat Ferrel commented on MAHOUT-1853:


To reword this issue...

The CCO analysis code currently only employs a single # of values per row of 
the P’? matrices. This has proven an insufficient threshold for many of the 
possible cross-occurrence types. The problem is that for a user * item input 
matrix, which becomes an item * item output a fixed # per row is fine but the 
implementation is a bit meaningless when there are only 20 columns of the ? 
matrix. For instance if ? = C category preferences, there may be only 20 
possible categories and with a threshold of 100 and the fact that users often 
have enough usage to trigger preference events on all categories (though 
resulting in a small LLR value), the P’C matrix is almost completely full. This 
reduces any value in P’C.

There are several ways to address:
1) have a # of indicators per row threshold for every matrix, not one for all 
(the current impl)
2) use a fixed LLR threshold value per matrix
3) use a confidence of correlation value (a % maybe) that is calculated from 
the data by looking at the distribution in P’C or other. This is potentially 
O(n^2) where n = number of items in the matrix. This may be practical to 
calculate for some types of data since n may be very small.

1 and 2 are easy in the extreme, #3 can actually be calculated after the fact 
and used in #2 even if it is not included in Mahout.

starting work on #1 and #2

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-05-26 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371
 ] 

Pat Ferrel edited comment on MAHOUT-1853 at 5/26/16 5:04 PM:
-

Steps:

1) allow an array of absolute LLR value thresholds, one for each matrix pair
2) allow thresholds to be expressed as a confidence of correlation (actually the
confidence that non-correlation is rejected) or as a fraction of total
cross-occurrences retained after downsampling. To reduce how often this must be
done, the absolute value thresholds should be output after calculation for later
re-use in #1

#1 is very easy but not all that useful on its own, since LLR values will vary
quite a bit. #1 also retains the O(n) computation complexity. I imagine #1 would
be used with #2, since #2 is much more computationally complex and can output
thresholds for #1.

#2 requires worst-case O(n^2) complexity. Some matrix pairs will have low
dimensionality in one direction or both. In fact this low dimensionality is the
reason we need a different kind of downsampling for these pairs. Imagine, for
example, A'A, which is items by items and may be very large but sparse, while
A'B may be products by gender, so only 2 columns but much denser.

The calculation for #2 would, I believe, require performing the un-downsampled
A'A, then determining the threshold from the LLR scores, then making another
pass to downsample. This will add significant computation time and could make
it impractical except for rare re-calculation tasks, in which case the absolute
threshold would be recorded and used for subsequent A'A and A'B runs via #1.

Since it is likely to be impractical to calculate #2 very often, it may be
better done as an analytics job rather than as part of the A'A job.

For most recommender cases the current downsampling method is fine, but for
other uses of the CCO algorithm #2 may be required for occasional threshold
re-calcs. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome
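
A sketch of that two-pass flow, with illustrative names only (not Mahout code): an
occasional, expensive analytics pass scans the un-downsampled scores once and
derives the absolute cutoff for a desired retained fraction, and the recorded
number is then reused as the cheap fixed per-pair threshold of #1 on every
subsequent run.

{code}
object ThresholdFromFraction {
  // Pass 1 (expensive, occasional): find the absolute LLR cutoff that retains
  // roughly `retainFraction` of all cross-occurrence scores for one matrix pair.
  def absoluteThreshold(allScores: Seq[Double], retainFraction: Double): Double = {
    val sorted = allScores.sorted(Ordering[Double].reverse)
    val keep = math.max(1, (sorted.size * retainFraction).toInt)
    sorted(keep - 1)
  }

  def main(args: Array[String]): Unit = {
    val scores = Seq(0.3, 42.0, 7.1, 19.5, 3.3, 88.2, 11.0, 0.9)
    val t = absoluteThreshold(scores, retainFraction = 0.25)
    // Pass 2 (cheap, every run): apply the recorded cutoff as the fixed threshold of #1.
    println(s"threshold = $t, kept = ${scores.filter(_ >= t)}")
  }
}
{code}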


was (Author: pferrel):
Steps:

1) allow an array of absolute LLR value thresholds, one for each matrix pair
2) allow thresholds to be a confidence of correlation (actually confidence that 
non-correlation is rejected) or fraction of total cross-occurrences are 
retained after downsampling. To reduce how often this must be done the absolute 
value thresholds should be output after calculation for later re-use in #1

#1 is very easy but not all that useful since LLR values will vary quite a bit. 
#1 also retains the O(n) computation complexity. I imagine #1 would be used 
with #2 since #2 is much more computationally complex and can output thresholds 
for #1.

#2 require worst-case O(n^2) complexity. Some matrix pairs will have low 
dimensionality in one direction or both. In fact this low dimensionality is the 
reason we need a different kind of downsampling for these pairs. Imagine a 
conversion A'A which is items by items and may be very large but sparse, then 
A'B may be products by gender, so a rank of 2 columns but much denser. 

The calculation for #2 would, I believe, require performing the un-downsampled 
A'A then determining the threshold from the LLR scores, then making another 
pass to downsample, this will add significant computation time and could make 
it impractical except for rare re-calculation tasks. In which case the absolute 
threshold would be recorded and used for subsequent A'A and A'B using #1.

Since it is likely to be impractical to calculate #2 very often it may be 
better done as an analytics job rather than part of the A'A job.

For most recommender cases the current downsampling method is fine but for 
other uses of the CCO algorithm #2 may be required for occasional threshold 
re-calc. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-05-26 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371
 ] 

Pat Ferrel edited comment on MAHOUT-1853 at 5/26/16 5:03 PM:
-

Steps:

1) allow an array of absolute LLR value thresholds, one for each matrix pair
2) allow thresholds to be a confidence of correlation (actually confidence that 
non-correlation is rejected) or fraction of total cross-occurrences are 
retained after downsampling. To reduce how often this must be done the absolute 
value thresholds should be output after calculation for later re-use in #1

#1 is very easy but not all that useful since LLR values will vary quite a bit. 
#1 also retains the O(n) computation complexity. I imagine #1 would be used 
with #2 since #2 is much more computationally complex and can output thresholds 
for #1.

#2 require worst-case O(n^2) complexity. Some matrix pairs will have low 
dimensionality in one direction or both. In fact this low dimensionality is the 
reason we need a different kind of downsampling for these pairs. Imagine a 
conversion A'A which is items by items and may be very large but sparse, then 
A'B may be products by gender, so a rank of 2 columns but much denser. 

The calculation for #2 would, I believe, require performing the un-downsampled 
A'A then determining the threshold from the LLR scores, then making another 
pass to downsample, this will add significant computation time and could make 
it impractical except for rare re-calculation tasks. In which case the absolute 
threshold would be recorded and used for subsequent A'A and A'B using #1.

Since it is likely to be impractical to calculate #2 very often it may be 
better done as an analytics job rather than part of the A'A job.

For most recommender cases the current downsampling method is fine but for 
other uses of the CCO algorithm #2 may be required for occasional threshold 
re-calc. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome


was (Author: pferrel):
Steps:

1) allow an array of absolute LLR value thresholds for each matrix pair
2) allow thresholds to be a confidence of correlation (actually confidence that 
non-correlation is rejected) or fraction of total cross-occurrences are 
retained after downsampling. To reduce how often this must be done the absolute 
value thresholds should be output after calculation for later re-use in #1

#1 is very easy but not all that useful since LLR values will vary quite a bit. 
#1 also retains the O(n) computation complexity. I imagine #1 would be used 
with #2 since #2 is much more computationally complex and can output thresholds 
for #1.

#2 require worst-case O(n^2) complexity. Some matrix pairs will have low 
dimensionality in one direction or both. In fact this low dimensionality is the 
reason we need a different kind of downsampling for these pairs. Imagine a 
conversion A'A which is items by items and may be very large but sparse, then 
A'B may be products by gender, so a rank of 2 columns but much denser. 

The calculation for #2 would, I believe, require performing the un-downsampled 
A'A then determining the threshold from the LLR scores, then making another 
pass to downsample, this will add significant computation time and could make 
it impractical except for rare re-calculation tasks. In which case the absolute 
threshold would be recorded and used for subsequent A'A and A'B using #1.

Since it is likely to be impractical to calculate #2 very often it may be 
better done as an analytics job rather than part of the A'A job.

For most recommender cases the current downsampling method is fine but for 
other uses of the CCO algorithm #2 may be required for occasional threshold 
re-calc. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)

2016-05-26 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371
 ] 

Pat Ferrel commented on MAHOUT-1853:


Steps:

1) allow an array of absolute LLR value thresholds for each matrix pair
2) allow thresholds to be a confidence of correlation (actually confidence that 
non-correlation is rejected) or fraction of total cross-occurrences are 
retained after downsampling. To reduce how often this must be done the absolute 
value thresholds should be output after calculation for later re-use in #1

#1 is very easy but not all that useful since LLR values will vary quite a bit. 
#1 also retains the O(n) computation complexity. I imagine #1 would be used 
with #2 since #2 is much more computationally complex and can output thresholds 
for #1.

#2 require worst-case O(n^2) complexity. Some matrix pairs will have low 
dimensionality in one direction or both. In fact this low dimensionality is the 
reason we need a different kind of downsampling for these pairs. Imagine a 
conversion A'A which is items by items and may be very large but sparse, then 
A'B may be products by gender, so a rank of 2 columns but much denser. 

The calculation for #2 would, I believe, require performing the un-downsampled 
A'A then determining the threshold from the LLR scores, then making another 
pass to downsample, this will add significant computation time and could make 
it impractical except for rare re-calculation tasks. In which case the absolute 
threshold would be recorded and used for subsequent A'A and A'B using #1.

Since it is likely to be impractical to calculate #2 very often it may be 
better done as an analytics job rather than part of the A'A job.

For most recommender cases the current downsampling method is fine but for 
other uses of the CCO algorithm #2 may be required for occasional threshold 
re-calc. In some sense we won't know until we try.

Any comments from [~tdunning] or [~dlyubimov] would be welcome

> Improvements to CCO (Correlated Cross-Occurrence)
> -
>
> Key: MAHOUT-1853
> URL: https://issues.apache.org/jira/browse/MAHOUT-1853
> Project: Mahout
>  Issue Type: New Feature
>Affects Versions: 0.12.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.13.0
>
>
> Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold 
> calculation for LLR downsampling, and possible multiple fixed thresholds for 
> A’A, A’B etc. This is to account for the vast difference in dimensionality 
> between indicator types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MAHOUT-1766) Increase default PermGen size for spark-shell

2016-03-20 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel reassigned MAHOUT-1766:
--

Assignee: Andrew Palumbo  (was: Pat Ferrel)

I don't use the shell much; is this legit, Andy?

> Increase default PermGen size for spark-shell
> -
>
> Key: MAHOUT-1766
> URL: https://issues.apache.org/jira/browse/MAHOUT-1766
> Project: Mahout
>  Issue Type: Improvement
>  Components: Mahout spark shell
>Affects Versions: 0.11.0
>Reporter: Sergey Tryuber
>Assignee: Andrew Palumbo
> Fix For: 0.12.0
>
>
> Mahout spark-shell runs with the default PermGen size (64MB). Given that it
> depends on lots of external jars and loads a very large number of Java
> classes, we constantly observe spontaneous PermGen OOM exceptions.
> A hot fix from our side is to modify envelope bash script (added 
> -XX:PermSize=512m):
> {code}
> "$JAVA" $JAVA_HEAP_MAX -XX:PermSize=512m $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> Of course, a more elegant solution is needed. After applying the fix, the
> errors were gone.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib

2016-03-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1689.

Resolution: Fixed

> Create a doc on how to write an app that uses Mahout as a lib
> -
>
> Key: MAHOUT-1689
> URL: https://issues.apache.org/jira/browse/MAHOUT-1689
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> Create a doc on how to write an app that uses Mahout as a lib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1799) Read null row vectors from file in TextDelimeterReaderWriter driver

2016-03-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1799:
---
Fix Version/s: (was: 0.12.0)
   1.0.0

> Read null row vectors from file in TextDelimeterReaderWriter driver
> ---
>
> Key: MAHOUT-1799
> URL: https://issues.apache.org/jira/browse/MAHOUT-1799
> Project: Mahout
>  Issue Type: Improvement
>  Components: spark
>Reporter: Jussi Jousimo
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 1.0.0
>
>
> Since some row vectors in a sparse matrix can be null, Mahout writes them out
> to a file with the row label only. However, Mahout cannot read these files
> back; it throws an exception when it encounters a label-only row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local

2016-03-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1679:
---
Fix Version/s: (was: 0.12.0)
   1.0.0
   Issue Type: Improvement  (was: Bug)

> example script run-item-sim should work on hdfs as well as local
> 
>
> Key: MAHOUT-1679
> URL: https://issues.apache.org/jira/browse/MAHOUT-1679
> Project: Mahout
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 0.10.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 1.0.0
>
>
> mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster 
> Spark + HDFS
> It prints a warning and instructions for running on a cluster, but it should
> just work in either mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup

2016-03-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199694#comment-15199694
 ] 

Pat Ferrel commented on MAHOUT-1762:


Do you know of something that is blocked by this? Not sure what is being asked 
for.

> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> ---
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
>  Issue Type: Improvement
>  Components: spark
>Reporter: Sergey Tryuber
>Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
>  is aimed to contain global configuration for Spark cluster. For example, in 
> our HDP2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0–2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041
> {noformat}
> and there are many other good things. Naturally, it is expected that when a
> user starts the Spark shell, it will work fine. Unfortunately this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
> This happens because 
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
>  is executed directly in [initialization 
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is indirectly invoked through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] 
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
>  contains an additional initialization layer for loading properties file (see 
> SparkSubmitArguments#mergeDefaultSparkProperties method).
> So there are two possible solutions:
> * use proper Spark-like initialization logic
> * use thin envelope like it is in H2O Sparkling Water 
> ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup

2016-03-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199672#comment-15199672
 ] 

Pat Ferrel commented on MAHOUT-1762:


I agree with the reasoning for this, but the drivers have a pass-through to
Spark for arbitrary key=value pairs, and switching to spark-submit was voted
down, so it was never done. If you are using Mahout as a lib you can set
anything you want in the SparkConf, so I'm not sure what remains here beyond a
more than reasonable complaint about how the launcher scripts are structured.

> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> ---
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
>  Issue Type: Improvement
>  Components: spark
>Reporter: Sergey Tryuber
>Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
>  is aimed to contain global configuration for Spark cluster. For example, in 
> our HDP2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0–2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041
> {noformat}
> and there are many other good things. Naturally, it is expected that when a
> user starts the Spark shell, it will work fine. Unfortunately this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
> This happens because 
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
>  is executed directly in [initialization 
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is indirectly invoked through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] 
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
>  contains an additional initialization layer for loading properties file (see 
> SparkSubmitArguments#mergeDefaultSparkProperties method).
> So there are two possible solutions:
> * use proper Spark-like initialization logic
> * use thin envelope like it is in H2O Sparkling Water 
> ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib

2016-03-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199754#comment-15199754
 ] 

Pat Ferrel commented on MAHOUT-1689:


Done, many times over by several people. Mine is here: 
http://mahout.apache.org/users/environment/how-to-build-an-app.html

> Create a doc on how to write an app that uses Mahout as a lib
> -
>
> Key: MAHOUT-1689
> URL: https://issues.apache.org/jira/browse/MAHOUT-1689
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> Create a doc on how to write an app that uses Mahout as a lib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2016-03-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1788:
---
Fix Version/s: (was: 0.12.0)
   1.0.0
   Issue Type: Improvement  (was: Bug)

Work on this as time is available; it's not blocking anything IMO.

> spark-itemsimilarity integration test script cleanup
> 
>
> Key: MAHOUT-1788
> URL: https://issues.apache.org/jira/browse/MAHOUT-1788
> Project: Mahout
>  Issue Type: Improvement
>  Components: cooccurrence
>Affects Versions: 0.11.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Trivial
> Fix For: 1.0.0
>
>
> The binary release does not contain data for the itemsimilarity tests; neither
> binary nor source versions will run on a cluster unless data is hand-copied to
> HDFS.
> Clean this up so it copies data if needed and the data is in both versions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1799) Read null row vectors from file in TextDelimeterReaderWriter driver

2016-03-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199770#comment-15199770
 ] 

Pat Ferrel commented on MAHOUT-1799:


Can't test this or even merge it right now, so if someone else can merge, great;
otherwise it doesn't seem like a requirement for release, so unless someone
speaks up I'll push it to 1.0.

> Read null row vectors from file in TextDelimeterReaderWriter driver
> ---
>
> Key: MAHOUT-1799
> URL: https://issues.apache.org/jira/browse/MAHOUT-1799
> Project: Mahout
>  Issue Type: Improvement
>  Components: spark
>Reporter: Jussi Jousimo
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 1.0.0
>
>
> Since some row vectors in a sparse matrix can be null, Mahout writes them out
> to a file with the row label only. However, Mahout cannot read these files
> back; it throws an exception when it encounters a label-only row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup

2016-03-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1762.

Resolution: Won't Fix

We don't know of anything this blocks, and moving to spark-submit was voted
down; it would only apply to the Mahout CLI drivers anyway. All CLI drivers
support pass-through of arbitrary key=value pairs, which go into the SparkConf,
and when using Mahout as a lib you can create any arbitrary SparkConf.

Will not fix unless someone can explain the need.

> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> ---
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
>  Issue Type: Improvement
>  Components: spark
>Reporter: Sergey Tryuber
>Assignee: Pat Ferrel
> Fix For: 1.0.0
>
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
>  is aimed to contain global configuration for Spark cluster. For example, in 
> our HDP2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0–2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041
> {noformat}
> and there are many other good things. Naturally, it is expected that when a
> user starts the Spark shell, it will work fine. Unfortunately this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
> This happens because 
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
>  is executed directly in [initialization 
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is indirectly invoked through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] 
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
>  contains an additional initialization layer for loading properties file (see 
> SparkSubmitArguments#mergeDefaultSparkProperties method).
> So there are two possible solutions:
> * use proper Spark-like initialization logic
> * use thin envelope like it is in H2O Sparkling Water 
> ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local

2016-03-18 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199746#comment-15199746
 ] 

Pat Ferrel commented on MAHOUT-1679:


This is just a test script that doesn't account for using HDFS and expects the
local FS, so it's not important.

> example script run-item-sim should work on hdfs as well as local
> 
>
> Key: MAHOUT-1679
> URL: https://issues.apache.org/jira/browse/MAHOUT-1679
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 0.10.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Minor
> Fix For: 1.0.0
>
>
> mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster 
> Spark + HDFS
> It prints a warning and instructions for running on a cluster, but it should
> just work in either mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup

2015-11-06 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1788:
--

 Summary: spark-itemsimilarity integration test script cleanup
 Key: MAHOUT-1788
 URL: https://issues.apache.org/jira/browse/MAHOUT-1788
 Project: Mahout
  Issue Type: Bug
  Components: cooccurrence
Affects Versions: 0.11.0
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Priority: Trivial
 Fix For: 0.12.0


The binary release does not contain data for the itemsimilarity tests; neither
binary nor source versions will run on a cluster unless data is hand-copied to HDFS.

Clean this up so it copies data if needed and the data is in both versions. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1785) Replace 'spark.kryoserializer.buffer.mb' from Spark config

2015-11-06 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994178#comment-14994178
 ] 

Pat Ferrel commented on MAHOUT-1785:


This happens because Spark is changing the way a conf param is used. The
warning seems to persist into Spark 1.6-SNAPSHOT, so if it works it's not a
blocker, and we have to keep the old key for Spark 1.4.1 or below.

So we can't do anything about this until we require Spark 1.5.1, which Mahout
0.11.1 does not, so defer this.
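
For reference, once Spark 1.5.1+ is required the change is only a config-key swap;
a minimal sketch:

{code}
import org.apache.spark.SparkConf

object KryoBufferConf {
  // Deprecated key, value in MB:  spark.kryoserializer.buffer.mb = 32
  // Replacement key takes a size string instead:
  val conf: SparkConf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryoserializer.buffer", "32m")
}
{code}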

> Replace 'spark.kryoserializer.buffer.mb' from Spark config
> --
>
> Key: MAHOUT-1785
> URL: https://issues.apache.org/jira/browse/MAHOUT-1785
> Project: Mahout
>  Issue Type: Improvement
>  Components: Mahout spark shell
>Affects Versions: 0.11.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Trivial
> Fix For: 0.12.0
>
>
> 'spark.kryoserializer.buffer.mb' has been deprecated as of spark 1.4 and 
> should be replaced by 'spark.kryoserializer.buffer'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1785) Replace 'spark.kryoserializer.buffer.mb' from Spark config

2015-11-06 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1785:
---
Fix Version/s: (was: 0.11.1)
   0.12.0

> Replace 'spark.kryoserializer.buffer.mb' from Spark config
> --
>
> Key: MAHOUT-1785
> URL: https://issues.apache.org/jira/browse/MAHOUT-1785
> Project: Mahout
>  Issue Type: Improvement
>  Components: Mahout spark shell
>Affects Versions: 0.11.0
>Reporter: Suneel Marthi
>Assignee: Suneel Marthi
>Priority: Trivial
> Fix For: 0.12.0
>
>
> 'spark.kryoserializer.buffer.mb' has been deprecated as of spark 1.4 and 
> should be replaced by 'spark.kryoserializer.buffer'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-11-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1618.

Resolution: Fixed

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-11-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992930#comment-14992930
 ] 

Pat Ferrel commented on MAHOUT-1618:


full-featured "Universal Recommender" using Mahout cooccurrence with 
multimodality up the yazoo. Apache 2 licence.

https://github.com/PredictionIO/template-scala-parallel-universal-recommendation

Docs on the main site for the PIO framework and the README.md for the 
recommender

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup

2015-11-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1762:
---
Fix Version/s: (was: 0.12.0)
   1.0.0

> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> ---
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
>  Issue Type: Wish
>  Components: spark
>Reporter: Sergey Tryuber
> Fix For: 1.0.0
>
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
>  is aimed to contain global configuration for Spark cluster. For example, in 
> our HDP2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0–2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041
> {noformat}
> and there are many other good things. Naturally, it is expected that when a
> user starts the Spark shell, it will work fine. Unfortunately this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
> This happens because 
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
>  is executed directly in [initialization 
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is indirectly invoked through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] 
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
>  contains an additional initialization layer for loading properties file (see 
> SparkSubmitArguments#mergeDefaultSparkProperties method).
> So there are two possible solutions:
> * use proper Spark-like initialization logic
> * use thin envelope like it is in H2O Sparkling Water 
> ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup

2015-11-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1762:
---
Fix Version/s: 0.12.0

> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> ---
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
>  Issue Type: Wish
>  Components: spark
>Reporter: Sergey Tryuber
> Fix For: 0.12.0
>
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
>  is aimed to contain global configuration for Spark cluster. For example, in 
> our HDP2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0–2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041
> {noformat}
> and there are many other good things. Naturally, it is expected that when a
> user starts the Spark shell, it will work fine. Unfortunately this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
> This happens because 
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
>  is executed directly in [initialization 
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is indirectly invoked through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] 
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
>  contains an additional initialization layer for loading properties file (see 
> SparkSubmitArguments#mergeDefaultSparkProperties method).
> So there are two possible solutions:
> * use proper Spark-like initialization logic
> * use thin envelope like it is in H2O Sparkling Water 
> ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup

2015-11-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992093#comment-14992093
 ] 

Pat Ferrel commented on MAHOUT-1762:


Very good point. We need to move to spark-submit and away from directly 
creating the Spark context IMHO. I'd vote to put reworking the launcher code 
for the shell and drivers on the roadmap for 0.12.0.

> Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
> ---
>
> Key: MAHOUT-1762
> URL: https://issues.apache.org/jira/browse/MAHOUT-1762
> Project: Mahout
>  Issue Type: Wish
>  Components: spark
>Reporter: Sergey Tryuber
>
> [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties]
>  is aimed to contain global configuration for Spark cluster. For example, in 
> our HDP2.2 environment it contains:
> {noformat}
> spark.driver.extraJavaOptions  -Dhdp.version=2.2.0.0–2041
> spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041
> {noformat}
> and there are many other good things. Naturally, it is expected that when a
> user starts the Spark shell, it will work fine. Unfortunately this does not
> happen with the Mahout Spark shell, because it ignores the Spark configuration
> and the user has to copy-paste lots of options into _MAHOUT_OPTS_.
> This happens because 
> [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala]
>  is executed directly in [initialization 
> script|https://github.com/apache/mahout/blob/master/bin/mahout]:
> {code}
> "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" 
> "org.apache.mahout.sparkbindings.shell.Main" $@
> {code}
> In contrast, the Spark shell is indirectly invoked through spark-submit in the
> [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] 
> script:
> {code}
> "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@"
> {code}
> [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
>  contains an additional initialization layer for loading properties file (see 
> SparkSubmitArguments#mergeDefaultSparkProperties method).
> So there are two possible solutions:
> * use proper Spark-like initialization logic
> * use thin envelope like it is in H2O Sparkling Water 
> ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-08-11 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692591#comment-14692591
 ] 

Pat Ferrel edited comment on MAHOUT-1618 at 8/12/15 12:07 AM:
--

just created a project using PredictionIO's framework which integrates Spark, 
HBase, and Elasticsearch. Added Mahout cooccurrence and implemented the rest of 
the recommender. 

This is not only an OSS integration example but a running virtually turnkey 
recommender.

I could update the item and row similarity docs on Mahout a bit and point to 
the "template" as an example.

A new version will be released in a week or so that uses Mahout 0.11.0


was (Author: pferrel):
just created a project using PredictionIO's framework which integrates Spark, 
HBase, and Elasticsearch. Added Mahout cooccurrence and implemented the rest of 
the recommender. 

This is not only an OSS integration example but a running virtually turnkey 
recommender.

I could update the item and row similarity docs on Mahout a bit and point to 
the "template" as an example.

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-08-11 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692591#comment-14692591
 ] 

Pat Ferrel commented on MAHOUT-1618:


just created a project using PredictionIO's framework which integrates Spark, 
HBase, and Elasticsearch. Added Mahout cooccurrence and implemented the rest of 
the recommender. 

This is not only an OSS integration example but a running virtually turnkey 
recommender.

I could update the item and row similarity docs on Mahout a bit and point to 
the "template" as an example.

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.11.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]

2015-06-02 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1641.

   Resolution: Implemented
Fix Version/s: 0.10.1

Actually implemented this before I saw this Jira. 

> Add conversion from a RDD[(String, String)] to a Drm[Int]
> -
>
> Key: MAHOUT-1641
> URL: https://issues.apache.org/jira/browse/MAHOUT-1641
> Project: Mahout
>  Issue Type: Question
>  Components: spark
>Affects Versions: 0.9
>Reporter: Erlend Hamnaberg
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
> Fix For: 0.10.1, 0.11.0
>
>
> Hi.
> We are using the cooccurrence part of Mahout as a library. We get our data
> from other sources, for instance Cassandra. We don't want to write that
> data to disk and read it back, since we already have the data on each slave.
> I have created some conversion functions based on one of the
> IndexedDatasetSpark readers, can't remember which one at the moment.
> Is there interest in the community for this kind of feature? I can probably 
> clean it up and add this as a github pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]

2015-06-02 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel reopened MAHOUT-1641:

  Assignee: Pat Ferrel  (was: Dmitriy Lyubimov)

> Add conversion from a RDD[(String, String)] to a Drm[Int]
> -
>
> Key: MAHOUT-1641
> URL: https://issues.apache.org/jira/browse/MAHOUT-1641
> Project: Mahout
>  Issue Type: Question
>  Components: spark
>Affects Versions: 0.9
>Reporter: Erlend Hamnaberg
>Assignee: Pat Ferrel
>  Labels: DSL, scala, spark
> Fix For: 0.11.0
>
>
> Hi.
> We are using the cooccurrence part of Mahout as a library. We get our data
> from other sources, for instance Cassandra. We don't want to write that
> data to disk and read it back, since we already have the data on each slave.
> I have created some conversion functions based on one of the
> IndexedDatasetSpark readers, can't remember which one at the moment.
> Is there interest in the community for this kind of feature? I can probably 
> clean it up and add this as a github pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]

2015-06-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570044#comment-14570044
 ] 

Pat Ferrel commented on MAHOUT-1641:


Hmm didn't see this earlier. There is now a secondary "apply" constructor in 
the companion object for IndexedDatasetSpark that takes an RDD[(String, 
String)].

See here: 
https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala
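
For anyone landing here, a minimal sketch of what using that constructor might look 
like. The exact parameter list (optional dictionary arguments, the implicit context) 
should be checked against the linked source; the names below are illustrative only.

import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// pairs are (rowID, columnID) strings, e.g. (user, item) interactions
def toIndexedDataset(pairs: RDD[(String, String)])(implicit sc: SparkContext): IndexedDatasetSpark =
  // the companion apply builds the ID dictionaries and backing DRM; optional args omitted here
  IndexedDatasetSpark(pairs)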

> Add conversion from a RDD[(String, String)] to a Drm[Int]
> -
>
> Key: MAHOUT-1641
> URL: https://issues.apache.org/jira/browse/MAHOUT-1641
> Project: Mahout
>  Issue Type: Question
>  Components: spark
>Affects Versions: 0.9
>Reporter: Erlend Hamnaberg
>Assignee: Dmitriy Lyubimov
>  Labels: DSL, scala, spark
> Fix For: 0.11.0
>
>
> Hi.
> We are using the cooccurrence part of Mahout as a library. We get our data 
> from other sources, like for instance Cassandra. We don't want to write that 
> data to disk and read it back since we already have the data on each slave.
> I have created some conversion functions based on one of the 
> IndexedDatasetSpark readers, can't remember which one at the moment.
> Is there interest in the community for this kind of feature? I can probably 
> clean it up and add this as a github pull request.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1707) Spark-itemsimilarity uses too much memory

2015-05-22 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1707.

Resolution: Fixed

Removed the bad collect() call.
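
For reference, a minimal sketch (names are illustrative, not the actual driver code) 
of the difference the removal makes:

import org.apache.spark.rdd.RDD

def writeInteractions(interactions: RDD[(String, String)], path: String): Unit = {
  // Bad: collect() pulls every interaction into the driver JVM, so a large input
  // can OOM the client no matter how much executor memory is configured.
  // val all = interactions.collect()

  // Better: keep the data distributed and let Spark manage memory on the executors.
  interactions.map { case (user, item) => s"$user,$item" }.saveAsTextFile(path)
}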

> Spark-itemsimilarity uses too much memory
> -
>
> Key: MAHOUT-1707
> URL: https://issues.apache.org/jira/browse/MAHOUT-1707
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, cooccurrence
>Affects Versions: 0.10.0
> Environment: Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.10.1
>
>
> java.lang.OutOfMemoryError: Java heap space
> The code has an unnecessary .collect(), forcing all interaction data into 
> memory of the client/driver. Increasing the executor memory will not help 
> with this.
> remove this line and rebuild Mahout.
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157
> The errant line reads:
> interactions.collect()
> This forces the user action data into memory, a bad thing for memory 
> consumption. Removing it should allow for better Spark memory management.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1708) Replace Google/Guava in mahout-math and mahout-hdfs

2015-05-19 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1708:
---
Summary: Replace Google/Guava in mahout-math and mahout-hdfs  (was: Replace 
Preconditions with asserts for Spark code)

> Replace Google/Guava in mahout-math and mahout-hdfs
> ---
>
> Key: MAHOUT-1708
> URL: https://issues.apache.org/jira/browse/MAHOUT-1708
> Project: Mahout
>  Issue Type: Bug
>  Components: Hdfs, Math
>Affects Versions: 0.10.0
> Environment: Spark
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
> Fix For: 0.10.1
>
>
> All use of Guava has been removed from the code used with Spark except the 
> use of Preconditions. These are pretty easy to replace.
> 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark 
> dependency-reduced assembly.
> 2) You will now get compile errors for math and hdfs, so remove the imports 
> and replace any Preconditions with asserts.
> Not sure how many errors from replacing these will be caught by unit tests, so 
> be careful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MAHOUT-1708) Replace Preconditions with asserts for Spark code

2015-05-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550671#comment-14550671
 ] 

Pat Ferrel edited comment on MAHOUT-1708 at 5/19/15 4:01 PM:
-

AbstractIterator and Map uses from guava are also problems here (Andrew's 
comments on IM)

[~andrew.musselman] can you create a PR branch so others can help with this?


was (Author: pferrel):
[~andrew.musselman] can you create a PR branch so others can help with this?

> Replace Preconditions with asserts for Spark code
> -
>
> Key: MAHOUT-1708
> URL: https://issues.apache.org/jira/browse/MAHOUT-1708
> Project: Mahout
>  Issue Type: Bug
>  Components: Hdfs, Math
>Affects Versions: 0.10.0
> Environment: Spark
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
> Fix For: 0.10.1
>
>
> All use of Guava has been removed from the code used with Spark except the 
> use of Preconditions. These are pretty easy to replace.
> 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark 
> dependency-reduced assembly.
> 2) You will now get compile errors for math and hdfs, so remove the imports 
> and replace any Preconditions with asserts.
> Not sure how many errors from replacing these will be caught by unit tests, so 
> be careful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1708) Replace Preconditions with asserts for Spark code

2015-05-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550671#comment-14550671
 ] 

Pat Ferrel commented on MAHOUT-1708:


[~andrew.musselman] can you create a PR branch so others can help with this?

> Replace Preconditions with asserts for Spark code
> -
>
> Key: MAHOUT-1708
> URL: https://issues.apache.org/jira/browse/MAHOUT-1708
> Project: Mahout
>  Issue Type: Bug
>  Components: Hdfs, Math
>Affects Versions: 0.10.0
> Environment: Spark
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
> Fix For: 0.10.1
>
>
> All use of Guava has been removed from the code used with Spark except the 
> use of Preconditions. These are pretty easy to replace.
> 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark 
> dependency-reduced assembly.
> 2) You will now get compile errors for math and hdfs, so remove the imports 
> and replace any Preconditions with asserts.
> Not sure how many errors from replacing these will be caught by unit tests, so 
> be careful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1708) Replace Preconditions with asserts for Spark code

2015-05-18 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1708:
--

 Summary: Replace Preconditions with asserts for Spark code
 Key: MAHOUT-1708
 URL: https://issues.apache.org/jira/browse/MAHOUT-1708
 Project: Mahout
  Issue Type: Bug
  Components: Hdfs, Math
Affects Versions: 0.10.0
 Environment: Spark
Reporter: Pat Ferrel
Assignee: Andrew Musselman
 Fix For: 0.10.1


All use of Guava has been removed from the code used with Spark except the use 
of Preconditions. These are pretty easy to replace.

1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark 
dependency-reduced assembly.
2) You will now get compile errors for math and hdfs, so remove the imports and 
replace any Preconditions with asserts.

Not sure how many errors from replacing these will be caught by unit tests, so 
be careful.
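
The affected modules are Java, so the actual edits swap Guava's Preconditions for 
plain asserts or explicit IllegalArgumentExceptions. Purely as a sketch of the shape 
of the substitution (illustrative names, Scala used for brevity):

// Before (Guava):  Preconditions.checkArgument(numRows > 0, "numRows must be positive")
// After, with no Guava dependency; require throws IllegalArgumentException like checkArgument does.
def checkedRows(numRows: Int): Int = {
  require(numRows > 0, "numRows must be positive")
  numRows
}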



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1707) Spark-itemsimilarity uses too much memory

2015-05-13 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1707:
---
Description: 
java.lang.OutOfMemoryError: Java heap space

The code has an unnecessary .collect(), forcing all interaction data into 
memory of the client/driver. Increasing the executor memory will not help with 
this.

remove this line and rebuild Mahout.
https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157

The errant line reads:

interactions.collect()

This forces the user action data into memory, a bad thing for memory 
consumption. Removing it should allow for better Spark memory management.

  was:
java.lang.OutOfMemoryError: Java heap space

The code has an unnecessary .collect(), forcing all interaction data into 
memory of the client/driver. Increasing the executor memory will not help with 
this.

remove this line and rebuild Mahout.
https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157

The errant line reads:

interactions.collect()

This forces the user action data into memory, a bad thing for memory 
consumption.


> Spark-itemsimilarity uses too much memory
> -
>
> Key: MAHOUT-1707
> URL: https://issues.apache.org/jira/browse/MAHOUT-1707
> Project: Mahout
>  Issue Type: Bug
>  Components: Collaborative Filtering, cooccurrence
>Affects Versions: 0.10.0
> Environment: Spark
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
> Fix For: 0.10.1
>
>
> java.lang.OutOfMemoryError: Java heap space
> The code has an unnecessary .collect(), forcing all interaction data into 
> memory of the client/driver. Increasing the executor memory will not help 
> with this.
> remove this line and rebuild Mahout.
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157
> The errant line reads:
> interactions.collect()
> This forces the user action data into memory, a bad thing for memory 
> consumption. Removing it should allow for better Spark memory management.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1707) Spark-itemsimilarity uses too much memory

2015-05-13 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1707:
--

 Summary: Spark-itemsimilarity uses too much memory
 Key: MAHOUT-1707
 URL: https://issues.apache.org/jira/browse/MAHOUT-1707
 Project: Mahout
  Issue Type: Bug
  Components: Collaborative Filtering, cooccurrence
Affects Versions: 0.10.0
 Environment: Spark
Reporter: Pat Ferrel
Assignee: Pat Ferrel
 Fix For: 0.10.1


java.lang.OutOfMemoryError: Java heap space

The code has an unnecessary .collect(), forcing all interaction data into 
memory of the client/driver. Increasing the executor memory will not help with 
this.

remove this line and rebuild Mahout.
https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157

The errant line reads:

interactions.collect()

This forces the user action data into memory, a bad thing for memory 
consumption.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib

2015-04-19 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502271#comment-14502271
 ] 

Pat Ferrel commented on MAHOUT-1689:


[~Andrew_Palumbo], I have the example almost ready. Do you have a page for it? 
I'm planning an Example to go with the, errr, example. It will be one of the GitHub 
downloads into the Examples directory. I will also create an mscala script as another 
way to run it.

> Create a doc on how to write an app that uses Mahout as a lib
> -
>
> Key: MAHOUT-1689
> URL: https://issues.apache.org/jira/browse/MAHOUT-1689
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.11.0
>
>
> Create a doc on how to write an app that uses Mahout as a lib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+

2015-04-13 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492498#comment-14492498
 ] 

Pat Ferrel commented on MAHOUT-1685:


If you read between the lines of Sean's reply, he is saying none of it is meant 
to be a supported "API", which I take to mean they give no indication of change 
or deprecation (rather obvious). They have no intent to make it public again, so 
if we don't work around it we'll have to petition for a supported API.

Some obvious solutions, without looking too deeply:
1) Create our own shell from the Scala REPL, maybe using Spark's shell as a 
template. Pro: we depend on the Scala REPL + supported Spark APIs. Downside: 
this is a much bigger chunk of code than the current shell.
2) Can we turn the shell into a .mscala-type scala-as-script extension to the 
Spark shell? This would obviously require a lot of imports and the compile 
delay at every load. Upside: it goes through supported APIs that are less 
likely to change. Downside: little control over initialization of the context 
and Kryo.
3) Petition them to support the API we use. This is by far the easiest and 
seems like it might be worth writing a Jira in Spark, if only to get their 
response.

> Move Mahout shell to Spark 1.3+
> ---
>
> Key: MAHOUT-1685
> URL: https://issues.apache.org/jira/browse/MAHOUT-1685
> Project: Mahout
>  Issue Type: Improvement
>  Components: Mahout spark shell
>Reporter: Pat Ferrel
>Assignee: Dmitriy Lyubimov
>Priority: Critical
> Fix For: 0.11.0
>
> Attachments: mahout-shell-spark-1.3-errors.txt
>
>
> Building for Spark 1.3 we found several important APIs used by the shell are 
> now marked package private in Spark, making them inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib

2015-04-12 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491795#comment-14491795
 ] 

Pat Ferrel commented on MAHOUT-1689:


This will be an example of using cooccurrence on many inputs; the CLI supports 
only 2. I will try to do it as a project and as an mscala file (see the sketch below).
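
A rough sketch of the library call the example would be built around, assuming the 
SimilarityAnalysis.cooccurrencesIDSs entry point in math-scala; argument names, 
defaults, and the return type are from memory and should be checked against the 
0.10.x source:

import org.apache.mahout.math.cf.SimilarityAnalysis
import org.apache.mahout.math.indexeddataset.IndexedDataset

// The first dataset drives the primary (user x item) space; the rest are
// cross-cooccurrence inputs. One indicator matrix is returned per input.
def crossCooccurrence(primary: IndexedDataset, secondary: Seq[IndexedDataset]): List[IndexedDataset] =
  SimilarityAnalysis.cooccurrencesIDSs((primary +: secondary).toArray)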

> Create a doc on how to write an app that uses Mahout as a lib
> -
>
> Key: MAHOUT-1689
> URL: https://issues.apache.org/jira/browse/MAHOUT-1689
> Project: Mahout
>  Issue Type: Documentation
>Affects Versions: 0.10.0
>Reporter: Andrew Palumbo
>Assignee: Pat Ferrel
> Fix For: 0.11.0
>
>
> Create a doc on how to write an app that uses Mahout as a lib



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+

2015-04-12 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491790#comment-14491790
 ] 

Pat Ferrel commented on MAHOUT-1685:


Should we ask Spark why this needs to be private? I wonder if [~sowen] knows. 
Sean, this is the Mahout-extended Spark REPL; the APIs it needs are now private.

> Move Mahout shell to Spark 1.3+
> ---
>
> Key: MAHOUT-1685
> URL: https://issues.apache.org/jira/browse/MAHOUT-1685
> Project: Mahout
>  Issue Type: Improvement
>  Components: Mahout spark shell
>Reporter: Pat Ferrel
>Assignee: Dmitriy Lyubimov
>Priority: Critical
> Fix For: 0.11.0
>
> Attachments: mahout-shell-spark-1.3-errors.txt
>
>
> Building for Spark 1.3 we found several important APIs used by the shell are 
> now marked package private in Spark, making them inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+

2015-04-12 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491788#comment-14491788
 ] 

Pat Ferrel commented on MAHOUT-1685:


This is shell-specific; we probably fixed the rest, but since the shell doesn't 
compile we haven't tested the other parts.

> Move Mahout shell to Spark 1.3+
> ---
>
> Key: MAHOUT-1685
> URL: https://issues.apache.org/jira/browse/MAHOUT-1685
> Project: Mahout
>  Issue Type: Improvement
>  Components: Mahout spark shell
>Reporter: Pat Ferrel
>Assignee: Dmitriy Lyubimov
>Priority: Critical
> Fix For: 0.11.0
>
> Attachments: mahout-shell-spark-1.3-errors.txt
>
>
> Building for Spark 1.3 we found several important APIs used by the shell are 
> now marked package private in Spark, making them inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+

2015-04-12 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491623#comment-14491623
 ] 

Pat Ferrel commented on MAHOUT-1685:


[~Andrew_Palumbo] can you attach the errors you saw here? IMO we really need to 
get the shell working; it's a big feature and the distros are already on 1.2. 
By the time we get 0.10.1 out they may be on 1.4. We definitely don't want to 
drop the shell.

> Move Mahout shell to Spark 1.3+
> ---
>
> Key: MAHOUT-1685
> URL: https://issues.apache.org/jira/browse/MAHOUT-1685
> Project: Mahout
>  Issue Type: Bug
>  Components: Mahout spark shell
>Affects Versions: 0.10.1
>Reporter: Pat Ferrel
>Assignee: Dmitriy Lyubimov
>Priority: Critical
> Fix For: 0.10.1
>
>
> Building for Spark 1.3 we found several important APIs used by the shell are 
> now marked package private in Spark, making them inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1685) Move Mahout shell to Spark 1.3+

2015-04-12 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1685:
--

 Summary: Move Mahout shell to Spark 1.3+
 Key: MAHOUT-1685
 URL: https://issues.apache.org/jira/browse/MAHOUT-1685
 Project: Mahout
  Issue Type: Bug
  Components: Mahout spark shell
Affects Versions: 0.10.1
Reporter: Pat Ferrel
Assignee: Dmitriy Lyubimov
Priority: Critical
 Fix For: 0.10.1


Building for Spark 1.3 we found several important APIs used by the shell are 
now marked package private in Spark, making them inaccessible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local

2015-04-09 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1679:
--

 Summary: example script run-item-sim should work on hdfs as well 
as local
 Key: MAHOUT-1679
 URL: https://issues.apache.org/jira/browse/MAHOUT-1679
 Project: Mahout
  Issue Type: Bug
  Components: Examples
Affects Versions: 0.10.0
Reporter: Pat Ferrel
Assignee: Pat Ferrel
Priority: Minor
 Fix For: 0.10.1


mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster 
(Spark + HDFS).

It prints a warning and instructions for running on a cluster, but it should just 
work in either mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1678) Hadoop 1 build broken

2015-04-08 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1678:
--

 Summary: Hadoop 1 build broken
 Key: MAHOUT-1678
 URL: https://issues.apache.org/jira/browse/MAHOUT-1678
 Project: Mahout
  Issue Type: Bug
  Components: Integration
Affects Versions: 0.10.0
Reporter: Pat Ferrel
Assignee: Suneel Marthi
Priority: Blocker
 Fix For: 0.10.0


Building for H1 got the error below, which blocks build tests for H1.

 T E S T S
---
Running org.apache.mahout.clustering.TestClusterDumper
Running org.apache.mahout.clustering.TestClusterEvaluator
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.033 sec - in 
org.apache.mahout.clustering.TestClusterDumper
Running org.apache.mahout.clustering.cdbw.TestCDbwEvaluator
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.089 sec - 
in org.apache.mahout.clustering.cdbw.TestCDbwEvaluator
Running 
org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarityTest
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.701 sec - in 
org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarityTest
Running org.apache.mahout.text.LuceneStorageConfigurationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.903 sec - in 
org.apache.mahout.text.LuceneStorageConfigurationTest
Running org.apache.mahout.text.LuceneSegmentInputSplitTest
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 3.552 sec <<< 
FAILURE! - in org.apache.mahout.text.LuceneSegmentInputSplitTest
testGetSegment(org.apache.mahout.text.LuceneSegmentInputSplitTest)  Time 
elapsed: 2.248 sec  <<< ERROR!
java.lang.NoSuchMethodError: 
org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem;
at 
__randomizedtesting.SeedInfo.seed([B6AAF6EC1A001636:33AA49EC475E421B]:0)
at 
org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:58)
at 
org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92)
at 
org.apache.mahout.text.LuceneSegmentInputSplitTest.assertSegmentContainsOneDoc(LuceneSegmentInputSplitTest.java:81)
at 
org.apache.mahout.text.LuceneSegmentInputSplitTest.testGetSegment(LuceneSegmentInputSplitTest.java:59)

testGetSegmentNonExistingSegment(org.apache.mahout.text.LuceneSegmentInputSplitTest)
  Time elapsed: 0.958 sec  <<< ERROR!
java.lang.NoSuchMethodError: 
org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem;
at 
__randomizedtesting.SeedInfo.seed([B6AAF6EC1A001636:F16E11692CC0C088]:0)
at 
org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:58)
at 
org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92)
at 
org.apache.mahout.text.LuceneSegmentInputSplitTest.testGetSegmentNonExistingSegment(LuceneSegmentInputSplitTest.java:76)

Running org.apache.mahout.text.SequenceFilesFromLuceneStorageTest
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 29.904 sec - 
in org.apache.mahout.clustering.TestClusterEvaluator
Running org.apache.mahout.text.LuceneSegmentRecordReaderTest
Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 5.239 sec <<< 
FAILURE! - in org.apache.mahout.text.LuceneSegmentRecordReaderTest
testNonExistingIdField(org.apache.mahout.text.LuceneSegmentRecordReaderTest)  
Time elapsed: 2.588 sec  <<< ERROR!
java.lang.NoSuchMethodError: 
org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem;
at 
__randomizedtesting.SeedInfo.seed([BE4E63CDB556DEFF:25483164126E6A9]:0)
at 
org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:58)
at 
org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92)
at 
org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:55)
at 
org.apache.mahout.text.LuceneSegmentRecordReaderTest.testNonExistingIdField(LuceneSegmentRecordReaderTest.java:93)

testNonExistingField(org.apache.mahout.text.LuceneSegmentRecordReaderTest)  
Time elapsed: 1.188 sec  <<< ERROR!
java.lang.NoSuchMethodError: 
org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem;
at 
__randomizedtesting.SeedInfo.seed([BE4E63CDB556DEFF:4252F6007B6F27B1]:0)
at 
org.apache.solr.store.hdfs.HdfsDirectory.<init>(HdfsDirectory.java:58)
at 
org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92)
at 
org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:55)
at 
org.apache.ma

[jira] [Resolved] (MAHOUT-1674) A'A fails getting with an index out of range for a row vector

2015-04-07 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1674.

Resolution: Fixed
  Assignee: Pat Ferrel  (was: Dmitriy Lyubimov)

Made a change to blas that catches this case; it passes one user's test that I was 
able to reproduce.

> A'A fails getting with an index out of range for a row vector
> -
>
> Key: MAHOUT-1674
> URL: https://issues.apache.org/jira/browse/MAHOUT-1674
> Project: Mahout
>  Issue Type: Bug
>  Components: s
>Affects Versions: 0.10.0
>Reporter: Pat Ferrel
>Assignee: Pat Ferrel
>Priority: Critical
> Fix For: 0.10.0
>
>
> A'A and possibly A'B can fail with an index out of bounds on the row vector. 
> This seems related to partitioning where some partitions may be empty.
> This can be reproduced with the attached data as input into 
> spark-itemsimilarity. This is only A data and the one large csv will complete 
> correctly but passing in the directory of part files will exhibit the error. 
> The data is identical except in the number of files that are used to contain 
> the data.
> The error occurs using the local raw filesystem and with master = local and 
> is pretty fast to reach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1512) Hadoop 2 compatibility

2015-04-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396317#comment-14396317
 ] 

Pat Ferrel commented on MAHOUT-1512:


Was there work done recently? I failed on 2.6 last Friday, 2-3-2015. If someone 
has a known good install of 2.6 on a pseudo-cluster or better, I can provide a 
simple test.

> Hadoop 2 compatibility
> --
>
> Key: MAHOUT-1512
> URL: https://issues.apache.org/jira/browse/MAHOUT-1512
> Project: Mahout
>  Issue Type: Task
>Reporter: Sebastian Schelter
>Assignee: Suneel Marthi
>Priority: Critical
>  Labels: legacy, scala
> Fix For: 0.10.0
>
>
> We must ensure that all our MR code also runs on Hadoop 2. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1588) Multiple input path support in recommendation job

2015-04-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1588:
---
Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

> Multiple input path support in recommendation job
> -
>
> Key: MAHOUT-1588
> URL: https://issues.apache.org/jira/browse/MAHOUT-1588
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Xiaomeng Huang
>Assignee: Pat Ferrel
>Priority: Minor
>  Labels: legacy
> Fix For: 0.10.0
>
> Attachments: Mahout-1588.000.patch
>
>
> Currently the recommendation job can only take an input path via "--input" and can't 
> load files from a different path. Customers may put preference data in different 
> paths; this is a very common scenario.
> I added an option named "--multiInput (-mi)" and did not remove the original input 
> option. The two input options can be set together, and the modification only 
> touches PreparePreferenceMatrixJob, which loads data from the filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1588) Multiple input path support in recommendation job

2015-04-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396298#comment-14396298
 ] 

Pat Ferrel commented on MAHOUT-1588:


Does this work for all recommender CLIs?

The new spark-itemsimilarity already has a flexible method for passing in 
multiple directories and files, even supporting recursive regex discovery of 
input.

This is too large for 0.10.0 and may not be of enough importance to test for a 
later release.

If the contributor feels this is important, please create a PR and include 
tests.

> Multiple input path support in recommendation job
> -
>
> Key: MAHOUT-1588
> URL: https://issues.apache.org/jira/browse/MAHOUT-1588
> Project: Mahout
>  Issue Type: New Feature
>  Components: Collaborative Filtering
>Affects Versions: 0.9
>Reporter: Xiaomeng Huang
>Assignee: Pat Ferrel
>Priority: Minor
>  Labels: legacy
> Fix For: 0.10.0
>
> Attachments: Mahout-1588.000.patch
>
>
> Currently the recommendation job can only take an input path via "--input" and can't 
> load files from a different path. Customers may put preference data in different 
> paths; this is a very common scenario.
> I added an option named "--multiInput (-mi)" and did not remove the original input 
> option. The two input options can be set together, and the modification only 
> touches PreparePreferenceMatrixJob, which loads data from the filesystem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1674) A'A fails getting with an index out of range for a row vector

2015-04-05 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396290#comment-14396290
 ] 

Pat Ferrel commented on MAHOUT-1674:


[~dlie...@gmail.com] will not be able to fix this until 0.10.1, so [~pferrel] is 
looking for some guidance on a short-term workaround.

The reason this is hard to ignore is that two users are gathering data with 
Spark streaming, which tends to create lots of small files, and they have run 
into this error. Kafka (or other) to Spark Streaming will be an increasingly 
popular method for input to cooccurrence calculation.

The only known workaround is to concatenate input files before reading them 
into Mahout (a sketch follows). This has been verified in only one case.
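
One possible way to do that concatenation with Spark itself rather than shell 
tools; the paths here are placeholders:

import org.apache.spark.SparkContext

// Read all the small part files (e.g. produced by Spark Streaming) and rewrite
// them as a single partition so Mahout sees one file instead of many tiny ones.
def concatParts(sc: SparkContext, inDir: String, outDir: String): Unit =
  sc.textFile(inDir + "/part-*")
    .coalesce(1)
    .saveAsTextFile(outDir)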



> A'A fails getting with an index out of range for a row vector
> -
>
> Key: MAHOUT-1674
> URL: https://issues.apache.org/jira/browse/MAHOUT-1674
> Project: Mahout
>  Issue Type: Bug
>  Components: s
>Affects Versions: 0.10.0
>Reporter: Pat Ferrel
>Assignee: Dmitriy Lyubimov
>Priority: Critical
> Fix For: 0.10.0
>
>
> A'A and possibly A'B can fail with an index out of bounds on the row vector. 
> This seems related to partitioning where some partitions may be empty.
> This can be reproduced with the attached data as input into 
> spark-itemsimilarity. This is only A data and the one large csv will complete 
> correctly but passing in the directory of part files will exhibit the error. 
> The data is identical except in the number of files that are used to contain 
> the data.
> The error occurs using the local raw filesystem and with master = local and 
> is pretty fast to reach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1674) A'A fails getting with an index out of range for a row vector

2015-04-05 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1674:
--

 Summary: A'A fails getting with an index out of range for a row 
vector
 Key: MAHOUT-1674
 URL: https://issues.apache.org/jira/browse/MAHOUT-1674
 Project: Mahout
  Issue Type: Bug
  Components: s
Affects Versions: 0.10.0
Reporter: Pat Ferrel
Assignee: Dmitriy Lyubimov
Priority: Critical
 Fix For: 0.10.0


A'A and possibly A'B can fail with an index out of bounds on the row vector. 
This seems related to partitioning where some partitions may be empty.

This can be reproduced with the attached data as input into 
spark-itemsimilarity. This is only A data and the one large csv will complete 
correctly but passing in the directory of part files will exhibit the error. 
The data is identical except in the number of files that are used to contain 
the data.

The error occurs using the local raw filesystem and with master = local and is 
pretty fast to reach. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1655) Refactor module dependencies

2015-04-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1655.

Resolution: Fixed

Finished refactoring. IOUtils seems mostly anachronistic. The only thing used 
currently in Scala must be the VectorWritable-to-Vector conversion, and that 
might be replaced with a couple of lines of Scala, but the class is small so it's 
not a big deal.
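
The "couple of lines of Scala" might look roughly like this; the sequence file key 
type and the path are assumptions for illustration:

import org.apache.hadoop.io.Text
import org.apache.mahout.math.{Vector, VectorWritable}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def loadVectors(sc: SparkContext, path: String): RDD[Vector] =
  sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
    // VectorWritable.get() unwraps the Mahout Vector; clone it if you keep references
    // around, since Hadoop reuses Writable instances during iteration.
    .map { case (_, vw) => vw.get() }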

> Refactor module dependencies
> 
>
> Key: MAHOUT-1655
> URL: https://issues.apache.org/jira/browse/MAHOUT-1655
> Project: Mahout
>  Issue Type: Improvement
>  Components: mrlegacy
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Andrew Musselman
>Priority: Critical
> Fix For: 0.10.0
>
>
> Make a new module, call it mahout-hadoop. Move anything there that is 
> currently in mrlegacy but used in math-scala or spark. Remove dependencies on 
> mrlegacy altogether if possible by using other core classes.
> The goal is to have math-scala and spark module depend on math, and a small 
> module called mahout-hadoop (much smaller than mrlegacy). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (MAHOUT-1646) Refactor out all possible mrlegacy dependencies from Scala code

2015-04-05 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel resolved MAHOUT-1646.

Resolution: Duplicate

duplicate of MAHOUT-1655

> Refactor out all possible mrlegacy dependencies from Scala code
> ---
>
> Key: MAHOUT-1646
> URL: https://issues.apache.org/jira/browse/MAHOUT-1646
> Project: Mahout
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 0.9
>Reporter: Pat Ferrel
>Assignee: Dmitriy Lyubimov
> Fix For: 0.10.1
>
>
> Scala/Spark code depends on the mrlegacy module even though very few things 
> are really used. move those needed pieces to math so as to remove this 
> dependency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1662) Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans

2015-04-02 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393418#comment-14393418
 ] 

Pat Ferrel commented on MAHOUT-1662:


I'm getting the "wrong FS" error with spark-itemsimilarity on hadoop 2.6 + 
spark 1.1.0 + yarn any relation? I have hdfs running, can see the input file 
with "hadoop fs -ls /input" and in the hadoop gui but get a wrong FS error when 
getting a file status in the code.

> Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans
> 
>
> Key: MAHOUT-1662
> URL: https://issues.apache.org/jira/browse/MAHOUT-1662
> Project: Mahout
>  Issue Type: Bug
>  Components: Examples, mrlegacy
>Affects Versions: 0.9
>Reporter: Shannon Quinn
>Assignee: Shannon Quinn
> Fix For: 0.10.0
>
>
> Received the following error when attempting to run DisplaySpectralKMeans:
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
> file://tmp/calculations/diagonal/part-r-0/tmp/calculations/diagonal/part-r-0,
>  expected: file:///
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1750)
>   at 
> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1774)
>   at 
> org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.<init>(SequenceFileValueIterator.java:56)
>   at 
> org.apache.mahout.clustering.spectral.VectorCache.load(VectorCache.java:115)
>   at 
> org.apache.mahout.clustering.spectral.MatrixDiagonalizeJob.runJob(MatrixDiagonalizeJob.java:77)
>   at 
> org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:170)
>   at 
> org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:117)
>   at 
> org.apache.mahout.clustering.display.DisplaySpectralKMeans.main(DisplaySpectralKMeans.java:76)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
> Tracked the origin of the bug to line 54 of SequenceFileVaultIterator. PR 
> which contains a fix is available; I would ask for independent verification 
> before merging it with master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-04-01 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391437#comment-14391437
 ] 

Pat Ferrel commented on MAHOUT-1618:


No, this is a full-blown example of integration with Solr and is a fairly big 
project. The doc I was referring to is a simple, actual quickstart and should be 
no more than a page.

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: cooccurrence
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.10.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MAHOUT-1667) Support Hadoop 1.2.1 in poms

2015-04-01 Thread Pat Ferrel (JIRA)
Pat Ferrel created MAHOUT-1667:
--

 Summary: Support Hadoop 1.2.1 in poms
 Key: MAHOUT-1667
 URL: https://issues.apache.org/jira/browse/MAHOUT-1667
 Project: Mahout
  Issue Type: Bug
  Components: build
Affects Versions: 0.10.0
Reporter: Pat Ferrel
Assignee: Suneel Marthi
Priority: Critical
 Fix For: 0.10.0


Need to support the build for Hadoop 1.2.1 with the hadoop1 profile in the poms. 
Errors for non-existent artifacts appear when running "mvn -Phadoop1 
-Dhadoop.version=1.2.1 clean install": hadoop-auth, which does not exist for 
Hadoop 1.2.1, along with hadoop-yarn and several other artifacts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-04-01 Thread Pat Ferrel (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391156#comment-14391156
 ] 

Pat Ferrel commented on MAHOUT-1618:


Skeleton code is written; it will have to wait until after 0.10.0 before it is added 
to the site.

> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: cooccurrence
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.10.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MAHOUT-1589) mahout.cmd has duplicated content

2015-04-01 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pat Ferrel updated MAHOUT-1589:
---
Resolution: Fixed
Status: Resolved  (was: Patch Available)

mahout.cmd prints a deprecation warning when run.

> mahout.cmd has duplicated content
> -
>
> Key: MAHOUT-1589
> URL: https://issues.apache.org/jira/browse/MAHOUT-1589
> Project: Mahout
>  Issue Type: Bug
>  Components: CLI
>Affects Versions: 0.9
> Environment: Windows
>Reporter: Venkat Ranganathan
>Assignee: Pat Ferrel
>  Labels: legacy, scala
> Fix For: 0.10.0
>
> Attachments: MAHOUT-1589.patch
>
>
> bin/mahout.cmd has duplicated contents.   Need to trim it



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (MAHOUT-1618) Cooccurrence Recommender example and documentation

2015-04-01 Thread Pat Ferrel (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on MAHOUT-1618 started by Pat Ferrel.
--
> Cooccurrence Recommender example and documentation 
> ---
>
> Key: MAHOUT-1618
> URL: https://issues.apache.org/jira/browse/MAHOUT-1618
> Project: Mahout
>  Issue Type: Documentation
>  Components: Examples
>Affects Versions: cooccurrence
>Reporter: Thejas Prasad
>Assignee: Pat Ferrel
>Priority: Trivial
>  Labels: DSL, cooccurence, scala, spark
> Fix For: 0.10.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

