[jira] [Created] (MAHOUT-2048) There are duplicate content pages which need redirects instead
Pat Ferrel created MAHOUT-2048: -- Summary: There are duplicate content pages which need redirects instead Key: MAHOUT-2048 URL: https://issues.apache.org/jira/browse/MAHOUT-2048 Project: Mahout Issue Type: Planned Work Components: website Affects Versions: 0.13.0 Reporter: Pat Ferrel Assignee: Andrew Musselman Fix For: 0.13.0
We have duplicated content in 3 places in the `website/` directory. We need one place for the real content and should replace the dups with redirects to the actual content. This looks like it may be true for several other pages, and honestly I'm not sure they are all needed, but there are many links out in the wild that point to the old paths for the CCO recommender pages, so we should do this for at least the ones below. Better yet, we may want to clean out any other dups unless someone knows a reason not to.
TLDR;
Actual content:
mahout/website/docs/latest/algorithms/recommenders/index.md
mahout/website/docs/latest/algorithms/recommenders/cco.md
Dups to be replaced with redirects to the above content. I vaguely remember all these different site structures, so there may be links to them in the wild.
mahout/website/recommender-overview.md => mahout/website/docs/latest/algorithms/recommenders/index.md
mahout/website/users/algorithms/intro-cooccurrence-spark.md => mahout/website/docs/latest/algorithms/recommenders/cco.md
mahout/website/users/recommender/quickstart.md => mahout/website/docs/latest/algorithms/recommenders/index.md
mahout/website/users/recommender/intro-cooccurrence-spark.md => mahout/website/docs/latest/algorithms/recommenders/cco.md
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
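One low-friction way to do the replacement, assuming the site is built with Jekyll and the jekyll-redirect-from plugin is enabled (an assumption; the actual site tooling may differ), is to reduce each dup to a front-matter-only stub. For example, mahout/website/recommender-overview.md would become:

```yaml
---
# front-matter-only stub; jekyll-redirect-from emits an HTML redirect page here
redirect_to: /docs/latest/algorithms/recommenders/index.html
---
```

The old URL then keeps working for links in the wild while the content lives in exactly one place.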
[jira] [Created] (MAHOUT-2023) Drivers broken, scopt classes not found
Pat Ferrel created MAHOUT-2023: -- Summary: Drivers broken, scopt classes not found Key: MAHOUT-2023 URL: https://issues.apache.org/jira/browse/MAHOUT-2023 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.13.1 Environment: any Reporter: Pat Ferrel Assignee: Pat Ferrel Priority: Blocker Fix For: 0.13.1 Type `mahout spark-itemsimilarity` after Mahout is installed properly and you get a fatal exception due to missing scopt classes. Probably a build issue related to the build looking for incorrect versions of scopt. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MAHOUT-2020) Maven repo structure compatibility with SBT
Pat Ferrel created MAHOUT-2020: -- Summary: Maven repo structure compatibility with SBT Key: MAHOUT-2020 URL: https://issues.apache.org/jira/browse/MAHOUT-2020 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.13.1 Environment: Creating a project from Maven-built Mahout using sbt. Made critical since it seems to block using Mahout with sbt; at least I have found no way to do it. Reporter: Pat Ferrel Assignee: Trevor Grant Priority: Critical Fix For: 0.13.1 The Maven repo should build: org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar (substitute the Spark version for -2.1, so -1.6, etc.). The build.sbt `libraryDependencies` line then will be: `"org.apache.mahout" %% "mahout-spark-2.1" % "0.13.1-SNAPSHOT"` This is parsed by sbt to yield the path: org/apache/mahout/mahout-spark-2.1/0.13.1-SNAPSHOT/mahout-spark-2.1_2.11-0.13.1-SNAPSHOT.jar The outcome of `mvn clean install` currently is something like: org/apache/mahout/mahout-spark/0.13.1-SNAPSHOT/mahout-spark-0.13.1-SNAPSHOT-spark_2.1.jar This has no effect on the package structure, only artifact naming and Maven repo structure. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
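For reference, sbt's `%%` operator is shorthand for appending the Scala binary version to the artifact name, so the dependency line above is equivalent to this explicit form (Scala 2.11 assumed):

```scala
// build.sbt: explicit single-% form of the %% line, with the binary version spelled out
libraryDependencies += "org.apache.mahout" % "mahout-spark-2.1_2.11" % "0.13.1-SNAPSHOT"
```

That expansion is why the artifact name produced by the Maven build has to carry the `_2.11` suffix for sbt to resolve it.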
[jira] [Created] (MAHOUT-2019) SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized
Pat Ferrel created MAHOUT-2019: -- Summary: SparseRowMatrix assign ops use for loops instead of iterateNonZero and so can be optimized Key: MAHOUT-2019 URL: https://issues.apache.org/jira/browse/MAHOUT-2019 Project: Mahout Issue Type: Bug Components: Math Affects Versions: 0.13.0 Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 0.13.1 DRMs get blockified into SparseRowMatrix instances if the density is low. But SRM inherits the implementation of methods like "assign" from AbstractMatrix, which uses nested for loops to traverse rows. For multiplying 2 matrices that are extremely sparse, the kind of data you see in collaborative filtering, this is extremely wasteful of execution time. Better to use a sparse vector's iterateNonZero iterator for some function types. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
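The optimization described above can be illustrated with a toy example (plain java.util maps standing in for Mahout's sparse vectors; none of this is the Mahout API): for any function f with f(x, 0) = x, such as addition, an assign only needs to visit the stored non-zeros, turning O(size) work per row into O(nnz).

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.DoubleBinaryOperator;

/** Toy sparse vector as index -> nonzero value. Illustrates why assign-like
 *  ops should iterate nonzeros when f(x, 0) == x (e.g. addition). */
class SparseAssignSketch {
    /** Dense-style assign: touches every index 0..size-1, like the nested
     *  loops inherited from AbstractMatrix. O(size). */
    static Map<Integer, Double> assignDense(Map<Integer, Double> a,
                                            Map<Integer, Double> b,
                                            int size, DoubleBinaryOperator f) {
        Map<Integer, Double> out = new TreeMap<>();
        for (int i = 0; i < size; i++) {
            double v = f.applyAsDouble(a.getOrDefault(i, 0.0), b.getOrDefault(i, 0.0));
            if (v != 0.0) out.put(i, v);
        }
        return out;
    }

    /** Sparse assign, valid when f(x, 0) == x: visits only b's stored
     *  nonzeros (the iterateNonZero analogue). O(nnz). */
    static Map<Integer, Double> assignSparse(Map<Integer, Double> a,
                                             Map<Integer, Double> b,
                                             DoubleBinaryOperator f) {
        Map<Integer, Double> out = new TreeMap<>(a);
        for (Map.Entry<Integer, Double> e : b.entrySet()) {
            double v = f.applyAsDouble(out.getOrDefault(e.getKey(), 0.0), e.getValue());
            if (v == 0.0) out.remove(e.getKey()); else out.put(e.getKey(), v);
        }
        return out;
    }
}
```

For two vectors of length 10^6 with a few hundred nonzeros each, the sparse path does a few hundred operations instead of a million, which is the gap the issue describes for CF-shaped data.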
[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1951: --- The jar isn't supposed to have all deps, only the ones not provided by the environment. In fact it is supposed to have the minimum. So it appears some of the provided classes for previous platforms (Spark etc.) have changed in newer versions? We then need to add to the dependency-reduced jar, but first check whether a newer version of some provided dep will fit the bill, or whether the dependency-reduced jar will bloat needlessly. What specifically is the error, and what is missing? > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MAHOUT-1988) scala 2.10 is hardcoded somewhere
[ https://issues.apache.org/jira/browse/MAHOUT-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16037567#comment-16037567 ] Pat Ferrel commented on MAHOUT-1988: Don't have time to look now but believe Scopt may hardcode 2.10. I know and use the 2.11 version and it is very little changed so putting one in the scala 2.10 profile and another in the 2.11 should be simple, no? > scala 2.10 is hardcoded somewhere > -- > > Key: MAHOUT-1988 > URL: https://issues.apache.org/jira/browse/MAHOUT-1988 > Project: Mahout > Issue Type: Bug >Affects Versions: 0.13.0 >Reporter: Andrew Palumbo >Priority: Blocker > Fix For: 0.13.1 > > > After building mahout against scala 2.11: > {code} > mvn clean install -Dscala.version=2.11.4 -Dscala.compat.version=2.11 > -Phadoop2 -DskipTests > {code} > ViennaCL jars are built hard-coded to scala 2.10. This is currently blocking > the 0.13.1 release. > {code} > mahout-h2o_2.11-0.13.1-SNAPSHOT.jar > mahout-hdfs-0.13.1-SNAPSHOT.jar > mahout-math-0.13.1-SNAPSHOT.jar > mahout-math-scala_2.11-0.13.1-SNAPSHOT.jar > mahout-mr-0.13.1-SNAPSHOT.jar > mahout-native-cuda_2.10-0.13.0-SNAPSHOT.jar > mahout-native-cuda_2.10-0.13.1-SNAPSHOT.jar > mahout-native-viennacl_2.10-0.13.1-SNAPSHOT.jar > mahout-native-viennacl-omp_2.10-0.13.1-SNAPSHOT.jar > mahout-spark_2.11-0.13.1-SNAPSHOT-dependency-reduced.jar > mahout-spark_2.11-0.13.1-SNAPSHOT.jar > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
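Per the comment above, putting one scopt artifact in each Scala profile could look like the sketch below (profile ids, scopt version, and placement are assumptions, not the actual Mahout pom; scopt's real coordinates are `com.github.scopt`):

```xml
<!-- sketch: pin the scopt artifact per Scala profile -->
<profiles>
  <profile>
    <id>scala-2.10</id>
    <dependencies>
      <dependency>
        <groupId>com.github.scopt</groupId>
        <artifactId>scopt_2.10</artifactId>
        <version>3.3.0</version>
      </dependency>
    </dependencies>
  </profile>
  <profile>
    <id>scala-2.11</id>
    <dependencies>
      <dependency>
        <groupId>com.github.scopt</groupId>
        <artifactId>scopt_2.11</artifactId>
        <version>3.3.0</version>
      </dependency>
    </dependencies>
  </profile>
</profiles>
```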
[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903739#comment-15903739 ] Pat Ferrel commented on MAHOUT-1951: Oops misnamed the commit message for MAHOUT-1950. The fix is in master, unit tested, and driver integration tested on remote spark and HDFS. Just removed line 83 in mahout/spark/src/main/scala/org/apache/mahout/sparkbindings/package.scala Needs to be tested thoroughly since I have no idea of the ramifications of removing the line. See [~Andrew_Palumbo] who added it but can't recall the reason either. Cross your fingers. > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1951. Resolution: Fixed Test thoroughly, not sure of side effects of the fix > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903327#comment-15903327 ] Pat Ferrel commented on MAHOUT-1951: [~Andrew_Palumbo] [~smarthi] There seems to be some question about who made the commit, but I'm sure it wasn't me. I have no idea what is causing this, as I said, and the only thing suspicious in it (the one before works, BTW) is the mahout jars line change in the Spark module. The rest of the changes are in Flink afaict. > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903320#comment-15903320 ] Pat Ferrel commented on MAHOUT-1951: A quick way to test this is:
1) Get Spark and HDFS running locally in pseudo-cluster mode.
2) Build the version of Mahout under test; I use simply "mvn clean install -DskipTests".
3) "hdfs dfs -rm -r test-results" to remove any old results.
4) Run the script below and look for exceptions in the output; they will look like the above errors.
#!/usr/bin/env bash
#begin script
mahout spark-itemsimilarity \
  --input test.csv \
  --output test-result \
  --master spark://Maclaurin.local:7077 \
  --filter1 purchase \
  --filter2 view \
  --itemIDColumn 2 \
  --rowIDColumn 0 \
  --filterColumn 1
#end-script
test.csv file for the script:
u1,purchase,iphone
u1,purchase,ipad
u2,purchase,nexus
u2,purchase,galaxy
u3,purchase,surface
u4,purchase,iphone
u4,purchase,galaxy
u1,view,iphone
u1,view,ipad
u1,view,nexus
u1,view,galaxy
u2,view,iphone
u2,view,ipad
u2,view,nexus
u2,view,galaxy
u3,view,surface
u3,view,nexus
u4,view,iphone
u4,view,ipad
> Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15903315#comment-15903315 ] Pat Ferrel commented on MAHOUT-1951: Scratch that PR. We do not have a fix for this, but I have narrowed down the commit where it first starts to occur. In Mahout 0.12.2 the drivers work with remote Spark. They work in all commits until https://github.com/apache/mahout/commit/8e0e8b5572e0d24c1930ed60fec6d02693b41575 which would say that something in this commit broke things. This is mainly Flink, but there is a change to how mahout jars are packaged, and the error is shown below. The error wording is a bit mysterious; it seems to be missing MahoutKryoRegistrator, but it could also be from a class that cannot be serialized, really not sure.
Exception in thread "main" 17/03/06 18:15:04 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.6): java.io.IOException: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1212)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:128)
at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:273)
at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:258)
at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:174)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:215)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1205)
... 11 more
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at scala.Option.map(Option.scala:145)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:123)
... 17 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.
[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897636#comment-15897636 ] Pat Ferrel commented on MAHOUT-1951: fix being tested in https://github.com/apache/mahout/pull/292 > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MAHOUT-1952) Allow pass-through of params for driver's CLI to spark-submit
Pat Ferrel created MAHOUT-1952: -- Summary: Allow pass-through of params for driver's CLI to spark-submit Key: MAHOUT-1952 URL: https://issues.apache.org/jira/browse/MAHOUT-1952 Project: Mahout Issue Type: New Feature Components: Classification, CLI, Collaborative Filtering Affects Versions: 0.13.0 Environment: CLI drivers launched from mahout script Reporter: Pat Ferrel Assignee: Pat Ferrel Priority: Minor Fix For: 0.13.1 Remove driver CLI args that are dups of what spark-submit can do, and allow pass-through of arbitrary extra CLI args to spark-submit using spark-submit parsing. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1951: --- Component/s: Collaborative Filtering Classification > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: Classification, CLI, Collaborative Filtering >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15897625#comment-15897625 ] Pat Ferrel commented on MAHOUT-1951: [~rawkintrevo] added the use of spark-submit to the Mahout script for launching the drivers. This potentially has some side effects, since much of the work of spark-submit was done in the drivers, and I am not sure if there is a way to pass params through to spark-submit. In other words, the driver may not permit unrecognized params on the command line. Therefore we will leave the drivers as they are, doing more work than they should, but mark this as deprecated and remove it in a future release. > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: CLI >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1951: --- User found the following error running the spark-itemsimilarity driver (affects the NB driver too) on a remote Spark master:
17/03/03 10:08:40 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, reco-master): java.io.IOException: org.apache.spark.SparkException: Failed to register classes with Kryo
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1212)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
...
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.sparkbindings.io.MahoutKryoRegistrator
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at org.apache.spark.serializer.KryoSerializer$$anonfun$newKryo$5.apply(KryoSerializer.scala:123)
at scala.Option.map(Option.scala:145)
at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:123)
When I run the exact same command on the 0.12.2 release distribution against the same Spark cluster, the command completes successfully.
My Environment is:
* Ubuntu 14.04
* Oracle-JDK 1.8.0_121
* Spark standalone cluster using this distribution: http://d3kbcqa49mib13.cloudfront.net/spark-1.6.3-bin-hadoop2.6.tgz
* Mahout 0.13.0-RC: https://repository.apache.org/content/repositories/orgapachemahout-1034/org/apache/mahout/apache-mahout-distribution/0.13.0/apache-mahout-distribution-0.13.0.tar.gz
> Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: CLI >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (MAHOUT-1951) Drivers don't run with remote Spark
Pat Ferrel created MAHOUT-1951: -- Summary: Drivers don't run with remote Spark Key: MAHOUT-1951 URL: https://issues.apache.org/jira/browse/MAHOUT-1951 Project: Mahout Issue Type: Bug Components: CLI Affects Versions: 0.13.0 Environment: The command line drivers spark-itemsimilarity and spark-naivebayes using a remote or pseudo-clustered Spark Reporter: Pat Ferrel Assignee: Pat Ferrel Priority: Blocker Fix For: 0.13.0 Missing classes when running these jobs because the dependencies-reduced jar, passed to Spark for serialization purposes, does not contain all needed classes. Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MAHOUT-1951) Drivers don't run with remote Spark
[ https://issues.apache.org/jira/browse/MAHOUT-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1951: --- Sprint: Jan/Feb-2017 > Drivers don't run with remote Spark > --- > > Key: MAHOUT-1951 > URL: https://issues.apache.org/jira/browse/MAHOUT-1951 > Project: Mahout > Issue Type: Bug > Components: CLI >Affects Versions: 0.13.0 > Environment: The command line drivers spark-itemsimilarity and > spark-naivebayes using a remote or pseudo-clustered Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Blocker > Fix For: 0.13.0 > > > Missing classes when running these jobs because the dependencies-reduced jar, > passed to Spark for serialization purposes, does not contain all needed > classes. > Found by a user. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MAHOUT-1940) Provide a Java API to SimilarityAnalysis and any other needed APIs
[ https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1940: --- Description: We want to port the functionality from org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy integration with a java project we will be creating that derives a similarity measure from the co-occurrence and cross-occurrence matrix. (was: We want to port the functionality from org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy integration with a java project we will be creating that derives a similarity measure from the co-occurence matrix. ) > Provide a Java API to SimilarityAnalysis and any other needed APIs > --- > > Key: MAHOUT-1940 > URL: https://issues.apache.org/jira/browse/MAHOUT-1940 > Project: Mahout > Issue Type: New Feature > Components: Algorithms, cooccurrence >Reporter: James Mackey > > We want to port the functionality from > org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy > integration with a java project we will be creating that derives a similarity > measure from the co-occurrence and cross-occurrence matrix. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (MAHOUT-1940) Provide a Java API to SimilarityAnalysis and any other needed APIs
[ https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1940: --- Summary: Provide a Java API to SimilarityAnalysis and any other needed APIs (was: Implementing similarity analysis using co-occurence matrix in java) > Provide a Java API to SimilarityAnalysis and any other needed APIs > --- > > Key: MAHOUT-1940 > URL: https://issues.apache.org/jira/browse/MAHOUT-1940 > Project: Mahout > Issue Type: New Feature > Components: Algorithms, cooccurrence >Reporter: James Mackey > > We want to port the functionality from > org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy > integration with a java project we will be creating that derives a similarity > measure from the co-occurence matrix. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (MAHOUT-1940) Implementing similarity analysis using co-occurence matrix in java
[ https://issues.apache.org/jira/browse/MAHOUT-1940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15862862#comment-15862862 ] Pat Ferrel commented on MAHOUT-1940: This would be awesome! Let me know if you need help. There are some things that are no longer required; I just duplicated some methods to maintain backward compatibility while adding new features. I also implemented some new helper object `apply` functions, which are alternative constructors, outside of Mahout in the PredictionIO Universal Recommender Template; these will ship when 0.5.1 of the Template is released, concurrent with PIO 0.11.0 and Mahout 0.13.0. The ones in the Template code are all you will need for porting the Template to Java. To make SimilarityAnalysis complete and accepted into Mahout you'd probably need to port all of the SimilarityAnalysis class and IndexedDatasetSpark. > Implementing similarity analysis using co-occurence matrix in java > -- > > Key: MAHOUT-1940 > URL: https://issues.apache.org/jira/browse/MAHOUT-1940 > Project: Mahout > Issue Type: New Feature > Components: Algorithms, cooccurrence >Reporter: James Mackey > > We want to port the functionality from > org.apache.mahout.math.cf.SimilarityAnalysis.scala to java for easy > integration with a java project we will be creating that derives a similarity > measure from the co-occurence matrix. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
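For context on the port, the similarity score at the heart of the cooccurrence analysis is Dunning's log-likelihood ratio. A minimal self-contained Java sketch of that computation (modeled on Mahout's LogLikelihood.logLikelihoodRatio, but not the Mahout class itself):

```java
/** Sketch of Dunning's log-likelihood ratio, the score cooccurrence
 *  analysis assigns to an item pair from a 2x2 contingency table. */
class LlrSketch {
    static double xLogX(long x) { return x == 0 ? 0.0 : x * Math.log(x); }

    /** Shannon-style entropy term over raw counts: xLogX(sum) - sum of xLogX. */
    static double entropy(long... counts) {
        long sum = 0;
        double logs = 0.0;
        for (long c : counts) { sum += c; logs += xLogX(c); }
        return xLogX(sum) - logs;
    }

    /** k11: both events co-occur, k12/k21: exactly one occurs, k22: neither. */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        if (rowEntropy + columnEntropy < matrixEntropy) {
            return 0.0; // guard against round-off producing a tiny negative
        }
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }
}
```

Independent counts (e.g. 1,1,1,1) score 0, while strongly associated counts (e.g. 10,0,0,10) score high, which is what makes the measure useful for CCO.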
[jira] [Commented] (MAHOUT-1904) Create a test harness to test mahout across different hardware configurations
[ https://issues.apache.org/jira/browse/MAHOUT-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822907#comment-15822907 ] Pat Ferrel commented on MAHOUT-1904: Did you have in mind a CLI tool or a unit test? I assume the former, since this should be runnable on various clusters and configs? Is this meant to be a benchmark? Seems like maybe an example rather than a CLI in mahout itself. Not sure I have enough time for 0.13.0, and no, I don't have a good test harness. If we move this out to the next release I'd be interested in doing it, since recently I've become more interested in performance. [~Andrew_Palumbo] could you supply some examples? We have at least one medium-sized dataset of 2 matrices (the dating site, can't recall the name) for the larger tests, but still they are small compared to real-world data. > Create a test harness to test mahout across different hardware configurations > - > > Key: MAHOUT-1904 > URL: https://issues.apache.org/jira/browse/MAHOUT-1904 > Project: Mahout > Issue Type: Task >Affects Versions: 0.14.0 >Reporter: Andrew Palumbo > Labels: test > Fix For: 0.13.0 > > > Create a set of simple scala programs to be run as a test harness for Linux > amd/intel, mac, and avx2(default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
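The kind of harness discussed above could start as little more than a timing helper wrapped around a math kernel; a sketch of that shape (names and structure are illustrative, not an actual Mahout harness):

```java
/** Minimal timing-harness sketch for comparing a kernel across hardware
 *  configs: warm up once, repeat, report the best wall time. */
class BenchSketch {
    /** Runs body `reps` times after a warmup pass; returns best time in ns.
     *  Best-of-N damps scheduler and GC noise on a shared machine. */
    static long bestTimeNanos(int reps, Runnable body) {
        body.run(); // warmup so JIT compilation is not measured
        long best = Long.MAX_VALUE;
        for (int i = 0; i < reps; i++) {
            long t0 = System.nanoTime();
            body.run();
            best = Math.min(best, System.nanoTime() - t0);
        }
        return best;
    }

    /** Example kernel: dense matrix-vector multiply. */
    static double[] gemv(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                out[i] += m[i][j] * v[j];
        return out;
    }
}
```

A CLI wrapper would then just pick kernel and size from args and print the timings per platform.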
[jira] [Updated] (MAHOUT-1904) Create a test harness to test mahout across different hardware configurations
[ https://issues.apache.org/jira/browse/MAHOUT-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1904: --- Affects Version/s: (was: 0.12.2) 0.14.0 > Create a test harness to test mahout across different hardware configurations > - > > Key: MAHOUT-1904 > URL: https://issues.apache.org/jira/browse/MAHOUT-1904 > Project: Mahout > Issue Type: Task >Affects Versions: 0.14.0 >Reporter: Andrew Palumbo > Labels: test > Fix For: 0.13.0 > > > Creat a set of simple scala programs to be run as a test harness for Linux > amd/intel, mac, and avx2(default). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1786) Make classes implements Serializable for Spark 1.5+
[ https://issues.apache.org/jira/browse/MAHOUT-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1786: --- Assignee: Andrew Palumbo (was: Pat Ferrel) Hmm, removing Kryo altogether is probably a good idea; I have never touched this code and do not maintain classes that need this. All my classes either use data that is in the above types or base Scala types that are serializable. I'm sending this back to [~Andrew_Palumbo] for reassignment or further discussion. If the new serializer is better than Kryo, by all means let's move there ASAP. > Make classes implements Serializable for Spark 1.5+ > --- > > Key: MAHOUT-1786 > URL: https://issues.apache.org/jira/browse/MAHOUT-1786 > Project: Mahout > Issue Type: Improvement > Components: Math >Affects Versions: 0.11.0 >Reporter: Michel Lemay >Assignee: Andrew Palumbo >Priority: Blocker > Labels: performance > Fix For: 0.13.0 > > > Spark 1.5 comes with a new very efficient serializer that uses code > generation. It is twice as fast as kryo. When using mahout, we have to set > KryoSerializer because some classes aren't serializable otherwise. > I suggest to declare Math classes as "implements Serializable" where needed. > For instance, to use coocurence package in spark 1.5, we had to modify > AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it > work without Kryo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1882) SequentialAccessSparseVector iterateNonZero is incorrect.
[ https://issues.apache.org/jira/browse/MAHOUT-1882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812561#comment-15812561 ] Pat Ferrel commented on MAHOUT-1882: Can't see that I use this, at least not obviously, unless it is hidden in another call. Can you try removing the method and see who complains? > SequentialAccessSparseVector iterateNonZero is incorrect. > -- > > Key: MAHOUT-1882 > URL: https://issues.apache.org/jira/browse/MAHOUT-1882 > Project: Mahout > Issue Type: Bug > Components: Math >Affects Versions: 0.12.2 >Reporter: Andrew Palumbo >Assignee: Suneel Marthi >Priority: Critical > Fix For: 0.13.0 > > > In {{SequentialAccessSparseVector}} a bug is noted. When counting non-zero > elements, {{NonDefaultIterator}} can, under certain circumstances, give an > iterator whose size differs from the actual non-zero count. > {code} > @Override > public Iterator iterateNonZero() { > // TODO: this is a bug, since nonDefaultIterator doesn't hold to non-zero > contract. > return new NonDefaultIterator(); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
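The non-zero contract can be illustrated with a minimal Python sketch (illustrative only, not Mahout code; the names are hypothetical): a sequential sparse vector may hold a "stored" entry whose value is zero, and an iterator honoring the contract must skip such entries rather than yield everything stored.

```python
def iterate_non_zero(indices, values):
    """Yield (index, value) pairs, skipping explicitly stored zeros.

    A sparse vector can hold stored entries whose value is zero (e.g.
    after setting an element to 0.0). An iterator honoring the non-zero
    contract must skip them instead of yielding every stored entry,
    which is the behavior the bug report describes.
    """
    for i, v in zip(indices, values):
        if v != 0.0:
            yield i, v

# A vector with a stored zero at index 3:
entries = list(iterate_non_zero([0, 3, 7], [1.5, 0.0, 2.0]))
print(entries)  # -> [(0, 1.5), (7, 2.0)]
```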
[jira] [Commented] (MAHOUT-1786) Make classes implements Serializable for Spark 1.5+
[ https://issues.apache.org/jira/browse/MAHOUT-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15761631#comment-15761631 ] Pat Ferrel commented on MAHOUT-1786: It sounds like we could remove Kryo altogether and improve performance by using the new Spark serializer. It also sounds like this uses the more standard approach of extending Serializable, which is built into many Scala classes IIRC. Removing Kryo with a performance gain seems a big win. Kryo causes many config problems for new users. > Make classes implements Serializable for Spark 1.5+ > --- > > Key: MAHOUT-1786 > URL: https://issues.apache.org/jira/browse/MAHOUT-1786 > Project: Mahout > Issue Type: Improvement > Components: Math >Affects Versions: 0.11.0 >Reporter: Michel Lemay >Priority: Minor > Labels: performance > > Spark 1.5 comes with a new very efficient serializer that uses code > generation. It is twice as fast as kryo. When using mahout, we have to set > KryoSerializer because some classes aren't serializable otherwise. > I suggest to declare Math classes as "implements Serializable" where needed. > For instance, to use coocurence package in spark 1.5, we had to modify > AbstractMatrix, AbstractVector, DenseVector and SparseRowMatrix to make it > work without Kryo. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1853. Resolution: Fixed > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1883) Create a type if IndexedDataset that filters unneeded data for CCO
[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1883. Resolution: Fixed Hmm, I thought these were auto-resolved with a commit that contains the issue name? Maybe I had a senior moment there :-) > Create a type if IndexedDataset that filters unneeded data for CCO > -- > > Key: MAHOUT-1883 > URL: https://issues.apache.org/jira/browse/MAHOUT-1883 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering >Affects Versions: 0.13.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > The collaborative filtering CCO algo uses drms for each "indicator" type. The > input must have the same set of user-id and so the row rank for all input > matrices must be the same. > In the past we have padded the row-id dictionary to include new rows only in > secondary matrices. This can lead to very large amounts of data processed in > the CCO pipeline that does not affect the results. Put another way if the row > doesn't exist in the primary matrix, there will be no cross-occurrence in the > other calculated cooccurrences matrix. > if we are calculating P'P and P'S, S will not need rows that don't exist in P > so this Jira is to create an IndexedDataset companion object that takes an > RDD[(String, String)] of interactions but that uses the dictionary from P for > row-ids and filters out all data that doesn't correspond to P. The companion > object will create the row-ids dictionary if it is not passed in, and use it > to filter if it is passed in. > We have seen data that can be reduced by many orders of magnitude using this > technique. This could be handled outside of Mahout but always produces better > performance and so this version of data-prep seems worth including. > It does not affect the CLI version yet but could be included there in a > future Jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
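The proposed filtering behavior can be sketched in plain Python (lists stand in for RDD[(String, String)]; the function name and shapes are hypothetical, not the actual Mahout API): build the row-id dictionary from the primary matrix P when none is given, and when P's dictionary is passed in, drop every interaction whose row-id never occurs in P.

```python
def build_indexed_dataset(interactions, row_dict=None):
    """Sketch of the proposed companion-object behavior.

    interactions: (row-id, item-id) pairs, standing in for an
    RDD[(String, String)].
    row_dict: optional row-id dictionary. If None, build it from the
    data; if given (e.g. the dictionary of the primary matrix P),
    filter out interactions whose row-id is not in it, since those
    rows can never produce a cross-occurrence with P.
    """
    if row_dict is None:
        row_dict = {row for row, _ in interactions}
        kept = list(interactions)
    else:
        kept = [(r, c) for r, c in interactions if r in row_dict]
    return kept, row_dict

primary = [("u1", "p1"), ("u2", "p2")]
secondary = [("u1", "s1"), ("u3", "s9")]  # u3 never appears in P

_, p_dict = build_indexed_dataset(primary)          # builds P's dictionary
kept, _ = build_indexed_dataset(secondary, p_dict)  # filters S against it
print(kept)  # -> [('u1', 's1')]
```

This is where the "many orders of magnitude" reduction comes from: every secondary interaction from a user outside P is dead weight in the P'S computation.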
[jira] [Updated] (MAHOUT-1883) Create a type if IndexedDataset that filters unneeded data for CCO
[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1883: --- Issue Type: New Feature (was: Bug) > Create a type if IndexedDataset that filters unneeded data for CCO > -- > > Key: MAHOUT-1883 > URL: https://issues.apache.org/jira/browse/MAHOUT-1883 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering >Affects Versions: 0.13.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > The collaborative filtering CCO algo uses drms for each "indicator" type. The > input must have the same set of user-id and so the row rank for all input > matrices must be the same. > In the past we have padded the row-id dictionary to include new rows only in > secondary matrices. This can lead to very large amounts of data processed in > the CCO pipeline that does not affect the results. Put another way if the row > doesn't exist in the primary matrix, there will be no cross-occurrence in the > other calculated cooccurrences matrix. > if we are calculating P'P and P'S, S will not need rows that don't exist in P > so this Jira is to create an IndexedDataset companion object that takes an > RDD[(String, String)] of interactions but that uses the dictionary from P for > row-ids and filters out all data that doesn't correspond to P. The companion > object will create the row-ids dictionary if it is not passed in, and use it > to filter if it is passed in. > We have seen data that can be reduced by many orders of magnitude using this > technique. This could be handled outside of Mahout but always produces better > performance and so this version of data-prep seems worth including. > It does not affect the CLI version yet but could be included there in a > future Jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1883) Create a type if IndexedDataset that filters unneeded data for CCO
[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1883: --- Sprint: Jan/Feb-2016 > Create a type if IndexedDataset that filters unneeded data for CCO > -- > > Key: MAHOUT-1883 > URL: https://issues.apache.org/jira/browse/MAHOUT-1883 > Project: Mahout > Issue Type: Bug > Components: Collaborative Filtering >Affects Versions: 0.13.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > The collaborative filtering CCO algo uses drms for each "indicator" type. The > input must have the same set of user-id and so the row rank for all input > matrices must be the same. > In the past we have padded the row-id dictionary to include new rows only in > secondary matrices. This can lead to very large amounts of data processed in > the CCO pipeline that does not affect the results. Put another way if the row > doesn't exist in the primary matrix, there will be no cross-occurrence in the > other calculated cooccurrences matrix > if we are calculating P'P and P'S, S will not need rows that don't exist in P > so this Jira is to create an IndexedDataset companion object that takes an > RDD[(String, String)] of interactions but that uses the dictionary from P for > row-ids and filters out all data that doesn't correspond to P. The companion > object will create the row-ids dictionary if it is not passed in, and use it > to filter if it is passed in. > We have seen data that can be reduced by many orders of magnitude using this > technique. This could be handled outside of Mahout but always produces better > performance and so this version of data-prep seems worth including. > It does not effect the CLI version yet but could be included there in a > future Jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1883) Create a type if IndexedDataset that filters unneeded data for CCO
[ https://issues.apache.org/jira/browse/MAHOUT-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1883: --- Description: The collaborative filtering CCO algo uses drms for each "indicator" type. The input must have the same set of user-id and so the row rank for all input matrices must be the same. In the past we have padded the row-id dictionary to include new rows only in secondary matrices. This can lead to very large amounts of data processed in the CCO pipeline that does not affect the results. Put another way if the row doesn't exist in the primary matrix, there will be no cross-occurrence in the other calculated cooccurrences matrix. if we are calculating P'P and P'S, S will not need rows that don't exist in P so this Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions but that uses the dictionary from P for row-ids and filters out all data that doesn't correspond to P. The companion object will create the row-ids dictionary if it is not passed in, and use it to filter if it is passed in. We have seen data that can be reduced by many orders of magnitude using this technique. This could be handled outside of Mahout but always produces better performance and so this version of data-prep seems worth including. It does not affect the CLI version yet but could be included there in a future Jira. was: The collaborative filtering CCO algo uses drms for each "indicator" type. The input must have the same set of user-id and so the row rank for all input matrices must be the same. In the past we have padded the row-id dictionary to include new rows only in secondary matrices. This can lead to very large amounts of data processed in the CCO pipeline that does not affect the results. 
Put another way if the row doesn't exist in the primary matrix, there will be no cross-occurrence in the other calculated cooccurrences matrix if we are calculating P'P and P'S, S will not need rows that don't exist in P so this Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions but that uses the dictionary from P for row-ids and filters out all data that doesn't correspond to P. The companion object will create the row-ids dictionary if it is not passed in, and use it to filter if it is passed in. We have seen data that can be reduced by many orders of magnitude using this technique. This could be handled outside of Mahout but always produces better performance and so this version of data-prep seems worth including. It does not effect the CLI version yet but could be included there in a future Jira. > Create a type if IndexedDataset that filters unneeded data for CCO > -- > > Key: MAHOUT-1883 > URL: https://issues.apache.org/jira/browse/MAHOUT-1883 > Project: Mahout > Issue Type: Bug > Components: Collaborative Filtering >Affects Versions: 0.13.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > The collaborative filtering CCO algo uses drms for each "indicator" type. The > input must have the same set of user-id and so the row rank for all input > matrices must be the same. > In the past we have padded the row-id dictionary to include new rows only in > secondary matrices. This can lead to very large amounts of data processed in > the CCO pipeline that does not affect the results. Put another way if the row > doesn't exist in the primary matrix, there will be no cross-occurrence in the > other calculated cooccurrences matrix. 
> if we are calculating P'P and P'S, S will not need rows that don't exist in P > so this Jira is to create an IndexedDataset companion object that takes an > RDD[(String, String)] of interactions but that uses the dictionary from P for > row-ids and filters out all data that doesn't correspond to P. The companion > object will create the row-ids dictionary if it is not passed in, and use it > to filter if it is passed in. > We have seen data that can be reduced by many orders of magnitude using this > technique. This could be handled outside of Mahout but always produces better > performance and so this version of data-prep seems worth including. > It does not affect the CLI version yet but could be included there in a > future Jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1883) Create a type if IndexedDataset that filters unneeded data for CCO
Pat Ferrel created MAHOUT-1883: -- Summary: Create a type if IndexedDataset that filters unneeded data for CCO Key: MAHOUT-1883 URL: https://issues.apache.org/jira/browse/MAHOUT-1883 Project: Mahout Issue Type: Bug Components: Collaborative Filtering Affects Versions: 0.13.0 Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 0.13.0 The collaborative filtering CCO algo uses drms for each "indicator" type. The input must have the same set of user-id and so the row rank for all input matrices must be the same. In the past we have padded the row-id dictionary to include new rows only in secondary matrices. This can lead to very large amounts of data processed in the CCO pipeline that does not affect the results. Put another way if the row doesn't exist in the primary matrix, there will be no cross-occurrence in the other calculated cooccurrences matrix if we are calculating P'P and P'S, S will not need rows that don't exist in P so this Jira is to create an IndexedDataset companion object that takes an RDD[(String, String)] of interactions but that uses the dictionary from P for row-ids and filters out all data that doesn't correspond to P. The companion object will create the row-ids dictionary if it is not passed in, and use it to filter if it is passed in. We have seen data that can be reduced by many orders of magnitude using this technique. This could be handled outside of Mahout but always produces better performance and so this version of data-prep seems worth including. It does not effect the CLI version yet but could be included there in a future Jira. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1878) implement quartile type thresholds for indicator matrix downsampling
[ https://issues.apache.org/jira/browse/MAHOUT-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429553#comment-15429553 ] Pat Ferrel commented on MAHOUT-1878: see discussion here https://issues.apache.org/jira/browse/MAHOUT-1853 > implement quartile type thresholds for indicator matrix downsampling > > > Key: MAHOUT-1878 > URL: https://issues.apache.org/jira/browse/MAHOUT-1878 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering, cooccurrence >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 1.0.0 > > > https://issues.apache.org/jira/browse/MAHOUT-1853 > second half of the above, see discussion of downsampling by fraction of > matrix retained, perhaps using t-digest. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local
[ https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1679: --- Comment: was deleted (was: see discussion https://issues.apache.org/jira/browse/MAHOUT-1853) > example script run-item-sim should work on hdfs as well as local > > > Key: MAHOUT-1679 > URL: https://issues.apache.org/jira/browse/MAHOUT-1679 > Project: Mahout > Issue Type: Improvement > Components: Examples >Affects Versions: 0.10.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Minor > Fix For: 1.0.0 > > > mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster > Spark + HDFS > It prints a warning and how to run in cluster but should just work in either > mode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local
[ https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429552#comment-15429552 ] Pat Ferrel commented on MAHOUT-1679: see discussion https://issues.apache.org/jira/browse/MAHOUT-1853 > example script run-item-sim should work on hdfs as well as local > > > Key: MAHOUT-1679 > URL: https://issues.apache.org/jira/browse/MAHOUT-1679 > Project: Mahout > Issue Type: Improvement > Components: Examples >Affects Versions: 0.10.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Minor > Fix For: 1.0.0 > > > mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster > Spark + HDFS > It prints a warning and how to run in cluster but should just work in either > mode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1878) implement quartile type thresholds for indicator matrix downsampling
Pat Ferrel created MAHOUT-1878: -- Summary: implement quartile type thresholds for indicator matrix downsampling Key: MAHOUT-1878 URL: https://issues.apache.org/jira/browse/MAHOUT-1878 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering, cooccurrence Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 1.0.0 https://issues.apache.org/jira/browse/MAHOUT-1853 second half of the above, see discussion of downsampling by fraction of matrix retained, perhaps using t-digest. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429550#comment-15429550 ] Pat Ferrel commented on MAHOUT-1853: OK, first part implemented. Not sure Ted's suggestion will get into this release, so I'm moving this Jira to not lose his comments. Finished the fixed threshold and number of indicators per item for every pair of matrices. So A'A can have an llr threshold as well as a # per row that is different than A'B and so forth. > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15409595#comment-15409595 ] Pat Ferrel commented on MAHOUT-1853: Great, that's what I wanted to hear. Normal in principle, but something more tolerant to wonky distributions is worth trying, and in this case we'll avoid doing it every time by saving the threshold for future runs. Thanks > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408326#comment-15408326 ] Pat Ferrel commented on MAHOUT-1853: If t-digest is more tolerant of "not having enough data" than fitting the params of a normal dist, then I'll do #1 and #2 now for 0.13. Then for #3 I'll integrate t-digest as a way to calculate the threshold for #2 in the next phase. #3 would be the release after, which would give us time to upgrade t-digest or cut it loose and treat it as a dependency; it's in the Maven repos. > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408256#comment-15408256 ] Pat Ferrel commented on MAHOUT-1853: Is rootLLR normally distributed (the positive half)? If so we'd have to calculate all rootLLR scores and fit the normal params to get the 10% or other adaptive threshold, right? I understand that O(n^2) never occurs in practice. Even for cases where O(k k_max n) is high, intuition would say that this threshold could be calculated once and applied for some time since it will tend to stay the same for any specific type of indicator. Calculating it may be a once-in-a-great-while operation and the threshold would usually be used in #2 above. I'm somewhat ignorant of t-digest other than having read your anomaly detection book. I think it's in Mahout but the docs are here: https://github.com/tdunning/t-digest. I assume that using t-digest would remove the need to do any separate distribution param fitting (as long as we use rootLLR) and could even be applied as online learning producing an adaptive threshold to feed into #2 above? I imagine it can also be applied periodically on P'X in batch. No need to respond if I'm on the right track. > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
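The t-digest idea reduces to estimating a quantile of the rootLLR score distribution, with no distribution-fitting needed. An exact, in-memory Python sketch of that computation (a t-digest would approximate the same quantile online over a stream instead of sorting all scores; the names here are hypothetical):

```python
def adaptive_threshold(llr_scores, keep_fraction=0.10):
    """Return a score threshold that keeps roughly the top
    `keep_fraction` of rootLLR scores: the (1 - keep_fraction)
    quantile of the observed distribution. No assumption about the
    shape of the distribution is needed, which is the appeal of the
    quantile/t-digest approach over fitting normal params.
    """
    s = sorted(llr_scores)
    k = int(len(s) * (1.0 - keep_fraction))
    return s[min(k, len(s) - 1)]

scores = [0.1, 0.5, 1.2, 3.3, 4.8, 9.9, 0.2, 2.1, 7.5, 0.05]
t = adaptive_threshold(scores, keep_fraction=0.2)
kept = [x for x in scores if x >= t]
print(t, kept)  # -> 7.5 [9.9, 7.5]
```

As the comment suggests, this threshold could be computed rarely (or online) and then reused as a fixed per-matrix threshold in approach #2.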
[jira] [Comment Edited] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ] Pat Ferrel edited comment on MAHOUT-1853 at 8/4/16 4:15 PM: To reword this issue... The CCO analysis code currently only employs a single # of values per row of the P’X matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. The problem is that for a user * item input matrix, which becomes an item * item output a fixed # per row is fine but the implementation is a bit meaningless when there are only 20 columns of the X matrix. For instance if X = C category preferences, there may be only 20 possible categories and with a threshold of 100 and the fact that users often have enough usage to trigger preference events on all categories (though resulting in a small LLR value), the P’C matrix is almost completely full. This reduces any value in P’C. There are several ways to address: 1) have a # of indicators per row threshold for every P'X matrix, not one for all (the current impl) 2) use a fixed LLR threshold value per matrix 3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small. 1 and 2 are easy in the extreme, #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout. I've started work on #1 and #2 [~ssc][~tdunning] I'm especially looking for comments on #3 above, calculating a % confidence of correlation. The function we use for LLR scoring is https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/cf/SimilarityAnalysis.scala#L210 was (Author: pferrel): To reword this issue... 
The CCO analysis code currently only employs a single # of values per row of the P’? matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. The problem is that for a user * item input matrix, which becomes an item * item output a fixed # per row is fine but the implementation is a bit meaningless when there are only 20 columns of the ? matrix. For instance if ? = C category preferences, there may be only 20 possible categories and with a threshold of 100 and the fact that users often have enough usage to trigger preference events on all categories (though resulting in a small LLR value), the P’C matrix is almost completely full. This reduces any value in P’C. There are several ways to address: 1) have a # of indicators per row threshold for every matrix, not one for all (the current impl) 2) use a fixed LLR threshold value per matrix 3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small. 1 and 2 are easy in the extreme, #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout. starting work on #1 and #2 > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
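For reference, the LLR scoring discussed above is Dunning's G² statistic over a 2x2 contingency table of co-occurrence counts. A Python sketch of the standard entropy formulation (a transliteration of the well-known algorithm for illustration, not the Mahout source linked above):

```python
import math

def xlogx(x):
    """x * ln(x), with the 0 * ln(0) = 0 convention."""
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized Shannon entropy of a list of counts."""
    return xlogx(sum(counts)) - sum(xlogx(c) for c in counts)

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 for a 2x2 contingency table:
    k11 = both events, k12/k21 = one event only, k22 = neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(2.0 * (row + col - mat), 0.0)

def root_llr(k11, k12, k21, k22):
    """Square root of LLR, negated when co-occurrence is below what
    independence predicts, so scores form a roughly symmetric scale."""
    r = math.sqrt(log_likelihood_ratio(k11, k12, k21, k22))
    if k11 / max(k11 + k12, 1) < k21 / max(k21 + k22, 1):
        r = -r
    return r

print(log_likelihood_ratio(1, 0, 0, 1))    # perfectly correlated pair
print(log_likelihood_ratio(10, 10, 10, 10))  # -> 0.0 (independent)
```

Perfect independence scores exactly zero, which is why a per-matrix threshold on this score (approaches #1/#2 above) is meaningful across indicator types.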
[jira] [Updated] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1853: --- Sprint: Jan/Feb-2016 > Improvements to CCO (Correlated Cross-Occurrence) > - > > Key: MAHOUT-1853 > URL: https://issues.apache.org/jira/browse/MAHOUT-1853 > Project: Mahout > Issue Type: New Feature >Affects Versions: 0.12.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.13.0 > > > Improvements to CCO (Correlated Cross-Occurrence) to include auto-threshold > calculation for LLR downsampling, and possible multiple fixed thresholds for > A’A, A’B etc. This is to account for the vast difference in dimensionality > between indicator types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391126#comment-15391126 ] Pat Ferrel commented on MAHOUT-1853: To reword this issue... The CCO analysis code currently only employs a single # of values per row of the P’? matrices. This has proven an insufficient threshold for many of the possible cross-occurrence types. The problem is that for a user * item input matrix, which becomes an item * item output a fixed # per row is fine but the implementation is a bit meaningless when there are only 20 columns of the ? matrix. For instance if ? = C category preferences, there may be only 20 possible categories and with a threshold of 100 and the fact that users often have enough usage to trigger preference events on all categories (though resulting in a small LLR value), the P’C matrix is almost completely full. This reduces any value in P’C. There are several ways to address: 1) have a # of indicators per row threshold for every matrix, not one for all (the current impl) 2) use a fixed LLR threshold value per matrix 3) use a confidence of correlation value (a % maybe) that is calculated from the data by looking at the distribution in P’C or other. This is potentially O(n^2) where n = number of items in the matrix. This may be practical to calculate for some types of data since n may be very small. 1 and 2 are easy in the extreme, #3 can actually be calculated after the fact and used in #2 even if it is not included in Mahout. 
starting work on #1 and #2
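The three options above can be made concrete. Below is a minimal, illustrative Python sketch (not Mahout's actual Scala implementation; all names are hypothetical) of the LLR score for one cross-occurrence cell plus the two cheap downsampling strategies: #1, a per-matrix top-k-per-row limit, and #2, a per-matrix absolute LLR threshold.

```python
import math

def _entropy_term(*counts):
    """Sum of k*log(k/total) over the nonzero counts."""
    total = sum(counts)
    return sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for one 2x2 cross-occurrence contingency
    table (Dunning's G^2); 0 when rows and columns are independent."""
    return 2.0 * (_entropy_term(k11, k12, k21, k22)
                  - _entropy_term(k11 + k12, k21 + k22)
                  - _entropy_term(k11 + k21, k12 + k22))

def downsample_top_k(matrix, k):
    """Option #1: keep the k highest-LLR entries per row, with k chosen
    per matrix pair instead of one global value.
    matrix: {row_id: {col_id: llr_score}}"""
    return {row: dict(sorted(scores.items(), key=lambda kv: -kv[1])[:k])
            for row, scores in matrix.items()}

def downsample_min_llr(matrix, min_llr):
    """Option #2: keep entries whose LLR meets a fixed per-matrix threshold."""
    return {row: {c: s for c, s in scores.items() if s >= min_llr}
            for row, scores in matrix.items()}
```

For a low-rank pair like P’C, `downsample_min_llr` with a per-matrix cutoff avoids the "row fills up anyway" problem that a global top-k limit causes.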
[jira] [Comment Edited] (MAHOUT-1853) Improvements to CCO (Correlated Cross-Occurrence)
[ https://issues.apache.org/jira/browse/MAHOUT-1853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15302371#comment-15302371 ] Pat Ferrel edited comment on MAHOUT-1853 at 5/26/16 5:04 PM: - Steps: 1) allow an array of absolute LLR value thresholds, one for each matrix pair 2) allow thresholds to be a confidence of correlation (actually confidence that non-correlation is rejected) or a fraction of total cross-occurrences retained after downsampling. To reduce how often this must be done, the absolute value thresholds should be output after calculation for later re-use in #1. #1 is very easy but not all that useful on its own since LLR values vary quite a bit; #1 also retains the O(n) computation complexity. I imagine #1 would be used with #2, since #2 is much more computationally complex and can output thresholds for #1. #2 requires worst-case O(n^2) complexity. Some matrix pairs will have low dimensionality in one direction or both; in fact this low dimensionality is the reason we need a different kind of downsampling for these pairs. Imagine A'A, which is items by items and may be very large but sparse, versus A'B, which may be products by gender, so only 2 columns but much denser. The calculation for #2 would, I believe, require performing the un-downsampled A'A, determining the threshold from the LLR scores, then making another pass to downsample. This adds significant computation time and could make it impractical except for rare re-calculation tasks, in which case the absolute threshold would be recorded and used for subsequent A'A and A'B runs via #1. Since it is likely to be impractical to calculate #2 very often, it may be better done as an analytics job rather than part of the A'A job. For most recommender cases the current downsampling method is fine, but for other uses of the CCO algorithm #2 may be required for occasional threshold re-calc. In some sense we won't know until we try. 
Any comments from [~tdunning] or [~dlyubimov] would be welcome
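The expensive part of #2 is choosing the absolute cutoff from the data. A hedged Python sketch (hypothetical helper, not part of Mahout) that converts a fraction-retained target into an absolute LLR threshold, which can then be recorded and reused cheaply as a #1-style fixed cutoff on later runs:

```python
def llr_cutoff_for_fraction(scores, fraction_retained):
    """Given all LLR scores from an un-downsampled A'B pass, return the
    absolute cutoff that retains roughly fraction_retained of them.
    Sorting is O(n log n) in the number of scores; the un-downsampled
    A'B pass itself is the worst-case O(n^2) part of the job."""
    if not scores:
        return 0.0
    ranked = sorted(scores, reverse=True)
    keep = max(1, int(len(ranked) * fraction_retained))
    return ranked[keep - 1]
```

Running this occasionally as a separate analytics job, then feeding the resulting cutoff into the per-matrix fixed-threshold path, matches the split proposed in the comment above.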
[jira] [Assigned] (MAHOUT-1766) Increase default PermGen size for spark-shell
[ https://issues.apache.org/jira/browse/MAHOUT-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel reassigned MAHOUT-1766: -- Assignee: Andrew Palumbo (was: Pat Ferrel) I don't use the shell much, is this legit Andy? > Increase default PermGen size for spark-shell > - > > Key: MAHOUT-1766 > URL: https://issues.apache.org/jira/browse/MAHOUT-1766 > Project: Mahout > Issue Type: Improvement > Components: Mahout spark shell >Affects Versions: 0.11.0 >Reporter: Sergey Tryuber >Assignee: Andrew Palumbo > Fix For: 0.12.0 > > > The Mahout spark-shell runs with the default PermGen size (64MB). Given that it depends on lots of > external jars and the total number of Java classes used is very large, we constantly observe > spontaneous OOM exceptions. > A hot fix from our side is to modify the envelope bash script (added > -XX:PermSize=512m): > {code} > "$JAVA" $JAVA_HEAP_MAX -XX:PermSize=512m $MAHOUT_OPTS -classpath "$CLASSPATH" > "org.apache.mahout.sparkbindings.shell.Main" $@ > {code} > Of course, a more elegant solution is needed. After the applied fix, the errors > were gone. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib
[ https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1689. Resolution: Fixed > Create a doc on how to write an app that uses Mahout as a lib > - > > Key: MAHOUT-1689 > URL: https://issues.apache.org/jira/browse/MAHOUT-1689 > Project: Mahout > Issue Type: Documentation >Affects Versions: 0.10.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 1.0.0 > > > Create a doc on how to write an app that uses Mahout as a lib -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1799) Read null row vectors from file in TextDelimeterReaderWriter driver
[ https://issues.apache.org/jira/browse/MAHOUT-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1799: --- Fix Version/s: (was: 0.12.0) 1.0.0 > Read null row vectors from file in TextDelimeterReaderWriter driver > --- > > Key: MAHOUT-1799 > URL: https://issues.apache.org/jira/browse/MAHOUT-1799 > Project: Mahout > Issue Type: Improvement > Components: spark >Reporter: Jussi Jousimo >Assignee: Pat Ferrel >Priority: Minor > Fix For: 1.0.0 > > > Since some row vectors in a sparse matrix can be null, Mahout writes them out > to a file with the row label only. However, Mahout cannot read these files > back; it throws an exception when it encounters a label-only row. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
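The fix amounts to tolerating label-only lines on read. A minimal Python sketch of the idea; the row format and function name here are illustrative, not the exact TextDelimeterReaderWriter schema:

```python
def parse_row(line, delimiter="\t"):
    """Parse one text-delimited row of the form 'label<TAB>col:val,col:val'.
    A label-only line yields an empty (null) row vector instead of raising,
    which is what the reader currently fails to do."""
    parts = line.rstrip("\n").split(delimiter, 1)
    label = parts[0]
    if len(parts) == 1 or not parts[1].strip():
        return label, {}  # null row vector: label was written with no values
    vector = {}
    for pair in parts[1].split(","):
        col, val = pair.split(":")
        vector[col] = float(val)
    return label, vector
```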
[jira] [Updated] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local
[ https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1679: --- Fix Version/s: (was: 0.12.0) 1.0.0 Issue Type: Improvement (was: Bug) > example script run-item-sim should work on hdfs as well as local > > > Key: MAHOUT-1679 > URL: https://issues.apache.org/jira/browse/MAHOUT-1679 > Project: Mahout > Issue Type: Improvement > Components: Examples >Affects Versions: 0.10.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Minor > Fix For: 1.0.0 > > > mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster > Spark + HDFS > It prints a warning and how to run in cluster but should just work in either > mode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
[ https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199694#comment-15199694 ] Pat Ferrel commented on MAHOUT-1762: Do you know of something that is blocked by this? Not sure what is being asked for. > Pick up $SPARK_HOME/conf/spark-defaults.conf on startup > --- > > Key: MAHOUT-1762 > URL: https://issues.apache.org/jira/browse/MAHOUT-1762 > Project: Mahout > Issue Type: Improvement > Components: spark >Reporter: Sergey Tryuber >Assignee: Pat Ferrel > Fix For: 1.0.0 > > > [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties] > is meant to contain the global configuration for a Spark cluster. For example, in > our HDP2.2 environment it contains: > {noformat} > spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0–2041 > spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0–2041 > {noformat} > and many other good things. A user expects that when they start a Spark shell, it will just > work. Unfortunately this does not happen with the Mahout Spark shell, because it ignores the Spark configuration and the > user has to copy-paste lots of options into _MAHOUT_OPTS_. 
> This happens because > [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala] > is executed directly in the [initialization > script|https://github.com/apache/mahout/blob/master/bin/mahout]: > {code} > "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" > "org.apache.mahout.sparkbindings.shell.Main" $@ > {code} > In contrast, the Spark shell is invoked indirectly through spark-submit in the > [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] > script: > {code} > "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@" > {code} > [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala] > contains an additional initialization layer for loading the properties file (see > the SparkSubmitArguments#mergeDefaultSparkProperties method). > So there are two possible solutions: > * use proper Spark-like initialization logic > * use a thin envelope like the one in H2O Sparkling Water > ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
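The missing merge step is small. A Python sketch of what SparkSubmit's mergeDefaultSparkProperties effectively does and the Mahout launcher skips; parser details here are assumptions, based on spark-defaults.conf using whitespace-separated key/value lines with # comments:

```python
def load_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict,
    skipping blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # key, then the rest of the line as value
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf

def merge_conf(defaults, user_opts):
    """Explicitly supplied options win over spark-defaults.conf values."""
    merged = dict(defaults)
    merged.update(user_opts)
    return merged
```

A launcher that called something like this before building its SparkConf would pick up the HDP options above without the user copy-pasting them into MAHOUT_OPTS.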
[jira] [Commented] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
[ https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199672#comment-15199672 ] Pat Ferrel commented on MAHOUT-1762: I agree with the reasoning behind this, but the drivers have a pass-through to Spark for arbitrary key=value pairs, and switching to spark-submit was voted down, so it was never done. If you are using Mahout as a lib you can set anything you want in the SparkConf, so I'm not sure what remains here beyond a more than reasonable complaint about how the launcher scripts are structured.
[jira] [Commented] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib
[ https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199754#comment-15199754 ] Pat Ferrel commented on MAHOUT-1689: Done, many times over by several people. Mine is here: http://mahout.apache.org/users/environment/how-to-build-an-app.html > Create a doc on how to write an app that uses Mahout as a lib > - > > Key: MAHOUT-1689 > URL: https://issues.apache.org/jira/browse/MAHOUT-1689 > Project: Mahout > Issue Type: Documentation >Affects Versions: 0.10.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 1.0.0 > > > Create a doc on how to write an app that uses Mahout as a lib -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup
[ https://issues.apache.org/jira/browse/MAHOUT-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1788: --- Fix Version/s: (was: 0.12.0) 1.0.0 Issue Type: Improvement (was: Bug) work on this as time is available, not blocking anything IMO > spark-itemsimilarity integration test script cleanup > > > Key: MAHOUT-1788 > URL: https://issues.apache.org/jira/browse/MAHOUT-1788 > Project: Mahout > Issue Type: Improvement > Components: cooccurrence >Affects Versions: 0.11.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Trivial > Fix For: 1.0.0 > > > The binary release does not contain data for the itemsimilarity tests; neither binary > nor source versions will run on a cluster unless data is hand-copied to hdfs. > Clean this up so it copies data if needed and the data is in both versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1799) Read null row vectors from file in TextDelimeterReaderWriter driver
[ https://issues.apache.org/jira/browse/MAHOUT-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199770#comment-15199770 ] Pat Ferrel commented on MAHOUT-1799: Can't test this or even merge it right now, so if someone else can merge, great; otherwise it doesn't seem like a requirement for release, so unless someone speaks up I'll push it to 1.0 > Read null row vectors from file in TextDelimeterReaderWriter driver > --- > > Key: MAHOUT-1799 > URL: https://issues.apache.org/jira/browse/MAHOUT-1799 > Project: Mahout > Issue Type: Improvement > Components: spark >Reporter: Jussi Jousimo >Assignee: Pat Ferrel >Priority: Minor > Fix For: 1.0.0 > > > Since some row vectors in a sparse matrix can be null, Mahout writes them out > to a file with the row label only. However, Mahout cannot read these files > back; it throws an exception when it encounters a label-only row. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
[ https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1762. Resolution: Won't Fix We don't know of anything this blocks, and moving to spark-submit was voted down (it would only apply to the Mahout CLI drivers anyway). All CLI drivers support pass-through of arbitrary key=value pairs, which go into the SparkConf, and when using Mahout as a lib you can create any SparkConf you like. Will not fix unless someone can explain the need.
[jira] [Commented] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local
[ https://issues.apache.org/jira/browse/MAHOUT-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15199746#comment-15199746 ] Pat Ferrel commented on MAHOUT-1679: this is just a test script that doesn't account for using HDFS and expects localfs, so not important. > example script run-item-sim should work on hdfs as well as local > > > Key: MAHOUT-1679 > URL: https://issues.apache.org/jira/browse/MAHOUT-1679 > Project: Mahout > Issue Type: Bug > Components: Examples >Affects Versions: 0.10.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Minor > Fix For: 1.0.0 > > > mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster > Spark + HDFS > It prints a warning and how to run on a cluster, but should just work in either > mode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1788) spark-itemsimilarity integration test script cleanup
Pat Ferrel created MAHOUT-1788: -- Summary: spark-itemsimilarity integration test script cleanup Key: MAHOUT-1788 URL: https://issues.apache.org/jira/browse/MAHOUT-1788 Project: Mahout Issue Type: Bug Components: cooccurrence Affects Versions: 0.11.0 Reporter: Pat Ferrel Assignee: Pat Ferrel Priority: Trivial Fix For: 0.12.0 The binary release does not contain data for the itemsimilarity tests; neither binary nor source versions will run on a cluster unless data is hand-copied to hdfs. Clean this up so it copies data if needed and the data is in both versions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1785) Replace 'spark.kryoserializer.buffer.mb' from Spark config
[ https://issues.apache.org/jira/browse/MAHOUT-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994178#comment-14994178 ] Pat Ferrel commented on MAHOUT-1785: This happens because Spark changed the way this conf param is used. The deprecation warning seems to apply up through Spark 1.6-SNAPSHOT, so as long as the old key still works this is not a blocker, and we have to keep it for Spark 1.4.1 or below. We can't do anything about this until we require Spark 1.5.1, which Mahout 0.11.1 does not, so defer this. > Replace 'spark.kryoserializer.buffer.mb' from Spark config > -- > > Key: MAHOUT-1785 > URL: https://issues.apache.org/jira/browse/MAHOUT-1785 > Project: Mahout > Issue Type: Improvement > Components: Mahout spark shell >Affects Versions: 0.11.0 >Reporter: Suneel Marthi >Assignee: Suneel Marthi >Priority: Trivial > Fix For: 0.12.0 > > > 'spark.kryoserializer.buffer.mb' has been deprecated as of spark 1.4 and > should be replaced by 'spark.kryoserializer.buffer' -- This message was sent by Atlassian JIRA (v6.3.4#6332)
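The rename itself is mechanical. A sketch, assuming (per the Spark configuration docs) that the new spark.kryoserializer.buffer key takes a size string like "32m" where the old .mb key took a bare megabyte count:

```python
# Deprecated Spark conf keys mapped to their replacements; the old .mb key
# took a bare number of megabytes, the new key takes a size suffix.
DEPRECATED = {"spark.kryoserializer.buffer.mb": "spark.kryoserializer.buffer"}

def migrate_conf(conf):
    """Rewrite deprecated Kryo buffer settings to the post-1.4 key,
    converting the bare-MB value to a suffixed size string."""
    out = {}
    for key, value in conf.items():
        if key in DEPRECATED:
            out[DEPRECATED[key]] = value + "m"  # e.g. "32" -> "32m"
        else:
            out[key] = value
    return out
```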
[jira] [Updated] (MAHOUT-1785) Replace 'spark.kryoserializer.buffer.mb' from Spark config
[ https://issues.apache.org/jira/browse/MAHOUT-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1785: --- Fix Version/s: (was: 0.11.1) 0.12.0 > Replace 'spark.kryoserializer.buffer.mb' from Spark config > -- > > Key: MAHOUT-1785 > URL: https://issues.apache.org/jira/browse/MAHOUT-1785 > Project: Mahout > Issue Type: Improvement > Components: Mahout spark shell >Affects Versions: 0.11.0 >Reporter: Suneel Marthi >Assignee: Suneel Marthi >Priority: Trivial > Fix For: 0.12.0 > > > 'spark.kryoserializer.buffer.mb' has been deprecated as of spark 1.4 and > should be replaced by 'spark.kryoserializer.buffer' -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1618. Resolution: Fixed > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.11.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992930#comment-14992930 ] Pat Ferrel commented on MAHOUT-1618: A full-featured "Universal Recommender" using Mahout cooccurrence with multimodality up the yazoo. Apache 2 license. https://github.com/PredictionIO/template-scala-parallel-universal-recommendation Docs are on the main site for the PIO framework and in the README.md for the recommender > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.11.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
[ https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1762: --- Fix Version/s: (was: 0.12.0) 1.0.0 > Pick up $SPARK_HOME/conf/spark-defaults.conf on startup > --- > > Key: MAHOUT-1762 > URL: https://issues.apache.org/jira/browse/MAHOUT-1762 > Project: Mahout > Issue Type: Wish > Components: spark >Reporter: Sergey Tryuber > Fix For: 1.0.0 > > > [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties] > is intended to hold global configuration for the Spark cluster. For example, in > our HDP2.2 environment it contains: > {noformat} > spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 > spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 > {noformat} > and many other useful settings. A user expects that when they > start the Spark shell it will work out of the box. Unfortunately this does not > happen with the Mahout Spark shell, because it ignores the Spark configuration and > the user has to copy-paste many options into _MAHOUT_OPTS_. 
> This happens because > [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala] > is executed directly in the [initialization > script|https://github.com/apache/mahout/blob/master/bin/mahout]: > {code} > "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" > "org.apache.mahout.sparkbindings.shell.Main" $@ > {code} > In contrast, the Spark shell is invoked indirectly through spark-submit in the > [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] > script: > {code} > "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@" > {code} > [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala] > contains an additional initialization layer that loads the properties file (see > the SparkSubmitArguments#mergeDefaultSparkProperties method). > So there are two possible solutions: > * use proper Spark-like initialization logic > * use a thin wrapper script, as H2O Sparkling Water does > ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
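The merge step that the Mahout launcher skips can be sketched in a few lines of plain Scala. This is a hypothetical helper (loadSparkDefaults is not a Mahout or Spark API), assuming spark-defaults.conf's whitespace-separated key/value format, which java.util.Properties can parse:

```scala
import java.io.{File, FileInputStream, InputStreamReader}
import java.util.Properties
import scala.collection.JavaConverters._

// Hypothetical sketch of what SparkSubmitArguments#mergeDefaultSparkProperties
// does: read $SPARK_HOME/conf/spark-defaults.conf and keep only "spark." keys,
// which could then be folded into the SparkConf before the context is created.
def loadSparkDefaults(sparkHome: String): Map[String, String] = {
  val file = new File(sparkHome, "conf/spark-defaults.conf")
  if (!file.isFile) Map.empty
  else {
    val props = new Properties()
    val in = new InputStreamReader(new FileInputStream(file), "UTF-8")
    try props.load(in) finally in.close()
    // Properties treats the first whitespace run as the key/value separator,
    // matching the spark-defaults.conf format; non-"spark." keys are dropped.
    props.asScala.toMap.filter { case (k, _) => k.startsWith("spark.") }
  }
}
```

A launcher using such a helper could apply the resulting map with SparkConf.setAll before building the context, instead of requiring everything in MAHOUT_OPTS.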
[jira] [Updated] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
[ https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1762: --- Fix Version/s: 0.12.0 > Pick up $SPARK_HOME/conf/spark-defaults.conf on startup > --- > > Key: MAHOUT-1762 > URL: https://issues.apache.org/jira/browse/MAHOUT-1762 > Project: Mahout > Issue Type: Wish > Components: spark >Reporter: Sergey Tryuber > Fix For: 0.12.0 > > > [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties] > is intended to hold global configuration for the Spark cluster. For example, in > our HDP2.2 environment it contains: > {noformat} > spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 > spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 > {noformat} > and many other useful settings. A user expects that when they > start the Spark shell it will work out of the box. Unfortunately this does not > happen with the Mahout Spark shell, because it ignores the Spark configuration and > the user has to copy-paste many options into _MAHOUT_OPTS_. 
> This happens because > [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala] > is executed directly in the [initialization > script|https://github.com/apache/mahout/blob/master/bin/mahout]: > {code} > "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" > "org.apache.mahout.sparkbindings.shell.Main" $@ > {code} > In contrast, the Spark shell is invoked indirectly through spark-submit in the > [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] > script: > {code} > "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@" > {code} > [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala] > contains an additional initialization layer that loads the properties file (see > the SparkSubmitArguments#mergeDefaultSparkProperties method). > So there are two possible solutions: > * use proper Spark-like initialization logic > * use a thin wrapper script, as H2O Sparkling Water does > ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1762) Pick up $SPARK_HOME/conf/spark-defaults.conf on startup
[ https://issues.apache.org/jira/browse/MAHOUT-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992093#comment-14992093 ] Pat Ferrel commented on MAHOUT-1762: Very good point. We need to move to spark-submit and away from directly creating the Spark context, IMHO. I'd vote to put reworking the launcher code for the shell and drivers on the roadmap for 0.12.0. > Pick up $SPARK_HOME/conf/spark-defaults.conf on startup > --- > > Key: MAHOUT-1762 > URL: https://issues.apache.org/jira/browse/MAHOUT-1762 > Project: Mahout > Issue Type: Wish > Components: spark >Reporter: Sergey Tryuber > > [spark-defaults.conf|http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties] > is intended to hold global configuration for the Spark cluster. For example, in > our HDP2.2 environment it contains: > {noformat} > spark.driver.extraJavaOptions -Dhdp.version=2.2.0.0-2041 > spark.yarn.am.extraJavaOptions -Dhdp.version=2.2.0.0-2041 > {noformat} > and many other useful settings. A user expects that when they > start the Spark shell it will work out of the box. Unfortunately this does not > happen with the Mahout Spark shell, because it ignores the Spark configuration and > the user has to copy-paste many options into _MAHOUT_OPTS_. 
> This happens because > [org.apache.mahout.sparkbindings.shell.Main|https://github.com/apache/mahout/blob/master/spark-shell/src/main/scala/org/apache/mahout/sparkbindings/shell/Main.scala] > is executed directly in the [initialization > script|https://github.com/apache/mahout/blob/master/bin/mahout]: > {code} > "$JAVA" $JAVA_HEAP_MAX $MAHOUT_OPTS -classpath "$CLASSPATH" > "org.apache.mahout.sparkbindings.shell.Main" $@ > {code} > In contrast, the Spark shell is invoked indirectly through spark-submit in the > [spark-shell|https://github.com/apache/spark/blob/master/bin/spark-shell] > script: > {code} > "$FWDIR"/bin/spark-submit --class org.apache.spark.repl.Main "$@" > {code} > [SparkSubmit|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala] > contains an additional initialization layer that loads the properties file (see > the SparkSubmitArguments#mergeDefaultSparkProperties method). > So there are two possible solutions: > * use proper Spark-like initialization logic > * use a thin wrapper script, as H2O Sparkling Water does > ([sparkling-shell|https://github.com/h2oai/sparkling-water/blob/master/bin/sparkling-shell]) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692591#comment-14692591 ] Pat Ferrel edited comment on MAHOUT-1618 at 8/12/15 12:07 AM: -- Just created a project using PredictionIO's framework, which integrates Spark, HBase, and Elasticsearch. Added Mahout cooccurrence and implemented the rest of the recommender. This is not only an OSS integration example but a running, virtually turnkey recommender. I could update the item and row similarity docs on Mahout a bit and point to the "template" as an example. A new version will be released in a week or so that uses Mahout 0.11.0. was (Author: pferrel): Just created a project using PredictionIO's framework, which integrates Spark, HBase, and Elasticsearch. Added Mahout cooccurrence and implemented the rest of the recommender. This is not only an OSS integration example but a running, virtually turnkey recommender. I could update the item and row similarity docs on Mahout a bit and point to the "template" as an example. > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.11.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692591#comment-14692591 ] Pat Ferrel commented on MAHOUT-1618: Just created a project using PredictionIO's framework, which integrates Spark, HBase, and Elasticsearch. Added Mahout cooccurrence and implemented the rest of the recommender. This is not only an OSS integration example but a running, virtually turnkey recommender. I could update the item and row similarity docs on Mahout a bit and point to the "template" as an example. > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.11.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]
[ https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1641. Resolution: Implemented Fix Version/s: 0.10.1 Actually implemented this before I saw this Jira. > Add conversion from a RDD[(String, String)] to a Drm[Int] > - > > Key: MAHOUT-1641 > URL: https://issues.apache.org/jira/browse/MAHOUT-1641 > Project: Mahout > Issue Type: Question > Components: spark >Affects Versions: 0.9 >Reporter: Erlend Hamnaberg >Assignee: Pat Ferrel > Labels: DSL, scala, spark > Fix For: 0.10.1, 0.11.0 > > > Hi. > We are using the cooccurrence part of Mahout as a library. We get our data > from other sources, like for instance Cassandra. We don't want to write that > data to disk and read it back, since we already have the data on each slave. > I have created some conversion functions based on one of the > IndexedDatasetSpark readers, can't remember which one at the moment. > Is there interest in the community for this kind of feature? I can probably > clean it up and add this as a GitHub pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]
[ https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel reopened MAHOUT-1641: Assignee: Pat Ferrel (was: Dmitriy Lyubimov) > Add conversion from a RDD[(String, String)] to a Drm[Int] > - > > Key: MAHOUT-1641 > URL: https://issues.apache.org/jira/browse/MAHOUT-1641 > Project: Mahout > Issue Type: Question > Components: spark >Affects Versions: 0.9 >Reporter: Erlend Hamnaberg >Assignee: Pat Ferrel > Labels: DSL, scala, spark > Fix For: 0.11.0 > > > Hi. > We are using the cooccurrence part of Mahout as a library. We get our data > from other sources, like for instance Cassandra. We don't want to write that > data to disk and read it back, since we already have the data on each slave. > I have created some conversion functions based on one of the > IndexedDatasetSpark readers, can't remember which one at the moment. > Is there interest in the community for this kind of feature? I can probably > clean it up and add this as a GitHub pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1641) Add conversion from a RDD[(String, String)] to a Drm[Int]
[ https://issues.apache.org/jira/browse/MAHOUT-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14570044#comment-14570044 ] Pat Ferrel commented on MAHOUT-1641: Hmm, didn't see this earlier. There is now a secondary "apply" constructor in the companion object for IndexedDatasetSpark that takes an RDD[(String, String)]. See here: https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/sparkbindings/indexeddataset/IndexedDatasetSpark.scala > Add conversion from a RDD[(String, String)] to a Drm[Int] > - > > Key: MAHOUT-1641 > URL: https://issues.apache.org/jira/browse/MAHOUT-1641 > Project: Mahout > Issue Type: Question > Components: spark >Affects Versions: 0.9 >Reporter: Erlend Hamnaberg >Assignee: Dmitriy Lyubimov > Labels: DSL, scala, spark > Fix For: 0.11.0 > > > Hi. > We are using the cooccurrence part of Mahout as a library. We get our data > from other sources, like for instance Cassandra. We don't want to write that > data to disk and read it back, since we already have the data on each slave. > I have created some conversion functions based on one of the > IndexedDatasetSpark readers, can't remember which one at the moment. > Is there interest in the community for this kind of feature? I can probably > clean it up and add this as a GitHub pull request. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
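For readers landing here, the constructor mentioned can be used roughly as follows. This is a sketch, not verified against a specific Mahout release: it assumes the 0.10.x Spark bindings on the classpath and an implicit SparkContext, and the exact apply signature may differ slightly.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

// Sketch: build an IndexedDatasetSpark from in-memory (row, column) string
// pairs -- e.g. interactions already loaded from Cassandra -- with no disk
// round trip. The companion's secondary apply assigns contiguous Int IDs to
// the String keys and wraps the result as a DRM plus its ID dictionaries.
def fromPairs(pairs: RDD[(String, String)])(implicit sc: SparkContext): IndexedDatasetSpark =
  IndexedDatasetSpark(pairs)
```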
[jira] [Resolved] (MAHOUT-1707) Spark-itemsimilarity uses too much memory
[ https://issues.apache.org/jira/browse/MAHOUT-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1707. Resolution: Fixed removed bad collect. > Spark-itemsimilarity uses too much memory > - > > Key: MAHOUT-1707 > URL: https://issues.apache.org/jira/browse/MAHOUT-1707 > Project: Mahout > Issue Type: Bug > Components: Collaborative Filtering, cooccurrence >Affects Versions: 0.10.0 > Environment: Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 0.10.1 > > > java.lang.OutOfMemoryError: Java heap space > The code has an unnecessary .collect(), forcing all interaction data into > memory of the client/driver. Increasing the executor memory will not help > with this. > remove this line and rebuild Mahout. > https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157 > The errant line reads: > interactions.collect() > This forces the user action data into memory, a bad thing for memory > consumption. Removing it should allow for better Spark memory management. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1708) Replace Google/Guava in mahout-math and mahout-hdfs
[ https://issues.apache.org/jira/browse/MAHOUT-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1708: --- Summary: Replace Google/Guava in mahout-math and mahout-hdfs (was: Replace Preconditions with asserts for Spark code) > Replace Google/Guava in mahout-math and mahout-hdfs > --- > > Key: MAHOUT-1708 > URL: https://issues.apache.org/jira/browse/MAHOUT-1708 > Project: Mahout > Issue Type: Bug > Components: Hdfs, Math >Affects Versions: 0.10.0 > Environment: Spark >Reporter: Pat Ferrel >Assignee: Andrew Musselman > Fix For: 0.10.1 > > > All use of Guava has been removed from the code used with Spark except the > use of Preconditions. These are pretty easy to replace. > 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark > dependency-reduced assembly. > 2) You will now get compile errors for math and hdfs, so remove the imports > and replace the Preconditions with asserts. > Not sure how many errors in replacing these will be caught by unit tests, so > be careful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MAHOUT-1708) Replace Preconditions with asserts for Spark code
[ https://issues.apache.org/jira/browse/MAHOUT-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550671#comment-14550671 ] Pat Ferrel edited comment on MAHOUT-1708 at 5/19/15 4:01 PM: - The AbstractIterator and Map uses from Guava are also problems here (Andrew's comments on IM). [~andrew.musselman] can you create a PR branch so others can help with this? was (Author: pferrel): [~andrew.musselman] can you create a PR branch so others can help with this? > Replace Preconditions with asserts for Spark code > - > > Key: MAHOUT-1708 > URL: https://issues.apache.org/jira/browse/MAHOUT-1708 > Project: Mahout > Issue Type: Bug > Components: Hdfs, Math >Affects Versions: 0.10.0 > Environment: Spark >Reporter: Pat Ferrel >Assignee: Andrew Musselman > Fix For: 0.10.1 > > > All use of Guava has been removed from the code used with Spark except the > use of Preconditions. These are pretty easy to replace. > 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark > dependency-reduced assembly. > 2) You will now get compile errors for math and hdfs, so remove the imports > and replace the Preconditions with asserts. > Not sure how many errors in replacing these will be caught by unit tests, so > be careful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1708) Replace Preconditions with asserts for Spark code
[ https://issues.apache.org/jira/browse/MAHOUT-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550671#comment-14550671 ] Pat Ferrel commented on MAHOUT-1708: [~andrew.musselman] can you create a PR branch so others can help with this? > Replace Preconditions with asserts for Spark code > - > > Key: MAHOUT-1708 > URL: https://issues.apache.org/jira/browse/MAHOUT-1708 > Project: Mahout > Issue Type: Bug > Components: Hdfs, Math >Affects Versions: 0.10.0 > Environment: Spark >Reporter: Pat Ferrel >Assignee: Andrew Musselman > Fix For: 0.10.1 > > > All use of Guava has been removed from the code used with Spark except the > use of Preconditions. These are pretty easy to replace. > 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark > dependency-reduced assembly. > 2) You will now get compile errors for math and hdfs, so remove the imports > and replace the Preconditions with asserts. > Not sure how many errors in replacing these will be caught by unit tests, so > be careful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1708) Replace Preconditions with asserts for Spark code
Pat Ferrel created MAHOUT-1708: -- Summary: Replace Preconditions with asserts for Spark code Key: MAHOUT-1708 URL: https://issues.apache.org/jira/browse/MAHOUT-1708 Project: Mahout Issue Type: Bug Components: Hdfs, Math Affects Versions: 0.10.0 Environment: Spark Reporter: Pat Ferrel Assignee: Andrew Musselman Fix For: 0.10.1 All use of Guava has been removed from the code used with Spark except the use of Preconditions. These are pretty easy to replace. 1) Remove Guava from mahout-math, mahout-hdfs, the poms, and the Spark dependency-reduced assembly. 2) You will now get compile errors for math and hdfs, so remove the imports and replace the Preconditions with asserts. Not sure how many errors in replacing these will be caught by unit tests, so be careful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1707) Spark-itemsimilarity uses too much memory
[ https://issues.apache.org/jira/browse/MAHOUT-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1707: --- Description: java.lang.OutOfMemoryError: Java heap space The code has an unnecessary .collect(), forcing all interaction data into memory of the client/driver. Increasing the executor memory will not help with this. remove this line and rebuild Mahout. https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157 The errant line reads: interactions.collect() This forces the user action data into memory, a bad thing for memory consumption. Removing it should allow for better Spark memory management. was: java.lang.OutOfMemoryError: Java heap space The code has an unnecessary .collect(), forcing all interaction data into memory of the client/driver. Increasing the executor memory will not help with this. remove this line and rebuild Mahout. https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157 The errant line reads: interactions.collect() This forces the user action data into memory, a bad thing for memory consumption. > Spark-itemsimilarity uses too much memory > - > > Key: MAHOUT-1707 > URL: https://issues.apache.org/jira/browse/MAHOUT-1707 > Project: Mahout > Issue Type: Bug > Components: Collaborative Filtering, cooccurrence >Affects Versions: 0.10.0 > Environment: Spark >Reporter: Pat Ferrel >Assignee: Pat Ferrel > Fix For: 0.10.1 > > > java.lang.OutOfMemoryError: Java heap space > The code has an unnecessary .collect(), forcing all interaction data into > memory of the client/driver. Increasing the executor memory will not help > with this. > remove this line and rebuild Mahout. 
> https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157 > The errant line reads: > interactions.collect() > This forces the user action data into memory, a bad thing for memory > consumption. Removing it should allow for better Spark memory management. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
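The fix described above amounts to keeping the RDD distributed at write time. A minimal sketch (hypothetical names; not the actual patch to TextDelimitedReaderWriter), assuming Spark on the classpath:

```scala
import org.apache.spark.rdd.RDD

// Sketch: `interactions` stands in for the RDD built by the reader/writer.
def writeInteractions(interactions: RDD[String], outputPath: String): Unit = {
  // BAD:  interactions.collect() materializes every partition in the
  //       driver's heap first -- the OutOfMemoryError reported here --
  //       and is unnecessary just to persist the data.
  // GOOD: let each executor write its own partitions; driver memory
  //       stays flat regardless of data size.
  interactions.saveAsTextFile(outputPath)
}
```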
[jira] [Created] (MAHOUT-1707) Spark-itemsimilarity uses too much memory
Pat Ferrel created MAHOUT-1707: -- Summary: Spark-itemsimilarity uses too much memory Key: MAHOUT-1707 URL: https://issues.apache.org/jira/browse/MAHOUT-1707 Project: Mahout Issue Type: Bug Components: Collaborative Filtering, cooccurrence Affects Versions: 0.10.0 Environment: Spark Reporter: Pat Ferrel Assignee: Pat Ferrel Fix For: 0.10.1 java.lang.OutOfMemoryError: Java heap space The code has an unnecessary .collect(), forcing all interaction data into memory of the client/driver. Increasing the executor memory will not help with this. remove this line and rebuild Mahout. https://github.com/apache/mahout/blob/mahout-0.10.x/spark/src/main/scala/org/apache/mahout/drivers/TextDelimitedReaderWriter.scala#L157 The errant line reads: interactions.collect() This forces the user action data into memory, a bad thing for memory consumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib
[ https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14502271#comment-14502271 ] Pat Ferrel commented on MAHOUT-1689: [~Andrew_Palumbo], I have the example almost ready. Do you have a page for it? Planning an Example to go with the, errr, example. It will be one of the GitHub downloads into the Examples directory. Also will create an .mscala file as another way to run it. > Create a doc on how to write an app that uses Mahout as a lib > - > > Key: MAHOUT-1689 > URL: https://issues.apache.org/jira/browse/MAHOUT-1689 > Project: Mahout > Issue Type: Documentation >Affects Versions: 0.10.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.11.0 > > > Create a doc on how to write an app that uses Mahout as a lib -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+
[ https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492498#comment-14492498 ] Pat Ferrel commented on MAHOUT-1685: If you read between the lines of Sean's reply, he is saying none of it is meant to be a supported "API", which I take to mean they give no indication of change or deprecation (rather obvious). They have no intent to make it public again, so if we don't work around it we'll have to petition for a supported API. Some obvious solutions, without looking too deeply: 1) Create our own shell from the Scala REPL, maybe using Spark's shell as a template. Pro is we depend on the Scala REPL + supported Spark APIs. Downside is that this is a much bigger chunk of code than the current shell. 2) Can we turn the shell into a .mscala-type scala-as-script extension to the Spark shell? This would obviously require a lot of imports and the compile delay at every load. Upside is that it goes through supported APIs that are less likely to change. Downside is little control over initialization of the context and Kryo. 3) Petition them to support the API we use. This is by far the easiest and seems like it might be worth writing a Jira in Spark, if only to get their response. > Move Mahout shell to Spark 1.3+ > --- > > Key: MAHOUT-1685 > URL: https://issues.apache.org/jira/browse/MAHOUT-1685 > Project: Mahout > Issue Type: Improvement > Components: Mahout spark shell >Reporter: Pat Ferrel >Assignee: Dmitriy Lyubimov >Priority: Critical > Fix For: 0.11.0 > > Attachments: mahout-shell-spark-1.3-errors.txt > > > Building for Spark 1.3 we found several important APIs used by the shell are > now marked package private in Spark, making them inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1689) Create a doc on how to write an app that uses Mahout as a lib
[ https://issues.apache.org/jira/browse/MAHOUT-1689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491795#comment-14491795 ] Pat Ferrel commented on MAHOUT-1689: This will be an example of using cooccurrence on many inputs. The CLI supports only 2. Will try to do it as a project and as an .mscala file. > Create a doc on how to write an app that uses Mahout as a lib > - > > Key: MAHOUT-1689 > URL: https://issues.apache.org/jira/browse/MAHOUT-1689 > Project: Mahout > Issue Type: Documentation >Affects Versions: 0.10.0 >Reporter: Andrew Palumbo >Assignee: Pat Ferrel > Fix For: 0.11.0 > > > Create a doc on how to write an app that uses Mahout as a lib -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+
[ https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491790#comment-14491790 ] Pat Ferrel commented on MAHOUT-1685: Should we ask Spark why this needs to be private? I wonder if [~sowen] knows? Sean, this is the Mahout extended Spark REPL; the APIs it needs are now private. > Move Mahout shell to Spark 1.3+ > --- > > Key: MAHOUT-1685 > URL: https://issues.apache.org/jira/browse/MAHOUT-1685 > Project: Mahout > Issue Type: Improvement > Components: Mahout spark shell >Reporter: Pat Ferrel >Assignee: Dmitriy Lyubimov >Priority: Critical > Fix For: 0.11.0 > > Attachments: mahout-shell-spark-1.3-errors.txt > > > Building for Spark 1.3 we found several important APIs used by the shell are > now marked package private in Spark, making them inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+
[ https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491788#comment-14491788 ] Pat Ferrel commented on MAHOUT-1685: This is shell specific; we probably fixed the rest, but since the shell doesn't compile we haven't tested the other parts. > Move Mahout shell to Spark 1.3+ > --- > > Key: MAHOUT-1685 > URL: https://issues.apache.org/jira/browse/MAHOUT-1685 > Project: Mahout > Issue Type: Improvement > Components: Mahout spark shell >Reporter: Pat Ferrel >Assignee: Dmitriy Lyubimov >Priority: Critical > Fix For: 0.11.0 > > Attachments: mahout-shell-spark-1.3-errors.txt > > > Building for Spark 1.3 we found several important APIs used by the shell are > now marked package private in Spark, making them inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1685) Move Mahout shell to Spark 1.3+
[ https://issues.apache.org/jira/browse/MAHOUT-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491623#comment-14491623 ] Pat Ferrel commented on MAHOUT-1685: [~Andrew_Palumbo] can you attach the errors you saw here? IMO we really need to get the shell working; it's a big feature, and the distros are already on 1.2. By the time we get 0.10.1 out they may be on 1.4. We definitely don't want to drop the shell. > Move Mahout shell to Spark 1.3+ > --- > > Key: MAHOUT-1685 > URL: https://issues.apache.org/jira/browse/MAHOUT-1685 > Project: Mahout > Issue Type: Bug > Components: Mahout spark shell >Affects Versions: 0.10.1 >Reporter: Pat Ferrel >Assignee: Dmitriy Lyubimov >Priority: Critical > Fix For: 0.10.1 > > > Building for Spark 1.3 we found several important APIs used by the shell are > now marked package private in Spark, making them inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1685) Move Mahout shell to Spark 1.3+
Pat Ferrel created MAHOUT-1685: -- Summary: Move Mahout shell to Spark 1.3+ Key: MAHOUT-1685 URL: https://issues.apache.org/jira/browse/MAHOUT-1685 Project: Mahout Issue Type: Bug Components: Mahout spark shell Affects Versions: 0.10.1 Reporter: Pat Ferrel Assignee: Dmitriy Lyubimov Priority: Critical Fix For: 0.10.1 Building for Spark 1.3 we found several important APIs used by the shell are now marked package private in Spark, making them inaccessible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1679) example script run-item-sim should work on hdfs as well as local
Pat Ferrel created MAHOUT-1679: -- Summary: example script run-item-sim should work on hdfs as well as local Key: MAHOUT-1679 URL: https://issues.apache.org/jira/browse/MAHOUT-1679 Project: Mahout Issue Type: Bug Components: Examples Affects Versions: 0.10.0 Reporter: Pat Ferrel Assignee: Pat Ferrel Priority: Minor Fix For: 0.10.1 mahout/examples/bin/run-item-sim does not run on a cluster or pseudo-cluster Spark + HDFS. It prints a warning and instructions for running on a cluster, but it should just work in either mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1678) Hadoop 1 build broken
Pat Ferrel created MAHOUT-1678: -- Summary: Hadoop 1 build broken Key: MAHOUT-1678 URL: https://issues.apache.org/jira/browse/MAHOUT-1678 Project: Mahout Issue Type: Bug Components: Integration Affects Versions: 0.10.0 Reporter: Pat Ferrel Assignee: Suneel Marthi Priority: Blocker Fix For: 0.10.0 building for H1 got error below, which blocks build tests for H1 T E S T S --- Running org.apache.mahout.clustering.TestClusterDumper Running org.apache.mahout.clustering.TestClusterEvaluator Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.033 sec - in org.apache.mahout.clustering.TestClusterDumper Running org.apache.mahout.clustering.cdbw.TestCDbwEvaluator Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 10.089 sec - in org.apache.mahout.clustering.cdbw.TestCDbwEvaluator Running org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarityTest Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.701 sec - in org.apache.mahout.cf.taste.impl.similarity.jdbc.MySQLJDBCInMemoryItemSimilarityTest Running org.apache.mahout.text.LuceneStorageConfigurationTest Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.903 sec - in org.apache.mahout.text.LuceneStorageConfigurationTest Running org.apache.mahout.text.LuceneSegmentInputSplitTest Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 3.552 sec <<< FAILURE! - in org.apache.mahout.text.LuceneSegmentInputSplitTest testGetSegment(org.apache.mahout.text.LuceneSegmentInputSplitTest) Time elapsed: 2.248 sec <<< ERROR! 
java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem; at __randomizedtesting.SeedInfo.seed([B6AAF6EC1A001636:33AA49EC475E421B]:0) at org.apache.solr.store.hdfs.HdfsDirectory.(HdfsDirectory.java:58) at org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92) at org.apache.mahout.text.LuceneSegmentInputSplitTest.assertSegmentContainsOneDoc(LuceneSegmentInputSplitTest.java:81) at org.apache.mahout.text.LuceneSegmentInputSplitTest.testGetSegment(LuceneSegmentInputSplitTest.java:59) testGetSegmentNonExistingSegment(org.apache.mahout.text.LuceneSegmentInputSplitTest) Time elapsed: 0.958 sec <<< ERROR! java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem; at __randomizedtesting.SeedInfo.seed([B6AAF6EC1A001636:F16E11692CC0C088]:0) at org.apache.solr.store.hdfs.HdfsDirectory.(HdfsDirectory.java:58) at org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92) at org.apache.mahout.text.LuceneSegmentInputSplitTest.testGetSegmentNonExistingSegment(LuceneSegmentInputSplitTest.java:76) Running org.apache.mahout.text.SequenceFilesFromLuceneStorageTest Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 29.904 sec - in org.apache.mahout.clustering.TestClusterEvaluator Running org.apache.mahout.text.LuceneSegmentRecordReaderTest Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed: 5.239 sec <<< FAILURE! - in org.apache.mahout.text.LuceneSegmentRecordReaderTest testNonExistingIdField(org.apache.mahout.text.LuceneSegmentRecordReaderTest) Time elapsed: 2.588 sec <<< ERROR! 
java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem; at __randomizedtesting.SeedInfo.seed([BE4E63CDB556DEFF:25483164126E6A9]:0) at org.apache.solr.store.hdfs.HdfsDirectory.(HdfsDirectory.java:58) at org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92) at org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:55) at org.apache.mahout.text.LuceneSegmentRecordReaderTest.testNonExistingIdField(LuceneSegmentRecordReaderTest.java:93) testNonExistingField(org.apache.mahout.text.LuceneSegmentRecordReaderTest) Time elapsed: 1.188 sec <<< ERROR! java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileSystem.newInstance(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/fs/FileSystem; at __randomizedtesting.SeedInfo.seed([BE4E63CDB556DEFF:4252F6007B6F27B1]:0) at org.apache.solr.store.hdfs.HdfsDirectory.(HdfsDirectory.java:58) at org.apache.mahout.text.LuceneSegmentInputSplit.getSegment(LuceneSegmentInputSplit.java:92) at org.apache.mahout.text.LuceneSegmentRecordReader.initialize(LuceneSegmentRecordReader.java:55) at org.apache.ma
[jira] [Resolved] (MAHOUT-1674) A'A fails getting with an index out of range for a row vector
[ https://issues.apache.org/jira/browse/MAHOUT-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1674. Resolution: Fixed Assignee: Pat Ferrel (was: Dmitriy Lyubimov) Made a change to blas that catches this case; it passes one user's test that I was able to reproduce. > A'A fails getting with an index out of range for a row vector > - > > Key: MAHOUT-1674 > URL: https://issues.apache.org/jira/browse/MAHOUT-1674 > Project: Mahout > Issue Type: Bug > Components: s >Affects Versions: 0.10.0 >Reporter: Pat Ferrel >Assignee: Pat Ferrel >Priority: Critical > Fix For: 0.10.0 > > > A'A and possibly A'B can fail with an index out of bounds on the row vector. > This seems related to partitioning where some partitions may be empty. > This can be reproduced with the attached data as input into > spark-itemsimilarity. This is only A data and the one large csv will complete > correctly but passing in the directory of part files will exhibit the error. > The data is identical except in the number of files that are used to contain > the data. > The error occurs using the local raw filesystem and with master = local and > is pretty fast to reach. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1512) Hadoop 2 compatibility
[ https://issues.apache.org/jira/browse/MAHOUT-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396317#comment-14396317 ] Pat Ferrel commented on MAHOUT-1512: Was there work done recently? I failed on 2.6 last Friday, 2-3-2015. If someone has a known good install of 2.6 on a pseudo-cluster or better, I can provide a simple test. > Hadoop 2 compatibility > -- > > Key: MAHOUT-1512 > URL: https://issues.apache.org/jira/browse/MAHOUT-1512 > Project: Mahout > Issue Type: Task >Reporter: Sebastian Schelter >Assignee: Suneel Marthi >Priority: Critical > Labels: legacy, scala > Fix For: 0.10.0 > > > We must ensure that all our MR code also runs on Hadoop 2. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1588) Multiple input path support in recommendation job
[ https://issues.apache.org/jira/browse/MAHOUT-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1588: --- Resolution: Won't Fix Status: Resolved (was: Patch Available) > Multiple input path support in recommendation job > - > > Key: MAHOUT-1588 > URL: https://issues.apache.org/jira/browse/MAHOUT-1588 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering >Affects Versions: 0.9 >Reporter: Xiaomeng Huang >Assignee: Pat Ferrel >Priority: Minor > Labels: legacy > Fix For: 0.10.0 > > Attachments: Mahout-1588.000.patch > > > Now the recommendation job can only import one input path via "--input", and can't > load files from a different path. Customers may put preference data in different > paths. This is a very common scenario. > I added an option named "--multiInput (-mi)" and didn't remove the original input > option. These two input options can be set together, and the modification only > touches PreparePreferenceMatrixJob, which loads data from the filesystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1588) Multiple input path support in recommendation job
[ https://issues.apache.org/jira/browse/MAHOUT-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396298#comment-14396298 ] Pat Ferrel commented on MAHOUT-1588: Does this work for all recommender CLIs? The new spark-itemsimilarity already has a flexible method for passing in multiple directories and files, even supporting recursive regex discovery of input. This is too large for 0.10.0 and may not be important enough to test for a later release. If the contributor feels this is important, please create a PR and include tests. > Multiple input path support in recommendation job > - > > Key: MAHOUT-1588 > URL: https://issues.apache.org/jira/browse/MAHOUT-1588 > Project: Mahout > Issue Type: New Feature > Components: Collaborative Filtering >Affects Versions: 0.9 >Reporter: Xiaomeng Huang >Assignee: Pat Ferrel >Priority: Minor > Labels: legacy > Fix For: 0.10.0 > > Attachments: Mahout-1588.000.patch > > > Now the recommendation job can only import one input path via "--input", and can't > load files from a different path. Customers may put preference data in different > paths. This is a very common scenario. > I added an option named "--multiInput (-mi)" and didn't remove the original input > option. These two input options can be set together, and the modification only > touches PreparePreferenceMatrixJob, which loads data from the filesystem. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
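The multi-path input handling mentioned above (multiple comma-delimited directories and files, with recursive pattern discovery inside directories) can be sketched outside of Mahout. Note that `discover_inputs` and its `pattern` argument are hypothetical names for illustration, not part of spark-itemsimilarity's actual CLI:

```python
from pathlib import Path

def discover_inputs(paths, pattern="*.csv"):
    """Collect input files from a comma-delimited list of files and
    directories, recursing into directories. Hypothetical helper that
    mimics the kind of multi-path discovery described above."""
    found = []
    for p in (Path(s) for s in paths.split(",")):
        if p.is_dir():
            # recursive pattern match inside the directory
            found.extend(sorted(p.rglob(pattern)))
        elif p.is_file():
            found.append(p)
    return found
```

A caller could pass something like "data/2015-03,extra/prefs.csv" and get back every matching file under both locations.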
[jira] [Commented] (MAHOUT-1674) A'A fails getting with an index out of range for a row vector
[ https://issues.apache.org/jira/browse/MAHOUT-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396290#comment-14396290 ] Pat Ferrel commented on MAHOUT-1674: [~dlie...@gmail.com] will not be able to fix this until 0.10.1, so [~pferrel] is looking for some guidance on a short-term workaround. The reason this is hard to ignore is that two users are gathering data with Spark Streaming, which tends to create lots of small files, and they have run into this error. Kafka (or other) to Spark Streaming will be an increasingly popular method for input to cooccurrence calculation. The only known workaround is to concatenate input files before reading them into Mahout. This has been verified in only one case. > A'A fails getting with an index out of range for a row vector > - > > Key: MAHOUT-1674 > URL: https://issues.apache.org/jira/browse/MAHOUT-1674 > Project: Mahout > Issue Type: Bug > Components: s >Affects Versions: 0.10.0 >Reporter: Pat Ferrel >Assignee: Dmitriy Lyubimov >Priority: Critical > Fix For: 0.10.0 > > > A'A and possibly A'B can fail with an index out of bounds on the row vector. > This seems related to partitioning where some partitions may be empty. > This can be reproduced with the attached data as input into > spark-itemsimilarity. This is only A data and the one large csv will complete > correctly but passing in the directory of part files will exhibit the error. > The data is identical except in the number of files that are used to contain > the data. > The error occurs using the local raw filesystem and with master = local and > is pretty fast to reach. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
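The workaround described above, concatenating many small part files into one input before handing it to Mahout, might look like the following local-filesystem sketch. On HDFS one would typically use `hadoop fs -getmerge` instead; `merge_part_files` is a hypothetical helper for illustration, not a Mahout API:

```python
import glob
import shutil

def merge_part_files(parts_dir, merged_path):
    """Concatenate Spark-style part-* files into a single input file.
    Local-filesystem stand-in for `hadoop fs -getmerge`, illustrating
    the workaround; not part of Mahout."""
    with open(merged_path, "wb") as out:
        # append each part file in lexical order
        for part in sorted(glob.glob(f"{parts_dir}/part-*")):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

The merged file can then be passed to spark-itemsimilarity as a single input, avoiding the empty-partition path entirely.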
[jira] [Created] (MAHOUT-1674) A'A fails getting with an index out of range for a row vector
Pat Ferrel created MAHOUT-1674: -- Summary: A'A fails getting with an index out of range for a row vector Key: MAHOUT-1674 URL: https://issues.apache.org/jira/browse/MAHOUT-1674 Project: Mahout Issue Type: Bug Components: s Affects Versions: 0.10.0 Reporter: Pat Ferrel Assignee: Dmitriy Lyubimov Priority: Critical Fix For: 0.10.0 A'A and possibly A'B can fail with an index out of bounds on the row vector. This seems related to partitioning, where some partitions may be empty. This can be reproduced with the attached data as input into spark-itemsimilarity. This is only A data; the one large csv will complete correctly, but passing in the directory of part files will exhibit the error. The data is identical except in the number of files that are used to contain the data. The error occurs using the local raw filesystem and with master = local, and is pretty fast to reach. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
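The failure mode described above can be simulated in a few lines: a per-partition reduction that assumes every partition is non-empty breaks as soon as small input files leave some partitions empty. This is illustrative Python only, not Mahout's actual blas code, and both function names are hypothetical:

```python
# One empty partition, as happens when input arrives as many small part files.
partitions = [[3, 1, 4], [], [1, 5]]

def max_index_naive(parts):
    """Assumes every partition has elements; raises ValueError on an empty one."""
    return max(max(p) for p in parts)

def max_index_safe(parts):
    """Skips empty partitions; returns -1 if all partitions are empty."""
    return max((max(p) for p in parts if p), default=-1)
```

The naive form mirrors how an index computation can blow up only when the data is split across part files, while the single large csv (one non-empty partition) completes correctly.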
[jira] [Resolved] (MAHOUT-1655) Refactor module dependencies
[ https://issues.apache.org/jira/browse/MAHOUT-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1655. Resolution: Fixed Finished refactoring. OIUtils seems mostly anachronistic. The only thing currently used in Scala must be the vector-writable-to-vector conversion, and that might be replaced with a couple of lines of Scala, but the class is small so it's not a big deal. > Refactor module dependencies > > > Key: MAHOUT-1655 > URL: https://issues.apache.org/jira/browse/MAHOUT-1655 > Project: Mahout > Issue Type: Improvement > Components: mrlegacy >Affects Versions: 0.9 >Reporter: Pat Ferrel >Assignee: Andrew Musselman >Priority: Critical > Fix For: 0.10.0 > > > Make a new module, call it mahout-hadoop. Move anything there that is > currently in mrlegacy but used in math-scala or spark. Remove dependencies on > mrlegacy altogether if possible by using other core classes. > The goal is to have math-scala and spark module depend on math, and a small > module called mahout-hadoop (much smaller than mrlegacy). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (MAHOUT-1646) Refactor out all possible mrlegacy dependencies from Scala code
[ https://issues.apache.org/jira/browse/MAHOUT-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel resolved MAHOUT-1646. Resolution: Duplicate duplicate of MAHOUT-1655 > Refactor out all possible mrlegacy dependencies from Scala code > --- > > Key: MAHOUT-1646 > URL: https://issues.apache.org/jira/browse/MAHOUT-1646 > Project: Mahout > Issue Type: Improvement > Components: build >Affects Versions: 0.9 >Reporter: Pat Ferrel >Assignee: Dmitriy Lyubimov > Fix For: 0.10.1 > > > Scala/Spark code depends on the mrlegacy module even though very few things > are really used. move those needed pieces to math so as to remove this > dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1662) Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans
[ https://issues.apache.org/jira/browse/MAHOUT-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393418#comment-14393418 ] Pat Ferrel commented on MAHOUT-1662: I'm getting the "wrong FS" error with spark-itemsimilarity on Hadoop 2.6 + Spark 1.1.0 + YARN. Any relation? I have HDFS running and can see the input file with "hadoop fs -ls /input" and in the Hadoop GUI, but I get a wrong FS error when getting a file status in the code. > Potential Path bug in SequenceFileVaultIterator breaks DisplaySpectralKMeans > > > Key: MAHOUT-1662 > URL: https://issues.apache.org/jira/browse/MAHOUT-1662 > Project: Mahout > Issue Type: Bug > Components: Examples, mrlegacy >Affects Versions: 0.9 >Reporter: Shannon Quinn >Assignee: Shannon Quinn > Fix For: 0.10.0 > > > Received the following error when attempting to run DisplaySpectralKMeans: > Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: > file://tmp/calculations/diagonal/part-r-0/tmp/calculations/diagonal/part-r-0, > expected: file:/// > at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645) > at > org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80) > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1750) > at > org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:1774) > at > org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator.(SequenceFileValueIterator.java:56) > at > org.apache.mahout.clustering.spectral.VectorCache.load(VectorCache.java:115) > at > 
org.apache.mahout.clustering.spectral.MatrixDiagonalizeJob.runJob(MatrixDiagonalizeJob.java:77) > at > org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:170) > at > org.apache.mahout.clustering.spectral.kmeans.SpectralKMeansDriver.run(SpectralKMeansDriver.java:117) > at > org.apache.mahout.clustering.display.DisplaySpectralKMeans.main(DisplaySpectralKMeans.java:76) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) > Tracked the origin of the bug to line 54 of SequenceFileVaultIterator. PR > which contains a fix is available; I would ask for independent verification > before merging it with master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
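The "Wrong FS" exception quoted above shows a path of the form file://tmp/...: with only two slashes, the first path segment parses as the URI authority (a host name) rather than as part of the path, which is why the check against the expected file:/// fails. A small sketch with Python's standard urlparse, using a simplified single copy of the path from the trace:

```python
from urllib.parse import urlparse

# With two slashes, "tmp" is parsed as the URI authority (host),
# not as part of the path -- the shape Hadoop rejects as "wrong FS".
bad = urlparse("file://tmp/calculations/diagonal/part-r-0")
# With three slashes, the authority is empty and /tmp stays in the path.
good = urlparse("file:///tmp/calculations/diagonal/part-r-0")

print(bad.netloc, bad.path)    # authority is "tmp", path lost its /tmp prefix
print(good.netloc, good.path)  # empty authority, full /tmp/... path
```

This is generic URI parsing, not Hadoop's Path code, but it shows why a path built by naive string concatenation of "file://" and a local path triggers the check in FileSystem.checkPath.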
[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391437#comment-14391437 ] Pat Ferrel commented on MAHOUT-1618: No, this is a full-blown example of integration with Solr and is a fairly big project. The doc I was referring to is a simple, actual quickstart and should be no more than a page. > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Affects Versions: cooccurrence >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.10.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MAHOUT-1667) Support Hadoop 1.2.1 in poms
Pat Ferrel created MAHOUT-1667: -- Summary: Support Hadoop 1.2.1 in poms Key: MAHOUT-1667 URL: https://issues.apache.org/jira/browse/MAHOUT-1667 Project: Mahout Issue Type: Bug Components: build Affects Versions: 0.10.0 Reporter: Pat Ferrel Assignee: Suneel Marthi Priority: Critical Fix For: 0.10.0 Need to support building for Hadoop 1.2.1 with the hadoop1 profile in the poms. Errors for non-existent artifacts appear when running "mvn -Phadoop1 -Dhadoop.version=1.2.1 clean install": hadoop-auth, which does not exist for Hadoop 1.2.1, along with hadoop-yarn and several other artifacts. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391156#comment-14391156 ] Pat Ferrel commented on MAHOUT-1618: Skeleton code is written; it will have to wait until after 0.10.0 before it is added to the site. > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Affects Versions: cooccurrence >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.10.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAHOUT-1589) mahout.cmd has duplicated content
[ https://issues.apache.org/jira/browse/MAHOUT-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat Ferrel updated MAHOUT-1589: --- Resolution: Fixed Status: Resolved (was: Patch Available) mahout.cmd prints a deprecation warning when run. > mahout.cmd has duplicated content > - > > Key: MAHOUT-1589 > URL: https://issues.apache.org/jira/browse/MAHOUT-1589 > Project: Mahout > Issue Type: Bug > Components: CLI >Affects Versions: 0.9 > Environment: Windows >Reporter: Venkat Ranganathan >Assignee: Pat Ferrel > Labels: legacy, scala > Fix For: 0.10.0 > > Attachments: MAHOUT-1589.patch > > > bin/mahout.cmd has duplicated contents. Need to trim it -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (MAHOUT-1618) Cooccurrence Recommender example and documentation
[ https://issues.apache.org/jira/browse/MAHOUT-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on MAHOUT-1618 started by Pat Ferrel. -- > Cooccurrence Recommender example and documentation > --- > > Key: MAHOUT-1618 > URL: https://issues.apache.org/jira/browse/MAHOUT-1618 > Project: Mahout > Issue Type: Documentation > Components: Examples >Affects Versions: cooccurrence >Reporter: Thejas Prasad >Assignee: Pat Ferrel >Priority: Trivial > Labels: DSL, cooccurence, scala, spark > Fix For: 0.10.1 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)