[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972210#comment-13972210 ]
Pat Ferrel commented on MAHOUT-1464:
------------------------------------

Well, here's something I noticed that may be a clue. First, there were some Scala 2.10 jars built for Hadoop 1.0.4 sitting in an assembly/target dir. So even though the managed_lib dir had the correct 1.2.1 version of Hadoop, I rebuilt and got rid of any 1.0.4 jars I could find.

Then, if I am running Hadoop and Mahout locally, I can launch the Spark shell, which creates a default context called sc. I can then perform the following:

Created spark context..
Spark context available as sc.

scala> val textFile = sc.textFile("xrsj/ratings_data.txt")
14/04/16 18:19:52 INFO storage.MemoryStore: ensureFreeSpace(61374) called with curMem=0, maxMem=318111744
14/04/16 18:19:52 INFO storage.MemoryStore: Block broadcast_0 stored as values to memory (estimated size 59.9 KB, free 303.3 MB)
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> textFile.count()
14/04/16 18:19:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
...
14/04/16 18:20:00 INFO spark.SparkContext: Job finished: count at <console>:15, took 0.995702 s
res0: Long = 664824

If I exit the Spark shell, start a local pseudo-cluster, then start the shell, the same code works, only now reading from the HDFS pseudo-cluster. The exact same code works on the cluster too, since the file is at the same location relative to where I start the shell. I can also address the file in absolute terms with the lines below. Notice that I need to use the port number; leaving it off leads to the failure-to-connect message in a previous comment.

scala> val textFile = sc.textFile("hdfs://occam4:54310/user/pat/xrsj/ratings_data.txt")
scala> textFile.count()

I tried using the port number in the cooccurrence test case but get the same failure-to-connect message.
Since the Spark Scala shell creates the context by detecting the machine's HDFS setup, could this be the problem in the cooccurrence example run from IDEA? In the example, the context is set up after the input is read from HDFS, is that correct? I know it is not supposed to care about HDFS, only the Spark master, but clearly when the Spark shell creates a context and that context is used to get the text file, it works. Should we be doing that in the example? I'll try playing with how the text file is read and where the context is created. Perhaps naively, I would have thought that the URI I used for read and write would bypass any default settings in the context, but this doesn't seem to be true, does it? I suspect something in the context, or the lack of it, is causing Spark to be confused about where to read and write from, no matter how explicit the URI.

> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that runs on Spark. This should be compatible with the Mahout Spark DRM DSL so a DRM can be used as input.
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has several applications including cross-action recommendations.

--
This message was sent by Atlassian JIRA
(v6.2#6252)