[ https://issues.apache.org/jira/browse/MAHOUT-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972210#comment-13972210 ]

Pat Ferrel commented on MAHOUT-1464:
------------------------------------

Well here's something I noticed that may be a clue. First, there were some Scala 
2.10 jars built for Hadoop 1.0.4 sitting in an assembly/target dir. So even 
though the managed_lib dir had the correct 1.2.1 version of Hadoop, I rebuilt 
and got rid of any 1.0.4 jars I could find.

Then, if I am running Hadoop and Mahout locally, I can launch the Spark shell, 
which creates a default context called sc. I can then perform the following:

Created spark context..
Spark context available as sc.

scala>  val textFile = sc.textFile("xrsj/ratings_data.txt")
14/04/16 18:19:52 INFO storage.MemoryStore: ensureFreeSpace(61374) called with 
curMem=0, maxMem=318111744
14/04/16 18:19:52 INFO storage.MemoryStore: Block broadcast_0 stored as values 
to memory (estimated size 59.9 KB, free 303.3 MB)
textFile: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at 
<console>:12

scala> textFile.count()
14/04/16 18:19:59 WARN util.NativeCodeLoader: Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
...
14/04/16 18:20:00 INFO spark.SparkContext: Job finished: count at <console>:15, 
took 0.995702 s
res0: Long = 664824

If I exit the Spark shell, start a local pseudo-cluster, and then restart the 
shell, the same code works, only now reading from the HDFS pseudo-cluster. The 
exact same code also works against the cluster, since the file is at the same 
location relative to where I start the shell.
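
One quick way to confirm which filesystem the shell's context has picked up 
might be to ask its Hadoop configuration directly; I'm assuming 
sc.hadoopConfiguration is exposed in the shell and that the key names are the 
standard Hadoop ones:

scala> sc.hadoopConfiguration.get("fs.default.name")   // old-style key
scala> sc.hadoopConfiguration.get("fs.defaultFS")      // newer equivalent

Whichever of those is set should be what a relative path like 
xrsj/ratings_data.txt resolves against.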

I can also address the file in absolute terms with the lines below. Notice that 
I need to use the port #; leaving it off leads to the failure-to-connect 
message in a previous comment.

scala> val textFile = sc.textFile("hdfs://occam4:54310/user/pat/xrsj/ratings_data.txt")
scala> textFile.count()

I tried using the port # in the cooccurrence test case but got the same 
failure-to-connect message.

Since the Spark Scala shell creates its context by detecting the machine's HDFS 
setup, could this be the problem when running the cooccurrence example from 
IDEA? In the example, the context is set up after the input is read from HDFS, 
is that correct? I know it is not supposed to care about HDFS, only the Spark 
master, but obviously when the Spark shell creates a context and that context 
is used to get the text file, it works. Should we be doing that in the example? 
I'll try playing with how the text file is read and where the context is 
created.
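
For reference, here is a rough sketch of that experiment, assuming a bare Spark 
program rather than the actual cooccurrence driver, so the class name and 
master are just placeholders: create the context first, the way the shell does, 
then read the input through that same context using the fully qualified URI 
with the port.

import org.apache.spark.SparkContext

object ContextFirstRead {
  def main(args: Array[String]): Unit = {
    // Placeholder master; in the IDEA case this would be whatever the test uses.
    val sc = new SparkContext("local", "context-first-read")

    // Read through the already-created context, fully qualified URI with the namenode port.
    val lines = sc.textFile("hdfs://occam4:54310/user/pat/xrsj/ratings_data.txt")
    println("count = " + lines.count())

    sc.stop()
  }
}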

Perhaps naively, I would have thought that the URI I used for reads and writes 
would bypass any default settings in the context, but that doesn't seem to be 
true, does it? I suspect something in the context, or the lack of one, is 
causing Spark to be confused about where to read and write from, no matter how 
explicit the URI.
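
If it is the context's defaults that matter, one way to take the ambiguity out 
might be to set the default filesystem on the context's Hadoop configuration 
explicitly before reading; fs.default.name is the standard Hadoop key, and I'm 
assuming it is honored the same way here as in the shell:

scala> sc.hadoopConfiguration.set("fs.default.name", "hdfs://occam4:54310")
scala> val textFile = sc.textFile("xrsj/ratings_data.txt")   // relative path should now resolve against HDFS
scala> textFile.count()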
 

> Cooccurrence Analysis on Spark
> ------------------------------
>
>                 Key: MAHOUT-1464
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1464
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>         Environment: hadoop, spark
>            Reporter: Pat Ferrel
>            Assignee: Sebastian Schelter
>             Fix For: 1.0
>
>         Attachments: MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, 
> MAHOUT-1464.patch, MAHOUT-1464.patch, MAHOUT-1464.patch, run-spark-xrsj.sh
>
>
> Create a version of Cooccurrence Analysis (RowSimilarityJob with LLR) that 
> runs on Spark. This should be compatible with Mahout Spark DRM DSL so a DRM 
> can be used as input. 
> Ideally this would extend to cover MAHOUT-1422. This cross-cooccurrence has 
> several applications including cross-action recommendations. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
