You have 256G of memory in each node machine, partitioned to 16g per core?

If so you should set -sem to 256g or a little less, since that is how much 
memory per node to allocate. All cores of a node will share this memory.

The only unusual memory consideration is the dictionaries, which are broadcast 
to each node and shared by each task on the node during read and write. So 
there needs to be enough memory to store one copy of each dictionary per node. 
A dictionary is a bi-directional hashmap. At most one item-ID dictionary and 
one user-ID dictionary are broadcast for the duration of the read and write 
tasks. If a problem is occurring during reading or writing it might be the 
dictionaries, but with 256g per node this seems unlikely. How many users 
and items?
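
As a rough sketch only (the input/output paths and master URL below are taken 
from your earlier command, and the exact -sem value is just an illustration of 
"256g or a little less", not a tested setting), the invocation on a 256g node 
would look something like:

  mahout spark-itemsimilarity -i /view_input,/purchase_input -o /output -os \
    -ma spark://recommend1:7077 -sem 250g \
    -f1 purchase -f2 view -ic 2 -fc 1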


On Oct 13, 2014, at 2:30 AM, pol <swallow_p...@163.com> wrote:

Hi Pat,
        Yes, I manually stopped it, but something is wrong; it may be a 
configuration error or insufficient memory. I have asked the Spark mailing 
list for help.
        The other spark-itemsimilarity problem I am asking about in a separate 
mail. Thank you.


On Oct 11, 2014, at 09:22, Pat Ferrel <p...@occamsmachete.com> wrote:

> Did you stop the 1.6g job or did it fail?
> 
> I see task failures but no stage failures.
> 
> 
> On Oct 10, 2014, at 8:49 AM, pol <swallow_p...@163.com> wrote:
> 
> Hi Pat,
>       Yes, spark-itemsimilarity can work OK; it finished the calculation on a 
> 150m dataset.
> 
>       For the problem above, the 1.6g dataset cannot finish the calculation. 
> I have three machines (16 cores and 16g memory each) for this test; can this 
> environment not finish the calculation?
>       The dataset was archived into one file with the hadoop archive tool, 
> which is why only one machine is in the processing state. I did this because 
> without archiving some errors occur; see the attachments for more information.
>       <spark1.png>
> 
> <spark2.png>
> 
> <spark3.png>
> 
> 
>       If you are able to test it, I will provide the test dataset to you. 
> 
>       Thank you again.
> 
> 
> On Oct 10, 2014, at 22:07, Pat Ferrel <p...@occamsmachete.com> wrote:
> 
>> So it is completing some of the spark-itemsimilarity jobs now? That is 
>> better at least.
>> 
>> Yes. More data means you may need more memory or more nodes in your cluster. 
>> This is how to scale Spark and Hadoop. Spark in particular needs core memory 
>> since it tries to avoid disk read/write.
>> 
>> Try increasing -sem as far as you can first, then you may need to add 
>> machines to your cluster to speed it up. Do you need results faster than 15 
>> hours?
>> 
>> Remember that the way the Solr recommender works allows you to make 
>> recommendations to new users and train less often. The new user data does 
>> not have to be in the training/indicator data. You retrain partly based on 
>> how many new users there are and partly based on how many new items are 
>> added to the catalog.
>> 
>> On Oct 10, 2014, at 1:47 AM, pol <swallow_p...@163.com> wrote:
>> 
>> Hi Pat,
>>      Because of a holiday, I am only replying now.
>> 
>>      I changed 1.0.2 back to 1.0.1 in mahout-1.0-SNAPSHOT and used Spark 
>> 1.0.1 and Hadoop 2.4.0; spark-itemsimilarity now works OK. But I have a new 
>> question:
>>      mahout spark-itemsimilarity -i /view_input,/purchase_input -o /output 
>> -os -ma spark://recommend1:7077 -sem 15g -f1 purchase -f2 view -ic 2 -fc 1 
>> -m 36
>> 
>>      With "view" data of 1.6g and "purchase" data of 60m, this command has 
>> not finished after 15 hours (the "indicator-matrix" is computed and the 
>> "cross-indicator-matrix" is still computing), but with "view" data of 100m 
>> it finished in 2 minutes. Is the data the reason?
>> 
>> 
>> On Oct 1, 2014, at 01:10, Pat Ferrel <p...@occamsmachete.com> wrote:
>> 
>>> This will not be fixed in Mahout 1.0 unless we can find a problem in Mahout 
>>> now. I am the one who would fix it. At present it looks to me like a Spark 
>>> version or setup problem.
>>> 
>>> These errors seem to indicate that the build or setup has a problem. It 
>>> seems that you cannot use Spark 1.1.0. Set up your cluster to use 
>>> mahout-1.0-SNAPSHOT with the pom set back to spark-1.0.1, a Spark 1.0.1 
>>> build for Hadoop 2.4, and Hadoop 2.4. This is the only combination that is 
>>> supposed to work together.
>>> 
>>> If this still fails it may be a setup problem since I can run on a cluster 
>>> just fine with my setup. When you get an error from this config, send it to 
>>> me and the Spark user list to see if they can give us a clue.
>>> 
>>> Question: Do you have mahout-1.0-SNAPSHOT and spark installed on all your 
>>> cluster machines, with the correct environment variables and path?
>>> 
>>> 
>>> On Sep 30, 2014, at 12:47 AM, pol <swallow_p...@163.com> wrote:
>>> 
>>> Hi Pat, 
>>>     It was a Spark version problem, but spark-itemsimilarity still cannot 
>>> complete normally.
>>> 
>>> 1. Changing 1.0.1 to 1.1.0 in mahout-1.0-SNAPSHOT/pom.xml, the Spark 
>>> version compatibility is no longer a problem, but the program has a problem:
>>> --------------------------------------------------------------
>>> 14/09/30 11:26:04 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 
>>> 10.1 (TID 31, Hadoop.Slave1): java.lang.NoClassDefFoundError:  
>>>       org/apache/commons/math3/random/RandomGenerator
>>>       org.apache.mahout.common.RandomUtils.getRandom(RandomUtils.java:65)
>>>       
>>> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:228)
>>>       
>>> org.apache.mahout.math.cf.SimilarityAnalysis$$anonfun$3.apply(SimilarityAnalysis.scala:223)
>>>       
>>> org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:33)
>>>       
>>> org.apache.mahout.sparkbindings.blas.MapBlock$$anonfun$1.apply(MapBlock.scala:32)
>>>       scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>       scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>>       
>>> org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:235)
>>>       
>>> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
>>>       org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
>>>       org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
>>>       org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>>>       org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>       org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>       
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>       org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>       org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>       org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>>       org.apache.spark.scheduler.Task.run(Task.scala:54)
>>>       org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>>       
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>       
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>       java.lang.Thread.run(Thread.java:662)
>>> --------------------------------------------------------------
>>> I tried adding commons-math3-3.2.jar to mahout-1.0-SNAPSHOT/lib, but the 
>>> result is the same. (RandomUtils.java:65 does not use RandomGenerator 
>>> directly.)
>>> 
>>> 
>>> 2. Changing 1.0.1 to 1.0.2 in mahout-1.0-SNAPSHOT/pom.xml, there are still 
>>> other errors:
>>> --------------------------------------------------------------
>>> 14/09/30 14:36:57 WARN scheduler.TaskSetManager: Lost TID 427 (task 7.0:51)
>>> 14/09/30 14:36:57 WARN scheduler.TaskSetManager: Loss was due to 
>>> java.lang.ClassCastException
>>> java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Tuple2
>>>       at 
>>> org.apache.mahout.drivers.TDIndexedDatasetReader$$anonfun$4.apply(TextDelimitedReaderWriter.scala:75)
>>>       at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>       at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>       at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:59)
>>>       at 
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
>>>       at 
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
>>>       at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
>>>       at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:594)
>>>       at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>>       at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>>       at 
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>>       at org.apache.spark.scheduler.Task.run(Task.scala:51)
>>>       at 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>>>       at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>       at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>       at java.lang.Thread.run(Thread.java:662)
>>> --------------------------------------------------------------
>>> Please refer to the attachment for the full log.
>>> <screenlog_bash.log>
>>> 
>>> 
>>> 
>>> In addition, I used 66 files on HDFS, each file 20 to 30 M; if necessary I 
>>> will provide the data.
>>> The shell command is: mahout spark-itemsimilarity -i 
>>> /rec/input/ss/others,/rec/input/ss/weblog -o /rec/output/ss -os -ma 
>>> spark://recommend1:7077 -sem 4g -f1 purchase -f2 view -ic 2 -fc 1
>>> Spark cluster: 8 workers, 32 cores total, 32G memory total, on two machines.
>>> 
>>> If this is not solved in a few days, it may be better to wait for the 
>>> Mahout 1.0 release version or use mahout itemsimilarity instead.
>>> 
>>> 
>>> Thank you again, Pat.
>>> 
>>> 
>>> On Sep 29, 2014, at 00:02, Pat Ferrel <p...@occamsmachete.com> wrote:
>>> 
>>>> It looks like the cluster version of spark-itemsimilarity is never 
>>>> accepted by the Spark master. It fails in TextDelimitedReaderWriter.scala 
>>>> because all work uses “lazy” evaluation and no actual work is done on the 
>>>> Spark cluster until the write.
>>>> 
>>>> However your cluster seems to be working with the Pi example. Therefore 
>>>> there must be something wrong with the Mahout build or config. Some ideas:
>>>> 
>>>> 1) Mahout 1.0-SNAPSHOT is targeted for Spark 1.0.1. However I use 1.0.2 
>>>> and it seems to work. You might try changing the version in the pom.xml 
>>>> and doing a clean build of Mahout (see the rebuild sketch after suggestion 
>>>> 3 below). Change the version number in mahout/pom.xml:
>>>> 
>>>> mahout/pom.xml
>>>> -     <spark.version>1.0.1</spark.version>
>>>> +    <spark.version>1.1.0</spark.version>
>>>> 
>>>> This may not be needed but it is easier than installing Spark 1.0.1.
>>>> 
>>>> 2) Try installing and building Mahout on all cluster machines. I do this 
>>>> so I can run the Mahout spark-shell on any machine, but it may also be 
>>>> required here. 
>>>> The Mahout jars, path setup, and directory structure should be the same on 
>>>> all cluster machines.
>>>> 
>>>> 3) Try making -sem larger. I usually make it as large as I can on the 
>>>> cluster and then try smaller values until it affects performance. The 
>>>> epinions dataset that I use for testing on my cluster requires -sem 6g.
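>>>> 
>>>> As a sketch only, reusing the same Maven options from the build command 
>>>> you posted (so the flags are an assumption based on your setup, not a 
>>>> requirement), the clean rebuild after editing the pom would be something 
>>>> like:
>>>> 
>>>>   mvn -Dhadoop2.version=2.4.1 -DskipTests clean package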
>>>> 
>>>> My cluster has 3 machines with Hadoop 1.2.1 and Spark 1.0.2.  I can try 
>>>> running your data through spark-itemsimilarity on my cluster if you can 
>>>> share it. I will sign an NDA and destroy it after the test.
>>>> 
>>>> 
>>>> 
>>>> On Sep 27, 2014, at 5:28 AM, pol <swallow_p...@163.com> wrote:
>>>> 
>>>> Hi Pat,
>>>>    Thanks for your reply. It still cannot work normally. I tested it on a 
>>>> Spark standalone cluster; I did not test it on a YARN cluster.
>>>> 
>>>> First, I tested that the cluster configuration is correct. 
>>>> http://Hadoop.Master:8080 info:
>>>> -----------------------------------
>>>> URL: spark://Hadoop.Master:7077
>>>> Workers: 2
>>>> Cores: 4 Total, 0 Used
>>>> Memory: 2.0 GB Total, 0.0 B Used
>>>> Applications: 0 Running, 1 Completed
>>>> Drivers: 0 Running, 0 Completed
>>>> Status: ALIVE
>>>> ----------------------------------
>>>> 
>>>> Environment
>>>> ----------------------------------
>>>> OS: CentOS release 6.5 (Final)
>>>> JDK: 1.6.0_45
>>>> Mahout: mahout-1.0-SNAPSHOT(mvn -Dhadoop2.version=2.4.1 -DskipTests clean 
>>>> package)
>>>> Hadoop: 2.4.1
>>>> Spark: spark-1.1.0-bin-2.4.1(mvn -Pyarn -Phadoop-2.4 
>>>> -Dhadoop.version=2.4.1 -Phive -DskipTests clean package)
>>>> ----------------------------------
>>>> 
>>>> Shell:
>>>>    spark-submit --class org.apache.spark.examples.SparkPi --master 
>>>> spark://Hadoop.Master:7077 --executor-memory 1g --total-executor-cores 2 
>>>> /root/spark-examples_2.10-1.1.0.jar 1000
>>>> 
>>>> It works OK; part of the log from this command:
>>>> ----------------------------------
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 995.0 in 
>>>> stage 0.0 (TID 995) in 17 ms on Hadoop.Slave1 (996/1000)
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Starting task 998.0 in 
>>>> stage 0.0 (TID 998, Hadoop.Slave2, PROCESS_LOCAL, 1225 bytes)
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 996.0 in 
>>>> stage 0.0 (TID 996) in 20 ms on Hadoop.Slave2 (997/1000)
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Starting task 999.0 in 
>>>> stage 0.0 (TID 999, Hadoop.Slave1, PROCESS_LOCAL, 1225 bytes)
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 997.0 in 
>>>> stage 0.0 (TID 997) in 27 ms on Hadoop.Slave1 (998/1000)
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 998.0 in 
>>>> stage 0.0 (TID 998) in 31 ms on Hadoop.Slave2 (999/1000)
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSetManager: Finished task 999.0 in 
>>>> stage 0.0 (TID 999) in 20 ms on Hadoop.Slave1 (1000/1000)
>>>> 14/09/19 19:48:00 INFO scheduler.DAGScheduler: Stage 0 (reduce at 
>>>> SparkPi.scala:35) finished in 25.109 s
>>>> 14/09/19 19:48:00 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, 
>>>> whose tasks have all completed, from pool
>>>> 14/09/19 19:48:00 INFO spark.SparkContext: Job finished: reduce at 
>>>> SparkPi.scala:35, took 26.156022565 s
>>>> Pi is roughly 3.14156112
>>>> ----------------------------------
>>>> 
>>>> Second, I tested spark-itemsimilarity on "local"; it works OK. Shell:
>>>>    mahout spark-itemsimilarity -i /test/ss/input/data.txt -o 
>>>> /test/ss/output -os -ma local[2] -sem 512m -f1 purchase -f2 view -ic 2 -fc 
>>>> 1
>>>> 
>>>> Third, I tested spark-itemsimilarity on the cluster. Shell:
>>>>    mahout spark-itemsimilarity -i /test/ss/input/data.txt -o 
>>>> /test/ss/output -os -ma spark://Hadoop.Master:7077 -sem 512m -f1 purchase 
>>>> -f2 view -ic 2 -fc 1
>>>> 
>>>> It does not work. Full log:
>>>> ----------------------------------
>>>> MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>> SLF4J: Found binding in 
>>>> [jar:file:/usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>> SLF4J: Found binding in 
>>>> [jar:file:/usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>> SLF4J: Found binding in 
>>>> [jar:file:/usr/spark-1.1.0-bin-2.4.1/lib/spark-assembly-1.1.0-hadoop2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
>>>> explanation.
>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>> 14/09/19 20:31:07 INFO spark.SecurityManager: Changing view acls to: root
>>>> 14/09/19 20:31:07 INFO spark.SecurityManager: SecurityManager: 
>>>> authentication disabled; ui acls disabled; users with view permissions: 
>>>> Set(root)
>>>> 14/09/19 20:31:08 INFO slf4j.Slf4jLogger: Slf4jLogger started
>>>> 14/09/19 20:31:08 INFO Remoting: Starting remoting
>>>> 14/09/19 20:31:08 INFO Remoting: Remoting started; listening on addresses 
>>>> :[akka.tcp://spark@Hadoop.Master:47597]
>>>> 14/09/19 20:31:08 INFO Remoting: Remoting now listens on addresses: 
>>>> [akka.tcp://spark@Hadoop.Master:47597]
>>>> 14/09/19 20:31:08 INFO spark.SparkEnv: Registering MapOutputTracker
>>>> 14/09/19 20:31:08 INFO spark.SparkEnv: Registering BlockManagerMaster
>>>> 14/09/19 20:31:08 INFO storage.DiskBlockManager: Created local directory 
>>>> at /tmp/spark-local-20140919203108-e4e3
>>>> 14/09/19 20:31:08 INFO storage.MemoryStore: MemoryStore started with 
>>>> capacity 2.3 GB.
>>>> 14/09/19 20:31:08 INFO network.ConnectionManager: Bound socket to port 
>>>> 47186 with id = ConnectionManagerId(Hadoop.Master,47186)
>>>> 14/09/19 20:31:08 INFO storage.BlockManagerMaster: Trying to register 
>>>> BlockManager
>>>> 14/09/19 20:31:08 INFO storage.BlockManagerInfo: Registering block manager 
>>>> Hadoop.Master:47186 with 2.3 GB RAM
>>>> 14/09/19 20:31:08 INFO storage.BlockManagerMaster: Registered BlockManager
>>>> 14/09/19 20:31:08 INFO spark.HttpServer: Starting HTTP Server
>>>> 14/09/19 20:31:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 14/09/19 20:31:08 INFO server.AbstractConnector: Started 
>>>> SocketConnector@0.0.0.0:41116
>>>> 14/09/19 20:31:08 INFO broadcast.HttpBroadcast: Broadcast server started 
>>>> at http://192.168.204.128:41116
>>>> 14/09/19 20:31:08 INFO spark.HttpFileServer: HTTP File server directory is 
>>>> /tmp/spark-10744709-bbeb-4d79-8bfe-d64d77799fb3
>>>> 14/09/19 20:31:08 INFO spark.HttpServer: Starting HTTP Server
>>>> 14/09/19 20:31:08 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 14/09/19 20:31:08 INFO server.AbstractConnector: Started 
>>>> SocketConnector@0.0.0.0:59137
>>>> 14/09/19 20:31:09 INFO server.Server: jetty-8.y.z-SNAPSHOT
>>>> 14/09/19 20:31:09 INFO server.AbstractConnector: Started 
>>>> SelectChannelConnector@0.0.0.0:4040
>>>> 14/09/19 20:31:09 INFO ui.SparkUI: Started SparkUI at 
>>>> http://Hadoop.Master:4040
>>>> 14/09/19 20:31:10 WARN util.NativeCodeLoader: Unable to load native-hadoop 
>>>> library for your platform... using builtin-java classes where applicable
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/math-scala/target/mahout-math-scala_2.10-1.0-SNAPSHOT.jar
>>>>  at 
>>>> http://192.168.204.128:59137/jars/mahout-math-scala_2.10-1.0-SNAPSHOT.jar 
>>>> with timestamp 1411129870562
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT.jar 
>>>> at http://192.168.204.128:59137/jars/mahout-mrlegacy-1.0-SNAPSHOT.jar with 
>>>> timestamp 1411129870588
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/math/target/mahout-math-1.0-SNAPSHOT.jar at 
>>>> http://192.168.204.128:59137/jars/mahout-math-1.0-SNAPSHOT.jar with 
>>>> timestamp 1411129870612
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT.jar 
>>>> at http://192.168.204.128:59137/jars/mahout-spark_2.10-1.0-SNAPSHOT.jar 
>>>> with timestamp 1411129870618
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/math-scala/target/mahout-math-scala_2.10-1.0-SNAPSHOT.jar
>>>>  at 
>>>> http://192.168.204.128:59137/jars/mahout-math-scala_2.10-1.0-SNAPSHOT.jar 
>>>> with timestamp 1411129870620
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/mrlegacy/target/mahout-mrlegacy-1.0-SNAPSHOT.jar 
>>>> at http://192.168.204.128:59137/jars/mahout-mrlegacy-1.0-SNAPSHOT.jar with 
>>>> timestamp 1411129870631
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/math/target/mahout-math-1.0-SNAPSHOT.jar at 
>>>> http://192.168.204.128:59137/jars/mahout-math-1.0-SNAPSHOT.jar with 
>>>> timestamp 1411129870644
>>>> 14/09/19 20:31:10 INFO spark.SparkContext: Added JAR 
>>>> /usr/mahout-1.0-SNAPSHOT/spark/target/mahout-spark_2.10-1.0-SNAPSHOT.jar 
>>>> at http://192.168.204.128:59137/jars/mahout-spark_2.10-1.0-SNAPSHOT.jar 
>>>> with timestamp 1411129870647
>>>> 14/09/19 20:31:10 INFO client.AppClient$ClientActor: Connecting to master 
>>>> spark://Hadoop.Master:7077...
>>>> 14/09/19 20:31:13 INFO storage.MemoryStore: ensureFreeSpace(86126) called 
>>>> with curMem=0, maxMem=2491102003
>>>> 14/09/19 20:31:13 INFO storage.MemoryStore: Block broadcast_0 stored as 
>>>> values to memory (estimated size 84.1 KB, free 2.3 GB)
>>>> 14/09/19 20:31:13 INFO mapred.FileInputFormat: Total input paths to 
>>>> process : 1
>>>> 14/09/19 20:31:13 INFO spark.SparkContext: Starting job: collect at 
>>>> TextDelimitedReaderWriter.scala:74
>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Registering RDD 7 (distinct 
>>>> at TextDelimitedReaderWriter.scala:74)
>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Got job 0 (collect at 
>>>> TextDelimitedReaderWriter.scala:74) with 2 output partitions 
>>>> (allowLocal=false)
>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Final stage: Stage 
>>>> 0(collect at TextDelimitedReaderWriter.scala:74)
>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Parents of final stage: 
>>>> List(Stage 1)
>>>> 14/09/19 20:31:13 INFO scheduler.DAGScheduler: Missing parents: List(Stage 
>>>> 1)
>>>> 14/09/19 20:31:14 INFO scheduler.DAGScheduler: Submitting Stage 1 
>>>> (MapPartitionsRDD[7] at distinct at TextDelimitedReaderWriter.scala:74), 
>>>> which has no missing parents
>>>> 14/09/19 20:31:14 INFO scheduler.DAGScheduler: Submitting 2 missing tasks 
>>>> from Stage 1 (MapPartitionsRDD[7] at distinct at 
>>>> TextDelimitedReaderWriter.scala:74)
>>>> 14/09/19 20:31:14 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 
>>>> with 2 tasks
>>>> 14/09/19 20:31:29 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>> registered and have sufficient memory
>>>> 14/09/19 20:31:30 INFO client.AppClient$ClientActor: Connecting to master 
>>>> spark://Hadoop.Master:7077...
>>>> 14/09/19 20:31:44 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>> registered and have sufficient memory
>>>> 14/09/19 20:31:50 INFO client.AppClient$ClientActor: Connecting to master 
>>>> spark://Hadoop.Master:7077...
>>>> 14/09/19 20:31:59 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>> registered and have sufficient memory
>>>> 14/09/19 20:32:10 ERROR cluster.SparkDeploySchedulerBackend: Application 
>>>> has been killed. Reason: All masters are unresponsive! Giving up.
>>>> 14/09/19 20:32:10 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, 
>>>> whose tasks have all completed, from pool
>>>> 14/09/19 20:32:10 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1
>>>> 14/09/19 20:32:10 INFO scheduler.DAGScheduler: Failed to run collect at 
>>>> TextDelimitedReaderWriter.scala:74
>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted 
>>>> due to stage failure: All masters are unresponsive! Giving up.
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>> at 
>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>> at scala.Option.foreach(Option.scala:236)
>>>> at 
>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>> at 
>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>> at 
>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>> at 
>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>> at 
>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/metrics/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/static,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/executors/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/executors,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/environment/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/environment,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/storage/rdd,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/storage/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/storage,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages/pool/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages/pool,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages/stage/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages/stage,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages/json,null}
>>>> 14/09/19 20:32:10 INFO handler.ContextHandler: stopped 
>>>> o.e.j.s.ServletContextHandler{/stages,null}
>>>> ----------------------------------
>>>> 
>>>> Thanks.
>>>> 
>>>> 
>>>> 
>>>> On Sep 27, 2014, at 01:05, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>> 
>>>>> Any luck with this?
>>>>> 
>>>>> If not could you send a full stack trace and check on the cluster 
>>>>> machines for other logs that might help.
>>>>> 
>>>>> 
>>>>> On Sep 25, 2014, at 6:34 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>>> 
>>>>> Looks like a Spark error as far as I can tell. This error is very generic 
>>>>> and indicates that the job was not accepted for execution, so Spark may 
>>>>> be configured wrong. This looks like a question for the Spark people.
>>>>> 
>>>>> My Spark sanity check:
>>>>> 
>>>>> 1) In the Spark UI at http://Hadoop.Master:8080, does everything look 
>>>>> correct?
>>>>> 2) Have you tested your spark *cluster* with one of their examples? Have 
>>>>> you run *any non-Mahout* code on the cluster to check that it is 
>>>>> configured properly? 
>>>>> 3) Are you using exactly the same Spark and Hadoop locally as on the 
>>>>> cluster? 
>>>>> 4) Did you launch both local and cluster jobs from the same cluster 
>>>>> machine? The only difference being the master URL (local[2] vs. 
>>>>> spark://Hadoop.Master:7077)?
>>>>> 
>>>>> 14/09/22 04:12:47 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>>> registered and have sufficient memory
>>>>> 14/09/22 04:12:49 INFO client.AppClient$ClientActor: Connecting to master 
>>>>> spark://Hadoop.Master:7077...
>>>>> 
>>>>> 
>>>>> On Sep 24, 2014, at 8:18 PM, pol <swallow_p...@163.com> wrote:
>>>>> 
>>>>> Hi, Pat
>>>>>   The dataset is the same, and the data is very small for this test. Is 
>>>>> this a bug?
>>>>> 
>>>>> 
>>>>> On Sep 25, 2014, at 02:57, Pat Ferrel <pat.fer...@gmail.com> wrote:
>>>>> 
>>>>>> Are you using different data sets on the local and cluster?
>>>>>> 
>>>>>> Try increasing spark memory with -sem, I use -sem 6g for the epinions 
>>>>>> data set.
>>>>>> 
>>>>>> The ID dictionaries are kept in-memory on each cluster machine so a 
>>>>>> large number of user or item IDs will need more memory.
>>>>>> 
>>>>>> 
>>>>>> On Sep 24, 2014, at 9:31 AM, pol <swallow_p...@163.com> wrote:
>>>>>> 
>>>>>> Hi, All
>>>>>>  
>>>>>>  I am sure that launching Spark standalone on a cluster works OK, but it 
>>>>>> does not work for spark-itemsimilarity.
>>>>>> 
>>>>>>  Launching on 'local' is OK:
>>>>>> mahout spark-itemsimilarity -i /user/root/test/input/data.txt -o 
>>>>>> /user/root/test/output -os -ma local[2] -f1 purchase -f2 view -ic 2 -fc 
>>>>>> 1 -sem 1g
>>>>>> 
>>>>>>  but launching on a standalone cluster gives an error:
>>>>>> mahout spark-itemsimilarity -i /user/root/test/input/data.txt -o 
>>>>>> /user/root/test/output -os -ma spark://Hadoop.Master:7077 -f1 purchase 
>>>>>> -f2 view -ic 2 -fc 1 -sem 1g
>>>>>> ------------
>>>>>> 14/09/22 04:12:47 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>>>> registered and have sufficient memory
>>>>>> 14/09/22 04:12:49 INFO client.AppClient$ClientActor: Connecting to 
>>>>>> master spark://Hadoop.Master:7077...
>>>>>> 14/09/22 04:13:02 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>>>> registered and have sufficient memory
>>>>>> 14/09/22 04:13:09 INFO client.AppClient$ClientActor: Connecting to 
>>>>>> master spark://Hadoop.Master:7077...
>>>>>> 14/09/22 04:13:17 WARN scheduler.TaskSchedulerImpl: Initial job has not 
>>>>>> accepted any resources; check your cluster UI to ensure that workers are 
>>>>>> registered and have sufficient memory
>>>>>> 14/09/22 04:13:29 ERROR cluster.SparkDeploySchedulerBackend: Application 
>>>>>> has been killed. Reason: All masters are unresponsive! Giving up.
>>>>>> 14/09/22 04:13:29 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, 
>>>>>> whose tasks have all completed, from pool 
>>>>>> 14/09/22 04:13:29 INFO scheduler.TaskSchedulerImpl: Cancelling stage 1
>>>>>> 14/09/22 04:13:29 INFO scheduler.DAGScheduler: Failed to run collect at 
>>>>>> TextDelimitedReaderWriter.scala:74
>>>>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted 
>>>>>> due to stage failure: All masters are unresponsive! Giving up.
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>>>>>>  at 
>>>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>>>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>>>>>>  at scala.Option.foreach(Option.scala:236)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>>>>>>  at 
>>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>>>>>>  at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>>>>>  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>>>>>  at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>>>>>  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>>>>>  at 
>>>>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>>>>>  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>>>>>  at 
>>>>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>>>>>  at 
>>>>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>>>>>  at 
>>>>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>>>> ------------
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 


