Thank you for updating the files, Holden! I was actually using that same text in my files located on HDFS. Could the files being located on HDFS be the reason why the example gets stuck?

I copied and pasted the code provided on GitHub; the only things I changed were:
a) file paths to: val spam = sc.textFile("hdfs://ip-...")
b) shortened ham to 9 lines, and set numFeatures to 9 (also tried out 100)
c) added 3 count statements

The program outputs:

features in spam: 9 (spamFeatures.count())
features in ham: 9 (hamFeatures.count())
features in training data: 18 (trainingData.count())

and then gets stuck at "count at DataValidators.scala:38" (as seen on the Web UI). The completed jobs look like this:

Completed Jobs (4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData))

Is there any way I can test to see if this is a problem with my Spark setup?

Thanks!

Su

On Mon, Mar 30, 2015 at 12:10 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> Thanks for pointing that out, I've updated the ham & spam example files; they
> should be good from master currently.
>
> On Mon, Mar 30, 2015 at 10:16 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> +Holden, Joseph
>>
>> It seems that there is something wrong with the sample data file:
>> https://github.com/databricks/learning-spark/blob/master/files/ham.txt
>>
>> -Xiangrui
>>
>> On Fri, Mar 27, 2015 at 1:03 PM, Su She <suhsheka...@gmail.com> wrote:
>>>
>>> Hello Xiangrui,
>>>
>>> Hmm, yes, I have run other Spark examples (word count, Spark Streaming/Kafka, etc.)
>>> locally, the same way I'm trying to run this MLlib example (I've
>>> tried local[2] and local[4]).
>>>
>>> 1) I did trainingData.count() and the job completed. The output was
>>> 2...should this only be 2, or 400 (since each text file has 200 words)?
>>>
>>> 2) I noticed the code says: val trainingData = positiveExamples ++
>>> negativeExamples
>>>
>>> I'm not very familiar with Scala, and the ++ sign seemed odd to me, but
>>> when I tried using only a single +, it did not build.
>>>
>>> 3) I found a similar thread here:
>>> http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3ccafrxrqf6drxlcsb7q1-w1puayadpnb8womwiwe_8++okq2c...@mail.gmail.com%3E
>>>
>>> It looks like Emily had the same problem (count at
>>> DataValidators.scala:38), but it doesn't seem like a solution was found. Also,
>>> I don't get any of those errors printed to the console.
>>>
>>> 4) Sorry, not sure what else to say, as this is a pretty basic example.
>>> Thank you for the help!
>>>
>>> Best,
>>>
>>> Su
>>>
>>> On Fri, Mar 27, 2015 at 11:23 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>
>>>> Hi Su,
>>>>
>>>> I'm not sure what the problem is. Did you try other Spark examples on your
>>>> cluster? Did they work? Could you try
>>>>
>>>> trainingData.count()
>>>>
>>>> before calling lrLearner.run()? Just want to check whether this is an MLlib
>>>> issue.
>>>>
>>>> Thanks,
>>>> Xiangrui
>>>>
>>>> On Wed, Mar 25, 2015 at 3:27 PM, Su She <suhsheka...@gmail.com> wrote:
>>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> I was hoping to see if anyone has any additional thoughts on this, as I
>>>>> was able to find barely anything related to this error online (something
>>>>> related to dependencies/breeze?)...thank you!
>>>>>
>>>>> Best,
>>>>>
>>>>> Su
>>>>>
>>>>> On Thu, Mar 19, 2015 at 10:54 AM, Su She <suhsheka...@gmail.com> wrote:
>>>>>>
>>>>>> Hello Akhil,
>>>>>>
>>>>>> I tried running it in an application, and I got the same result. The app
>>>>>> gets stuck in Stage 1 at MLlib.scala line 32, which in my app
>>>>>> corresponds to: val model = lrLearner.run(trainingData).
>>>>>>
>>>>>> These are the details:
>>>>>>
>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
>>>>>> MLlib$.main(MLlib.scala:32)
>>>>>> MLlib.main(MLlib.scala)
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>>>>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>>>>> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>
>>>>>> Thank you for the help Akhil!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Su
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:27 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> It seems it's stuck doing a count? What is happening at line 38?
>>>>>>> I'm not seeing a count operation anywhere in this code:
>>>>>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Thu, Mar 19, 2015 at 1:32 PM, Su She <suhsheka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello Akhil,
>>>>>>>>
>>>>>>>> Thanks for the info! Here is my UI...I am not sure what to make of the
>>>>>>>> information here:
>>>>>>>>
>>>>>>>> Details of active stage:
>>>>>>>>
>>>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
>>>>>>>> $line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>>>>>>>> $line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
>>>>>>>> $line21.$read$$iwC$$iwC.<init>(<console>:40)
>>>>>>>> $line21.$read$$iwC.<init>(<console>:42)
>>>>>>>> $line21.$read.<init>(<console>:44)
>>>>>>>> $line21.$read$.<init>(<console>:48)
>>>>>>>> $line21.$read$.<clinit>(<console>)
>>>>>>>> $line21.$eval$.<init>(<console>:7)
>>>>>>>> $line21.$eval$.<clinit>(<console>)
>>>>>>>> $line21.$eval.$print(<console>)
>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>
>>>>>>>> Thank you for the help Akhil!
>>>>>>>>
>>>>>>>> -Su
>>>>>>>>
>>>>>>>> On Thu, Mar 19, 2015 at 12:49 AM, Akhil Das
>>>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>>>>
>>>>>>>>> To see these metrics, you need to open the driver UI running on
>>>>>>>>> port 4040. There you will see the Stages information, and for each
>>>>>>>>> stage you can see how much time it is spending on GC, etc. In your
>>>>>>>>> case, the parallelism seems to be 4; the higher the parallelism, the
>>>>>>>>> more tasks you will see.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Best Regards
>>>>>>>>>
>>>>>>>>> On Thu, Mar 19, 2015 at 1:15 PM, Su She <suhsheka...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Akhil,
>>>>>>>>>>
>>>>>>>>>> 1) How can I see how much time it is spending on Stage 1? Or what
>>>>>>>>>> if, like above, it doesn't get past Stage 1?
>>>>>>>>>>
>>>>>>>>>> 2) How can I check if it's GC time? And where would I increase
>>>>>>>>>> the parallelism for the model? I have a Spark Master and 2 Workers
>>>>>>>>>> running on CDH 5.3...what would the default spark-shell level of
>>>>>>>>>> parallelism be? I thought it would be 3.
>>>>>>>>>>
>>>>>>>>>> Thank you for the help!
>>>>>>>>>>
>>>>>>>>>> -Su
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 19, 2015 at 12:32 AM, Akhil Das
>>>>>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Can you see where exactly it is spending time? Since you said it
>>>>>>>>>>> gets to Stage 2, you should be able to see how much time it
>>>>>>>>>>> spent on Stage 1. If it's GC time, try increasing the
>>>>>>>>>>> level of parallelism, or repartition it, e.g. to
>>>>>>>>>>> sc.defaultParallelism * 3.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 19, 2015 at 12:15 PM, Su She <suhsheka...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to run this MLlib example from Learning Spark:
>>>>>>>>>>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
>>>>>>>>>>>>
>>>>>>>>>>>> Things I'm doing differently:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Using the spark-shell instead of an application
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Instead of their spam.txt and normal.txt, I have text files with
>>>>>>>>>>>> 3700 and 2700 words...nothing huge at all, and just plain text
>>>>>>>>>>>>
>>>>>>>>>>>> 3) I've used numFeatures = 100, 1000, and 10,000
>>>>>>>>>>>>
>>>>>>>>>>>> Error: I keep getting stuck when I try to run the model:
>>>>>>>>>>>>
>>>>>>>>>>>> val model = new LogisticRegressionWithSGD().run(trainingData)
>>>>>>>>>>>>
>>>>>>>>>>>> It will freeze on something like this:
>>>>>>>>>>>>
>>>>>>>>>>>> [Stage 1:==============>                  (1 + 0) / 4]
>>>>>>>>>>>>
>>>>>>>>>>>> Sometimes it's Stage 1, 2, or 3.
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure what I am doing wrong...any help is much
>>>>>>>>>>>> appreciated, thank you!
>>>>>>>>>>>>
>>>>>>>>>>>> -Su

--
Cell : 425-233-8271

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
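[Archive editor's note] Two Scala questions recur in this thread: what `++` does in `positiveExamples ++ negativeExamples`, and why an extra "count at DataValidators.scala:38" job runs before training starts. Both can be illustrated without a cluster. The sketch below is plain Scala, not MLlib's actual code; `Point` and `binaryLabelsOk` are hypothetical stand-ins for `LabeledPoint` and for the kind of binary-label check that `DataValidators` performs.

```scala
// Plain-Scala sketch (no Spark required). Point and binaryLabelsOk are
// illustrative stand-ins, not MLlib's real LabeledPoint / DataValidators.
object ValidationSketch {
  // Stand-in for MLlib's LabeledPoint: a label plus a feature vector.
  case class Point(label: Double, features: Seq[Double])

  // Logistic regression requires labels in {0.0, 1.0}. MLlib verifies this
  // with a counting pass over the input RDD, which is why a separate
  // "count at DataValidators.scala" job appears before training begins.
  def binaryLabelsOk(data: Seq[Point]): Boolean =
    data.count(p => p.label != 0.0 && p.label != 1.0) == 0

  def main(args: Array[String]): Unit = {
    val positiveExamples = Seq(Point(1.0, Seq(0.1, 0.2)))
    val negativeExamples = Seq(Point(0.0, Seq(0.3, 0.4)))

    // `++` is ordinary concatenation (union, for RDDs), not a typo:
    // a single `+` is not defined for collections, hence the build failure.
    val trainingData = positiveExamples ++ negativeExamples

    println(trainingData.size)            // 2
    println(binaryLabelsOk(trainingData)) // true
  }
}
```

One related note on the count question above: sc.textFile produces one RDD element per line, not per word, so a trainingData.count() of 2 is consistent with two input files that each contain all their text on a single line.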