Thank you for updating the files, Holden! I was actually using that
same text in my files located on HDFS. Could the files being located
on HDFS be the reason why the example gets stuck? I copied and pasted
the code provided on GitHub; the only things I changed were:

a) file paths to: val spam = sc.textFile("hdfs://ip-...")

b) Shortened ham to 9 lines, and set numFeatures to 9 (also tried out 100).

c) added 3 count statements

The program outputs:

features in spam: 9 (spamFeatures.count())
features in ham: 9 (hamFeatures.count())
features in training data: 18 (trainingData.count())

and then gets stuck at "count at DataValidators.scala:38" (as seen in the Web UI).

The completed jobs look like this:

Completed Jobs(4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model =

Is there any way I can test whether this is a problem with my Spark
setup? Thanks!
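
One way to separate a cluster or HDFS problem from an MLlib problem is to run the same pipeline on a few in-memory lines. The sketch below is meant for spark-shell and assumes the shell's `sc`. The sample sentences are made up and stand in for the HDFS files. It also repeats by hand the kind of label check that the "count at DataValidators.scala:38" job appears to correspond to (a filter for labels other than 0 and 1, followed by a count); that reading of the stack trace is an assumption, not something confirmed in this thread.

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Assumes spark-shell, where `sc` is already defined.
// The sample sentences are invented; they replace the spam/ham files on HDFS.
val spam = sc.parallelize(Seq("cheap pills buy now", "win money fast now"))
val ham = sc.parallelize(Seq("meeting moved to noon", "see you at lunch"))

// Hash each line's words into a fixed-size term-frequency vector.
val tf = new HashingTF(numFeatures = 100)
val spamFeatures = spam.map(line => tf.transform(line.split(" ")))
val hamFeatures = ham.map(line => tf.transform(line.split(" ")))

// Label spam as 1 and ham as 0, then combine both sets into one RDD.
val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))
val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))
val trainingData = (positiveExamples ++ negativeExamples).cache()

// One example per input line, so this counts lines, not words.
println("training examples: " + trainingData.count())

// Hand-written binary-label check: count labels that are neither 0 nor 1.
val numInvalid = trainingData.filter(p => p.label != 0.0 && p.label != 1.0).count()
println("invalid labels: " + numInvalid)

// Train on the tiny data set; this is the call that hangs in the thread.
val model = new LogisticRegressionWithSGD().run(trainingData)
println("trained a model with " + model.weights.size + " weights")
```

If this finishes when the shell is started with --master local[2] but the same statements hang against the cluster, the setup (for example, whether the executors actually have free cores) becomes a more likely suspect than the data or MLlib itself.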


On Mon, Mar 30, 2015 at 12:10 PM, Holden Karau <> wrote:
> Thanks for pointing that out, I've updated the ham & spam example files, they 
> should be good from master currently.
> On Mon, Mar 30, 2015 at 10:16 AM, Xiangrui Meng <> wrote:
>> +Holden, Joseph
>> It seems that there is something wrong with the sample data file: 
>> -Xiangrui
>> On Fri, Mar 27, 2015 at 1:03 PM, Su She <> wrote:
>>> Hello Xiangrui,
>>> Hmm, yes, I have run other Spark examples (word count, Spark Streaming/Kafka, 
>>> etc.) locally, the same way I'm trying to run this MLlib example (I've 
>>> tried local[2] and local[4]).
>>> 1) I did trainingData.count() and the job was completed. The output was 
>>> 2...should this only be 2 or 400 (since each text file has 200 words)?
>>> 2) I noticed the code says: val trainingData = positiveExamples ++ 
>>> negativeExamples
>>> I'm not very familiar with Scala, and the ++ sign seems weird to me, but 
>>> when I tried using only one + sign, it did not build.
>>> 3) I found a similar thread 
>>> here...
>>> it looks like Emily had the same problem (count at 
>>> DataValidators.scala:38), but doesn't seem like a solution was found. Also, 
>>> I don't get any of those errors printed to the console.
>>> 4) sorry, not sure what else to say, as this is a pretty basic example. 
>>> thank you for the help!
>>> best,
>>> Su
>>> On Fri, Mar 27, 2015 at 11:23 AM, Xiangrui Meng <> wrote:
>>>> Hi Su,
>>>> I'm not sure what the problem is. Did you try other Spark examples on your 
>>>> cluster? Did they work? Could you try
>>>> trainingData.count()
>>>> before calling Just want to check whether this is an MLlib 
>>>> issue.
>>>> Thanks,
>>>> Xiangrui
>>>> On Wed, Mar 25, 2015 at 3:27 PM, Su She <> wrote:
>>>>> Hello Everyone,
>>>>> I was hoping to see if anyone has any additional thoughts on this as I 
>>>>> was able to find barely anything related to this error online (something 
>>>>> related to dependencies/breeze?)...thank you!
>>>>> Best,
>>>>> Su
>>>>> On Thu, Mar 19, 2015 at 10:54 AM, Su She <> wrote:
>>>>>> Hello Akhil,
>>>>>> I tried running it in an application, and I got the same result. The app 
>>>>>> gets stuck in Stage 1 at MLlib.scala at line 32 which in my app 
>>>>>> corresponds to: val model =
>>>>>> These are the details:
>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>> MLlib$.main(MLlib.scala:32)
>>>>>> MLlib.main(MLlib.scala)
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(
>>>>>> java.lang.reflect.Method.invoke(
>>>>>> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>>>>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>>>>> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>> Thank you for the help Akhil!
>>>>>> Best,
>>>>>> Su
>>>>>> On Thu, Mar 19, 2015 at 1:27 AM, Akhil Das <> 
>>>>>> wrote:
>>>>>>> It seems it's stuck doing a count? What's happening at line 38? I'm not 
>>>>>>> seeing a count operation anywhere in this code.
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>> On Thu, Mar 19, 2015 at 1:32 PM, Su She <> wrote:
>>>>>>>> Hello Akhil,
>>>>>>>> Thanks for the info! Here is my UI...I am not sure what to make of the 
>>>>>>>> information here:
>>>>>>>> Details of active stage:
>>>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>>>> $line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>>>>>>>> $line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
>>>>>>>> $line21.$read$$iwC$$iwC.<init>(<console>:40)
>>>>>>>> $line21.$read$$iwC.<init>(<console>:42)
>>>>>>>> $line21.$read.<init>(<console>:44)
>>>>>>>> $line21.$read$.<init>(<console>:48)
>>>>>>>> $line21.$read$.<clinit>(<console>)
>>>>>>>> $line21.$eval$.<init>(<console>:7)
>>>>>>>> $line21.$eval$.<clinit>(<console>)
>>>>>>>> $line21.$eval.$print(<console>)
>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>> Thank you for the help Akhil!
>>>>>>>> -Su
>>>>>>>> On Thu, Mar 19, 2015 at 12:49 AM, Akhil Das 
>>>>>>>> <> wrote:
>>>>>>>>> To get these metrics, you need to open the driver UI running on 
>>>>>>>>> port 4040. There you will see the Stages information, and for each 
>>>>>>>>> stage you can see how much time it is spending on GC etc. In your 
>>>>>>>>> case, the parallelism seems to be 4; the higher the parallelism, the 
>>>>>>>>> more tasks you will see.
>>>>>>>>> Thanks
>>>>>>>>> Best Regards
>>>>>>>>> On Thu, Mar 19, 2015 at 1:15 PM, Su She <> wrote:
>>>>>>>>>> Hi Akhil,
>>>>>>>>>> 1) How could I see how much time it is spending on stage 1? Or what 
>>>>>>>>>> if, like above, it doesn't get past stage 1?
>>>>>>>>>> 2) How could I check whether it's GC time? And where would I increase 
>>>>>>>>>> the parallelism for the model? I have a Spark Master and 2 Workers 
>>>>>>>>>> running on CDH 5.3...what would the default spark-shell level of 
>>>>>>>>>> parallelism be...I thought it would be 3?
>>>>>>>>>> Thank you for the help!
>>>>>>>>>> -Su
>>>>>>>>>> On Thu, Mar 19, 2015 at 12:32 AM, Akhil Das 
>>>>>>>>>> <> wrote:
>>>>>>>>>>> Can you see where exactly it is spending time? As you said, it 
>>>>>>>>>>> goes to Stage 2, so you should be able to see how much time it 
>>>>>>>>>>> spent on Stage 1. See if it's GC time; if so, try increasing the 
>>>>>>>>>>> level of parallelism or repartitioning to something like 
>>>>>>>>>>> sc.defaultParallelism * 3.
>>>>>>>>>>> Thanks
>>>>>>>>>>> Best Regards
>>>>>>>>>>> On Thu, Mar 19, 2015 at 12:15 PM, Su She <> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> Hello Everyone,
>>>>>>>>>>>> I am trying to run this MLlib example from Learning Spark:
>>>>>>>>>>>> Things I'm doing differently:
>>>>>>>>>>>> 1) Using spark shell instead of an application
>>>>>>>>>>>> 2) instead of their spam.txt and normal.txt I have text files with 
>>>>>>>>>>>> 3700 and 2700 words...nothing huge at all and just plain text
>>>>>>>>>>>> 3) I've used numFeatures = 100, 1000 and 10,000
>>>>>>>>>>>> Error: I keep getting stuck when I try to run the model:
>>>>>>>>>>>> val model = new LogisticRegressionWithSGD().run(trainingData)
>>>>>>>>>>>> It will freeze on something like this:
>>>>>>>>>>>> [Stage 1:==============>                                            (1 + 0) / 4]
>>>>>>>>>>>> Sometimes it's Stage 1, 2 or 3.
>>>>>>>>>>>> I am not sure what I am doing wrong...any help is much 
>>>>>>>>>>>> appreciated, thank you!
>>>>>>>>>>>> -Su
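
On the ++ question raised earlier in the thread: in Scala, ++ is the usual operator for concatenating two collections, and on an RDD it is an alias for union, so positiveExamples ++ negativeExamples is a single dataset holding both sets of examples. A minimal sketch follows, with plain Lists as hypothetical stand-ins for the RDDs so that it runs without Spark; the object and method names are made up for illustration.

```scala
object ConcatDemo {
  // Hypothetical helper: joins two lists the same way the example joins its two RDDs.
  def combine[A](positive: List[A], negative: List[A]): List[A] =
    positive ++ negative

  def main(args: Array[String]): Unit = {
    // Invented stand-ins for the labeled spam and ham examples.
    val positiveExamples = List("spam line 1", "spam line 2")
    val negativeExamples = List("ham line 1")

    val trainingData = combine(positiveExamples, negativeExamples)

    // Every element of both inputs is kept, in order.
    println(trainingData.size) // 3
  }
}
```

The same reasoning may explain the earlier count of 2: count() on the combined RDD returns the number of examples (one per input line), not the number of words, so two single-line files would give 2.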
