Thank you for updating the files, Holden! I was actually using that same text in my files located on HDFS. Could the files being located on HDFS be the reason why the example gets stuck?

I copied and pasted the code provided on GitHub; the only things I changed were:
a) file paths to: val spam = sc.textFile("hdfs://ip-...")
b) shortened ham to 9 lines, and set numFeatures to 9 (also tried out 100)
c) added 3 count statements

The program outputs:

features in spam: 9 (spamFeatures.count())
features in ham: 9 (hamFeatures.count())
features in training data: 18 (trainingData.count())

and then gets stuck at "count at DataValidators.scala:38" (as seen on the Web UI). The completed jobs look like this:

Completed Jobs (4)
count at 21 (spamFeatures)
count at 23 (hamFeatures)
count at 28 (trainingData.count())
first at GeneralizedLinearAlgorithm at 34 (val model = lrLearner.run(trainingData))

Is there any way I can test to see if this is a problem with my Spark setup?

Thanks!

Su

On Mon, Mar 30, 2015 at 12:10 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> Thanks for pointing that out, I've updated the ham & spam example files; they
> should be good from master currently.
>
> On Mon, Mar 30, 2015 at 10:16 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> +Holden, Joseph
>>
>> It seems that there is something wrong with the sample data file:
>> https://github.com/databricks/learning-spark/blob/master/files/ham.txt
>>
>> -Xiangrui
>>
>> On Fri, Mar 27, 2015 at 1:03 PM, Su She <suhsheka...@gmail.com> wrote:
>>>
>>> Hello Xiangrui,
>>>
>>> Hmm, yes, I have run other Spark examples (word count, Spark Streaming/Kafka, etc.)
>>> locally, the same way I'm trying to run this MLlib example (I've
>>> tried local[2] and local[4]).
>>>
>>> 1) I did trainingData.count() and the job completed. The output was
>>> 2...should this only be 2, or 400 (since each text file has 200 words)?
>>>
>>> 2) I noticed the code says: val trainingData = positiveExamples ++
>>> negativeExamples
>>>
>>> I'm not very familiar with Scala, and the ++ sign seemed odd to me, but
>>> when I tried using only a single +, it did not build.
>>>
>>> 3) I found a similar thread here:
>>> http://mail-archives.apache.org/mod_mbox/spark-user/201409.mbox/%3ccafrxrqf6drxlcsb7q1-w1puayadpnb8womwiwe_8++okq2c...@mail.gmail.com%3E
>>>
>>> It looks like Emily had the same problem (count at
>>> DataValidators.scala:38), but it doesn't seem like a solution was found. Also,
>>> I don't get any of those errors printed to the console.
>>>
>>> 4) Sorry, not sure what else to say, as this is a pretty basic example.
>>> Thank you for the help!
>>>
>>> Best,
>>>
>>> Su
>>>
>>> On Fri, Mar 27, 2015 at 11:23 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>
>>>> Hi Su,
>>>>
>>>> I'm not sure what the problem is. Did you try other Spark examples on your
>>>> cluster? Did they work? Could you try
>>>>
>>>> trainingData.count()
>>>>
>>>> before calling lrLearner.run()? Just want to check whether this is an MLlib
>>>> issue.
>>>>
>>>> Thanks,
>>>> Xiangrui
>>>>
>>>> On Wed, Mar 25, 2015 at 3:27 PM, Su She <suhsheka...@gmail.com> wrote:
>>>>>
>>>>> Hello Everyone,
>>>>>
>>>>> I was hoping to see if anyone has any additional thoughts on this, as I
>>>>> was able to find barely anything related to this error online (something
>>>>> related to dependencies/breeze?)...thank you!
>>>>>
>>>>> Best,
>>>>>
>>>>> Su
>>>>>
>>>>> On Thu, Mar 19, 2015 at 10:54 AM, Su She <suhsheka...@gmail.com> wrote:
>>>>>>
>>>>>> Hello Akhil,
>>>>>>
>>>>>> I tried running it in an application, and I got the same result. The app
>>>>>> gets stuck in Stage 1 at MLlib.scala line 32, which in my app
>>>>>> corresponds to: val model = lrLearner.run(trainingData).
>>>>>>
>>>>>> These are the details:
>>>>>>
>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
>>>>>> MLlib$.main(MLlib.scala:32)
>>>>>> MLlib.main(MLlib.scala)
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>> java.lang.reflect.Method.invoke(Method.java:606)
>>>>>> org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
>>>>>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
>>>>>> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>>>
>>>>>> Thank you for the help Akhil!
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Su
>>>>>>
>>>>>> On Thu, Mar 19, 2015 at 1:27 AM, Akhil Das <ak...@sigmoidanalytics.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> It seems it's stuck doing a count? What is happening at line 38?
>>>>>>> I'm not seeing a count operation anywhere in this code:
>>>>>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
>>>>>>>
>>>>>>> On Thu, Mar 19, 2015 at 1:32 PM, Su She <suhsheka...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hello Akhil,
>>>>>>>>
>>>>>>>> Thanks for the info! Here is my UI...I am not sure what to make of the
>>>>>>>> information here:
>>>>>>>>
>>>>>>>> Details of active stage:
>>>>>>>>
>>>>>>>> org.apache.spark.rdd.RDD.count(RDD.scala:910)
>>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:38)
>>>>>>>> org.apache.spark.mllib.util.DataValidators$$anonfun$1.apply(DataValidators.scala:37)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm$$anonfun$run$2.apply(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70)
>>>>>>>> scala.collection.immutable.List.forall(List.scala:84)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:161)
>>>>>>>> org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:146)
>>>>>>>> $line21.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:33)
>>>>>>>> $line21.$read$$iwC$$iwC$$iwC.<init>(<console>:38)
>>>>>>>> $line21.$read$$iwC$$iwC.<init>(<console>:40)
>>>>>>>> $line21.$read$$iwC.<init>(<console>:42)
>>>>>>>> $line21.$read.<init>(<console>:44)
>>>>>>>> $line21.$read$.<init>(<console>:48)
>>>>>>>> $line21.$read$.<clinit>(<console>)
>>>>>>>> $line21.$eval$.<init>(<console>:7)
>>>>>>>> $line21.$eval$.<clinit>(<console>)
>>>>>>>> $line21.$eval.$print(<console>)
>>>>>>>> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>>>>
>>>>>>>> Thank you for the help Akhil!
>>>>>>>>
>>>>>>>> -Su
>>>>>>>>
>>>>>>>> On Thu, Mar 19, 2015 at 12:49 AM, Akhil Das
>>>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>>>>
>>>>>>>>> To see these metrics, you need to open the driver UI running on
>>>>>>>>> port 4040. There you will see the Stages information, and for each
>>>>>>>>> stage you can see how much time it is spending on GC, etc. In your
>>>>>>>>> case, the parallelism seems to be 4; the higher the parallelism, the
>>>>>>>>> more tasks you will see.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Best Regards
>>>>>>>>>
>>>>>>>>> On Thu, Mar 19, 2015 at 1:15 PM, Su She <suhsheka...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Akhil,
>>>>>>>>>>
>>>>>>>>>> 1) How can I see how much time it is spending on Stage 1? Or what
>>>>>>>>>> if, like above, it doesn't get past Stage 1?
>>>>>>>>>>
>>>>>>>>>> 2) How can I check if it's GC time? And where would I increase
>>>>>>>>>> the parallelism for the model? I have a Spark Master and 2 Workers
>>>>>>>>>> running on CDH 5.3...what would the default spark-shell level of
>>>>>>>>>> parallelism be? I thought it would be 3.
>>>>>>>>>>
>>>>>>>>>> Thank you for the help!
>>>>>>>>>>
>>>>>>>>>> -Su
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 19, 2015 at 12:32 AM, Akhil Das
>>>>>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Can you see where exactly it is spending time? Since you said it
>>>>>>>>>>> gets to Stage 2, you should be able to see how much time it
>>>>>>>>>>> spent on Stage 1. If it's GC time, try increasing the
>>>>>>>>>>> level of parallelism, or repartition it, e.g. to
>>>>>>>>>>> sc.defaultParallelism * 3.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 19, 2015 at 12:15 PM, Su She <suhsheka...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I am trying to run this MLlib example from Learning Spark:
>>>>>>>>>>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala#L48
>>>>>>>>>>>>
>>>>>>>>>>>> Things I'm doing differently:
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Using the spark-shell instead of an application
>>>>>>>>>>>>
>>>>>>>>>>>> 2) Instead of their spam.txt and normal.txt, I have text files with
>>>>>>>>>>>> 3700 and 2700 words...nothing huge at all, and just plain text
>>>>>>>>>>>>
>>>>>>>>>>>> 3) I've used numFeatures = 100, 1000, and 10,000
>>>>>>>>>>>>
>>>>>>>>>>>> Error: I keep getting stuck when I try to run the model:
>>>>>>>>>>>>
>>>>>>>>>>>> val model = new LogisticRegressionWithSGD().run(trainingData)
>>>>>>>>>>>>
>>>>>>>>>>>> It will freeze on something like this:
>>>>>>>>>>>>
>>>>>>>>>>>> [Stage 1:==============>                  (1 + 0) / 4]
>>>>>>>>>>>>
>>>>>>>>>>>> Sometimes it's Stage 1, 2, or 3.
>>>>>>>>>>>>
>>>>>>>>>>>> I am not sure what I am doing wrong...any help is much
>>>>>>>>>>>> appreciated, thank you!
>>>>>>>>>>>>
>>>>>>>>>>>> -Su

--
Cell : 425-233-8271

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
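[Archive editor's note] Two Scala questions recur in this thread: what `++` does in `positiveExamples ++ negativeExamples`, and why an extra "count at DataValidators.scala:38" job runs before training starts. Both can be illustrated without a cluster. The sketch below is plain Scala, not MLlib's actual code; `Point` and `binaryLabelsOk` are hypothetical stand-ins for `LabeledPoint` and for the kind of binary-label check that `DataValidators` performs.

```scala
// Plain-Scala sketch (no Spark required). Point and binaryLabelsOk are
// illustrative stand-ins, not MLlib's real LabeledPoint / DataValidators.
object ValidationSketch {
  // Stand-in for MLlib's LabeledPoint: a label plus a feature vector.
  case class Point(label: Double, features: Seq[Double])

  // Logistic regression requires labels in {0.0, 1.0}. MLlib verifies this
  // with a counting pass over the input RDD, which is why a separate
  // "count at DataValidators.scala" job appears before training begins.
  def binaryLabelsOk(data: Seq[Point]): Boolean =
    data.count(p => p.label != 0.0 && p.label != 1.0) == 0

  def main(args: Array[String]): Unit = {
    val positiveExamples = Seq(Point(1.0, Seq(0.1, 0.2)))
    val negativeExamples = Seq(Point(0.0, Seq(0.3, 0.4)))

    // `++` is ordinary concatenation (union, for RDDs), not a typo:
    // a single `+` is not defined for collections, hence the build failure.
    val trainingData = positiveExamples ++ negativeExamples

    println(trainingData.size)            // 2
    println(binaryLabelsOk(trainingData)) // true
  }
}
```

One related note on the count question above: sc.textFile produces one RDD element per line, not per word, so a trainingData.count() of 2 is consistent with two input files that each contain all their text on a single line.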