Hi Xi,

Please create a JIRA if it takes longer to locate the issue. Did you try a smaller k?
Best,
Xiangrui

On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen <davidshe...@gmail.com> wrote:
> Hi Burak,
>
> After I added .repartition(sc.defaultParallelism), I can see from the log
> that the partition number is set to 32. But in the Spark UI, it seems all
> the data are loaded onto one executor. Previously they were loaded onto 4
> executors.
>
> Any idea?
>
> Thanks,
> David
>
> On Fri, Mar 27, 2015 at 11:01 AM Xi Shen <davidshe...@gmail.com> wrote:
>>
>> How do I get the number of cores that I specified at the command line? I
>> want to use "spark.default.parallelism". I have 4 executors, each with 8
>> cores. According to
>> https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
>> the "spark.default.parallelism" value will be 4 * 8 = 32. I think that is
>> too large, or inappropriate. Please give some suggestions.
>>
>> I have already used cache, and count to pre-cache.
>>
>> I can try a smaller k for testing, but eventually I will have to use
>> k = 5000 or even larger, because I estimate our data set has that many
>> clusters.
>>
>> Thanks,
>> David
>>
>> On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>> Hi David,
>>>
>>> The number of centroids (k=5000) seems too large and is probably the
>>> cause of the code taking too long.
>>>
>>> Can you please try the following:
>>> 1) Repartition data to the number of available cores with
>>>    .repartition(numCores)
>>> 2) Cache data
>>> 3) Call .count() on data right before k-means
>>> 4) Try k=500 (even less if possible)
>>>
>>> Thanks,
>>> Burak
>>>
>>> On Mar 26, 2015 4:15 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>>> >
>>> > The code is very simple:
>>> >
>>> >     val data = sc.textFile("very/large/text/file") map { l =>
>>> >       // turn each line into a dense vector
>>> >       Vectors.dense(...)
>>> >     }
>>> >
>>> >     // the resulting data set is about 40k vectors
>>> >
>>> >     KMeans.train(data, k=5000, maxIterations=500)
>>> >
>>> > I just killed my application.
>>> > In the log I found this:
>>> >
>>> >     15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
>>> >     block broadcast_26_piece0
>>> >     15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
>>> >     connection from
>>> >     workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
>>> >     java.io.IOException: An existing connection was forcibly closed by
>>> >     the remote host
>>> >
>>> > Notice the time gap. I think it means the worker nodes did not
>>> > generate any log at all for about 12 hrs. Does it mean they were not
>>> > working at all?
>>> >
>>> > But when testing with a very small data set, my application works and
>>> > outputs the expected data.
>>> >
>>> > Thanks,
>>> > David
>>> >
>>> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brk...@gmail.com> wrote:
>>> >>
>>> >> Can you share the code snippet of how you call k-means? Do you cache
>>> >> the data before k-means? Did you repartition the data?
>>> >>
>>> >> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>>> >>>
>>> >>> Oh, the job I talked about has run for more than 11 hrs without a
>>> >>> result. It doesn't make sense.
>>> >>>
>>> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshe...@gmail.com>
>>> >>> wrote:
>>> >>>>
>>> >>>> Hi Burak,
>>> >>>>
>>> >>>> My maxIterations is set to 500, but I think it should also stop
>>> >>>> once the centroids converge, right?
>>> >>>>
>>> >>>> My Spark is 1.2.0, running on Windows 64-bit. My data set is about
>>> >>>> 40k vectors; each vector has about 300 features, all normalised.
>>> >>>> All worker nodes have sufficient memory and disk space.
>>> >>>>
>>> >>>> Thanks,
>>> >>>> David
>>> >>>>
>>> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brk...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Hi David,
>>> >>>>>
>>> >>>>> When the number of runs is large and the data is not properly
>>> >>>>> partitioned, K-Means can appear to hang, in my experience.
>>> >>>>> Especially, setting the number of runs to something high
>>> >>>>> drastically increases the work on the executors. If that's not the
>>> >>>>> case, can you give more info on what Spark version you are using,
>>> >>>>> your setup, and your dataset?
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>> Burak
>>> >>>>>
>>> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshe...@gmail.com> wrote:
>>> >>>>>>
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> When I run k-means clustering with Spark, I get this in the last
>>> >>>>>> two lines of the log:
>>> >>>>>>
>>> >>>>>>     15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>>> >>>>>>     15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>>> >>>>>>
>>> >>>>>> Then it hangs for a long time. There is no active job, and the
>>> >>>>>> driver machine is idle. I cannot access the worker nodes, so I am
>>> >>>>>> not sure if they are busy.
>>> >>>>>>
>>> >>>>>> I understand k-means may take a long time to finish. But why is
>>> >>>>>> there no active job? No log?
>>> >>>>>>
>>> >>>>>> Thanks,
>>> >>>>>> David

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
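[Editor's note] Burak's recipe in the thread above (repartition to the number of available cores, cache, force materialization with count() right before training, then start with a much smaller k) can be sketched as a single Spark driver program. This is a minimal sketch against the Spark 1.2 MLlib API, not David's actual code: the input path is the placeholder from the thread, numCores = 32 assumes the 4 executors x 8 cores mentioned above, and the line-parsing logic (whitespace-separated doubles) is a hypothetical stand-in for the elided Vectors.dense(...) call. It requires a running Spark cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.storage.StorageLevel

object KMeansDebug {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-debug"))

    // Assumed: 4 executors x 8 cores, as described in the thread.
    val numCores = 32

    val data = sc.textFile("very/large/text/file")
      // Hypothetical parsing: whitespace-separated doubles per line.
      .map(line => Vectors.dense(line.split("\\s+").map(_.toDouble)))
      .repartition(numCores)                  // 1) spread data across all cores
      .persist(StorageLevel.MEMORY_AND_DISK)  // 2) cache

    // 3) materialize the cached RDD before training starts
    println(s"vectors: ${data.count()}")

    // 4) start with a much smaller k; scale up once this finishes quickly
    val model = KMeans.train(data, 500, 100)
    println(s"WSSSE: ${model.computeCost(data)}")

    sc.stop()
  }
}
```

If, as David observed, repartition() still lands all partitions on one executor, it is worth checking the Spark UI's Storage and Executors tabs after the count() to confirm where the cached blocks actually live before blaming k-means itself.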