How do I get the number of cores that I specified at the command line? I
want to use "spark.default.parallelism". I have 4 executors, each with 8
cores. According to
https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
the "spark.default.parallelism" value will be 4 * 8 = 32... I think that is
too large, or at least inappropriate for my job. Please give some
suggestions.
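
If it helps in answering: here is a minimal sketch (untested on my
cluster) of how I believe the effective parallelism can be inspected at
runtime; sc is the usual SparkContext.

val parallelism = sc.defaultParallelism // derived default, e.g. 4 * 8 = 32
// None if spark.default.parallelism was not set explicitly:
val configured = sc.getConf.getOption("spark.default.parallelism")
println(s"defaultParallelism=$parallelism, configured=$configured")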

I have already used cache, and called count to force the data to be cached
up front.

I can try a smaller k for testing, but eventually I will have to use k =
5000 or even larger, because I estimate our data set has about that many
clusters.


Thanks,
David


On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz <brk...@gmail.com> wrote:

> Hi David,
> The number of centroids (k=5000) seems too large and is probably the cause
> of the code taking too long.
>
> Can you please try the following:
> 1) Repartition data to the number of available cores with
> .repartition(numCores)
> 2) cache data
> 3) call .count() on data right before k-means
> 4) try k=500 (even less if possible)
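>
> A quick sketch of steps 1-4 together (numCores = 32 is a guess based on
> your 4 executors * 8 cores; adjust to whatever your cluster reports):
>
> val numCores = 32
> val prepared = data.repartition(numCores).cache()
> prepared.count() // materializes the cache right before k-means
> val model = KMeans.train(prepared, k = 500, maxIterations = 500)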
>
> Thanks,
> Burak
>
> On Mar 26, 2015 4:15 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
> >
> > The code is very simple.
> >
> > import org.apache.spark.mllib.clustering.KMeans
> > import org.apache.spark.mllib.linalg.Vectors
> >
> > val data = sc.textFile("very/large/text/file").map { l =>
> >   // turn each line into a dense vector (parsing elided)
> >   Vectors.dense(...)
> > }
> >
> > // the resulting data set is about 40k vectors
> >
> > KMeans.train(data, k = 5000, maxIterations = 500)
> >
> > I just killed my application. In the log I found this:
> >
> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of block
> broadcast_26_piece0
> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
> connection from
> workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
> > java.io.IOException: An existing connection was forcibly closed by the
> remote host
> >
> > Notice the time gap. I think it means the worker nodes did not generate
> any log at all for about 12 hrs... does that mean they were not working at
> all?
> >
> > But when testing with a very small data set, my application works and
> outputs the expected data.
> >
> >
> > Thanks,
> > David
> >
> >
> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brk...@gmail.com> wrote:
> >>
> >> Can you share the code snippet of how you call k-means? Do you cache
> the data before k-means? Did you repartition the data?
> >>
> >> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
> >>>
> >>> Oh, the job I mentioned has run for more than 11 hrs without a
> result... it doesn't make sense.
> >>>
> >>>
> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshe...@gmail.com> wrote:
> >>>>
> >>>> Hi Burak,
> >>>>
> >>>> My maxIterations is set to 500, but I think it should also stop
> early if the centroids converge, right?
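> >>>>
> >>>> From skimming the MLlib source (so treat this as a sketch, not
> >>>> gospel), it does stop early once every centroid moves less than a
> >>>> tolerance epsilon, 1e-4 by default. Roughly this per-centroid test,
> >>>> written with Vectors.sqdist just for illustration:
> >>>>
> >>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
> >>>>
> >>>> // iterate again if any centroid moved more than eps (squared distance)
> >>>> def stillMoving(oldC: Vector, newC: Vector, eps: Double = 1e-4): Boolean =
> >>>>   Vectors.sqdist(oldC, newC) > eps * eps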
> >>>>
> >>>> My Spark is 1.2.0, running on 64-bit Windows. My data set is about
> 40k vectors; each vector has about 300 features, all normalised. All
> worker nodes have sufficient memory and disk space.
> >>>>
> >>>> Thanks,
> >>>> David
> >>>>
> >>>>
> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brk...@gmail.com> wrote:
> >>>>>
> >>>>> Hi David,
> >>>>>
> >>>>> When the number of runs is large and the data is not properly
> partitioned, k-means can appear to hang, in my experience. In particular,
> setting the number of runs to something high drastically increases the
> work on the executors. If that's not the case, can you give more info on
> the Spark version you are using, your setup, and your dataset?
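> >>>>>
> >>>>> For reference, runs restarts k-means with different random
> >>>>> initializations and keeps the best model; it defaults to 1. A sketch
> >>>>> of setting it explicitly through the builder API:
> >>>>>
> >>>>> val model = new KMeans()
> >>>>>   .setK(500)
> >>>>>   .setRuns(1) // keep this low; higher values multiply executor work
> >>>>>   .setMaxIterations(500)
> >>>>>   .run(data)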
> >>>>>
> >>>>> Thanks,
> >>>>> Burak
> >>>>>
> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshe...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> When I run k-means clustering with Spark, these are the last two
> lines in the log:
> >>>>>>
> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Then it hangs for a long time. There is no active job and the
> driver machine is idle. I cannot access the worker nodes, so I am not
> sure whether they are busy.
> >>>>>>
> >>>>>> I understand k-means may take a long time to finish. But why is
> there no active job, and no log output?
> >>>>>>
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>>
>
