The code is very simple:

import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("very/large/text/file").map { l =>
  // turn each line into a dense vector
  Vectors.dense(...)
}

// the resulting data set is about 40k vectors

KMeans.train(data, k = 5000, maxIterations = 500)
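To answer your questions: no, I do not cache or repartition the data
before calling k-means. If I follow your suggestion, I guess it would look
something like this sketch (the partition count of 48 is just a
placeholder):

import org.apache.spark.mllib.clustering.KMeans

// k-means makes many passes over the input, so cache it first
val cached = data.repartition(48).cache()
cached.count() // force materialization before training starts
KMeans.train(cached, k = 5000, maxIterations = 500)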

I just killed my application. In the log I found this:

15/03/26 *11:42:43* INFO storage.BlockManagerMaster: Updated info of block broadcast_26_piece0
15/03/26 *23:02:57* WARN server.TransportChannelHandler: Exception in connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
java.io.IOException: An existing connection was forcibly closed by the remote host

Notice the time gap. I think it means the worker nodes did not generate
any log output at all for more than 11 hours. Does that mean they are not
working at all?

But when I test with a very small data set, my application works and
outputs the expected data.


Thanks,
David


On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brk...@gmail.com> wrote:

> Can you share the code snippet of how you call k-means? Do you cache the
> data before k-means? Did you repartition the data?
> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>
>> Oh, the job I talked about has now run for more than 11 hrs without a
>> result... it doesn't make sense.
>>
>>
>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshe...@gmail.com> wrote:
>>
>>> Hi Burak,
>>>
>>> My maxIterations is set to 500, but I think it should also stop early
>>> once the centroids converge, right?
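>>> As I understand the MLlib docs, training also stops early once no
>>> centroid moves more than a small tolerance (epsilon, default 1e-4). A
>>> rough sketch, assuming setEpsilon is a public setter in this Spark
>>> version:
>>>
>>> import org.apache.spark.mllib.clustering.KMeans
>>>
>>> val model = new KMeans()
>>>   .setK(5000)
>>>   .setMaxIterations(500)
>>>   .setEpsilon(1e-4) // stop once no center moves more than this
>>>   .run(data)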
>>>
>>> My Spark is 1.2.0, running on 64-bit Windows. My data set is about 40k
>>> vectors; each vector has about 300 features, all normalised. All worker
>>> nodes have sufficient memory and disk space.
>>>
>>> Thanks,
>>> David
>>>
>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brk...@gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> In my experience, K-Means tends to hang when the number of runs is
>>>> large and the data is not properly partitioned. In particular, setting
>>>> the number of runs to something high drastically increases the work on
>>>> the executors. If that's not the case, can you give more info on what
>>>> Spark version you are using, your setup, and your dataset?
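>>>> For example, a rough sketch that keeps runs at the default of 1 and
>>>> repartitions the input first (the partition count is a placeholder):
>>>>
>>>> val parts = data.repartition(64).cache()
>>>> KMeans.train(parts, 5000, 500, 1) // k, maxIterations, runs = 1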
>>>>
>>>> Thanks,
>>>> Burak
>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshe...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> When I run k-means clustering with Spark, I get this in the last two
>>>>> lines of the log:
>>>>>
>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>>>>>
>>>>>
>>>>>
>>>>> Then it hangs for a long time. There is no active job, and the driver
>>>>> machine is idle. I cannot access the worker nodes, so I am not sure
>>>>> whether they are busy.
>>>>>
>>>>> I understand k-means may take a long time to finish. But why is there
>>>>> no active job and no log output?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> David
>>>>>
>>>>>
