Re: Only master is really busy at KMeans training

2014-08-26 Thread durin
Right now, I have issues even at a far earlier point.

I'm fetching data from a registered table via

var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000")
  .map(_.head.toString)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) // persisted because it's used again later

var dict = texts.flatMap(_.split(" ").map(_.toLowerCase())).repartition(80) // 80 = 2 * num_cpu

var count = dict.count.toInt


As far as I can see, it's the repartitioning that is causing the problems.
However, without it I end up with only one partition for all further RDD
operations on dict, so it seems to be necessary.

The errors given are

14/08/26 10:43:52 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.1
(TID 2300, idp11.foo.bar): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        java.util.Arrays.copyOf(Arrays.java:3230)
        java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        ...


Then the RDD operations start again, but later I will get

14/08/26 10:47:14 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.2
(TID 2655, idp41.foo.bar): java.lang.NullPointerException:
        $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:26)
        $line39.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:26)
        scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:236)


and another java.lang.OutOfMemoryError.
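For reference, here is a sketch of a variant I might try, where the repartitioning is pulled up to the texts RDD right after the SQL query, so the flatMap output is already spread over many partitions (same table, storage level and partition count as above; I have not verified that this avoids the OOM):

var texts = ctx.sql("SELECT text FROM tweetTrainTable LIMIT 2000")
  .map(_.head.toString)
  .repartition(80) // 80 = 2 * num_cpu, moved before the flatMap
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)

var dict = texts.flatMap(_.split(" ").map(_.toLowerCase())) // no extra repartition here
var count = dict.count.toInt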






Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I
checked your code and I don't understand why you want to broadcast
vectors2, which is an RDD.

var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector = sc.broadcast(vectors2)
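For comparison, a minimal sketch of what I would expect instead: the RDD is handed to KMeans.train directly, and broadcast is reserved for small local values (the partition count, k and maxIterations below are placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.storage.StorageLevel

val numPartitions = 80 // placeholder, see the partitioning discussion further down the thread
val vectors2 = vectors.repartition(numPartitions)
  .persist(StorageLevel.MEMORY_AND_DISK_SER)
// pass the RDD itself to KMeans.train; no sc.broadcast around it
val model = KMeans.train(vectors2, 100, 20) // k = 100, maxIterations = 20 are placeholders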

What is the total memory of your cluster? Does the dataset fit into
memory? If not, you can try turning on `spark.rdd.compress`. The whole
dataset is not small.
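A sketch of the two usual ways to set that (untested here; pick one):

// in conf/spark-defaults.conf:
//   spark.rdd.compress  true
// or programmatically, before the SparkContext is created:
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.rdd.compress", "true")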

-Xiangrui

On Mon, Aug 25, 2014 at 11:46 PM, durin  wrote:
> With a lower number of partitions, I keep losing executors during
> collect at KMeans.scala:283
> The error message is "ExecutorLostFailure (executor lost)".
> The program recovers by automatically repartitioning the whole dataset
> (126G), which takes very long and seems to only delay the inevitable
> failure.
>
> Is there a recommended solution to this issue?
>
>
>
>



Re: Only master is really busy at KMeans training

2014-08-25 Thread durin
With a lower number of partitions, I keep losing executors during
collect at KMeans.scala:283.
The error message is "ExecutorLostFailure (executor lost)".
The program recovers by automatically repartitioning the whole dataset
(126G), which takes very long and seems to only delay the inevitable
failure.

Is there a recommended solution to this issue?







Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes, so please try reducing the number of
partitions to the number of available CPU cores. 1000 partitions are
too many, because the driver needs to collect the task result from
each partition. -Xiangrui
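(A minimal sketch of that suggestion; the worker and per-worker core counts are placeholders for your cluster:)

// 5 workers; cores per worker is a placeholder value
val numCores = 5 * 8
val vectors2 = vectors.coalesce(numCores, shuffle = true)
  .persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)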

On Tue, Aug 19, 2014 at 1:41 PM, durin  wrote:
> When trying to use KMeans.train with some large data and 5 worker nodes, it
> would fail due to BlockManagers shutting down because of timeouts. I was able
> to prevent that by adding
>
> spark.storage.blockManagerSlaveTimeoutMs 300
>
> to the spark-defaults.conf.
>
> However, with 1 Million feature vectors, the Stage takeSample at
> KMeans.scala:263 runs for about 50 minutes. In this time, about half of the
> tasks are done, then I lose the executors and Spark starts a new
> repartitioning stage.
>
> I also noticed that in the takeSample stage, a task would run for about
> 2.5 minutes until it suddenly finished and the shown duration (previously
> those 2.5 min) changed to 2 s, with 0.9 s of GC time.
>
> The training data is supplied in this form:
> var vectors2 =
> vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
> var broadcastVector = sc.broadcast(vectors2)
>
> The 1000 partitions is something that could probably be optimized, but too
> few will cause OOM errors.
>
> Using Ganglia, I can see that the master node is the only one that is
> properly busy regarding CPU, and that most (600-700 of 800 total percent
> CPU) is used by the master.
> The workers on each node only use 1 Core, i.e. 100% CPU.
>
>
> What would be the most likely cause for such an inefficient use of the
> cluster, and how to prevent it?
> Number of partitions, way of caching, ...?
>
> I'm trying to find out myself with tests, but ideas from someone with more
> experience are very welcome.
>
>
> Best regards,
> simn
>
>
>



Only master is really busy at KMeans training

2014-08-19 Thread durin
When trying to use KMeans.train with some large data and 5 worker nodes, it
would fail due to BlockManagers shutting down because of timeouts. I was able
to prevent that by adding
 
spark.storage.blockManagerSlaveTimeoutMs 300

to the spark-defaults.conf.
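For completeness, the same setting could presumably also be applied programmatically before creating the SparkContext (a sketch only; what I actually used is the spark-defaults.conf entry above):

import org.apache.spark.SparkConf
// same property and value as in spark-defaults.conf above
val conf = new SparkConf().set("spark.storage.blockManagerSlaveTimeoutMs", "300")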

However, with 1 Million feature vectors, the Stage takeSample at
KMeans.scala:263 runs for about 50 minutes. In this time, about half of the
tasks are done, then I lose the executors and Spark starts a new
repartitioning stage.

I also noticed that in the takeSample stage, a task would run for about
2.5 minutes until it suddenly finished and the shown duration (previously
those 2.5 min) changed to 2 s, with 0.9 s of GC time.

The training data is supplied in this form:
var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector = sc.broadcast(vectors2)

The 1000 partitions is something that could probably be optimized, but too
few will cause OOM errors.

Using Ganglia, I can see that the master node is the only one that is
properly busy regarding CPU, and that most (600-700 of 800 total percent
CPU) is used by the master. 
The workers on each node only use 1 Core, i.e. 100% CPU.


What would be the most likely cause for such an inefficient use of the
cluster, and how to prevent it?
Number of partitions, way of caching, ...? 

I'm trying to find out myself with tests, but ideas from someone with more
experience are very welcome.


Best regards,
simn


