Re: too many CancelledKeyException thrown from ConnectionManager

2014-08-26 Thread Kousuke Saruta

Hi Shengzhe,

I ran into the same situation.

I think Connection and ConnectionManager have some race conditions, and
the error you mentioned may be caused by them.
I'm currently trying to resolve the issue in
https://github.com/apache/spark/pull/2019.

Please check it out.
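
For readers unfamiliar with this exception, here is a minimal Java NIO sketch of how it can arise. It is not the code from the pull request and not Spark's ConnectionManager; the class and method names (CancelledKeyDemo, safeIsReadable, probeCancelledKey) are hypothetical, and a Pipe stands in for a real network connection. Once a SelectionKey has been cancelled, reading its interest or ready set throws CancelledKeyException, so a selector loop that races with a thread cancelling keys would probably want both an isValid() check and a catch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

class CancelledKeyDemo {

    /**
     * Guarded check (hypothetical helper): treats an invalid key, or a key
     * cancelled by another thread between the two calls, as "not readable".
     */
    static boolean safeIsReadable(SelectionKey key) {
        try {
            return key.isValid() && key.isReadable();
        } catch (CancelledKeyException e) {
            // Lost the race: the key was cancelled after isValid() returned true.
            return false;
        }
    }

    /**
     * Registers a channel, cancels its key, and returns two flags:
     * [0] whether unguarded access threw CancelledKeyException,
     * [1] the result of the guarded check on the same key.
     */
    static boolean[] probeCancelledKey() {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open(); // stand-in for a real connection (assumption)
            try (Pipe.SourceChannel source = pipe.source();
                 Pipe.SinkChannel sink = pipe.sink()) {
                source.configureBlocking(false);
                SelectionKey key = source.register(selector, SelectionKey.OP_READ);

                // In a real system another thread might do this while the
                // selector thread is still iterating over its selected keys.
                key.cancel();

                boolean threw;
                try {
                    key.interestOps(); // unguarded access to a cancelled key
                    threw = false;
                } catch (CancelledKeyException e) {
                    threw = true;
                }
                return new boolean[] { threw, safeIsReadable(key) };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] result = probeCancelledKey();
        System.out.println("unguarded access threw: " + result[0]);
        System.out.println("guarded check result:   " + result[1]);
    }
}
```

The guard only keeps the selector thread from dying on a cancelled key. It does not remove the underlying race, which is why a fix in Connection and ConnectionManager themselves is still needed.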

- Kousuke

(2014/08/26 8:53), yao wrote:

Hi Folks,

We are testing our home-made KMeans algorithm using Spark on YARN.
Recently, we've found that the application fails frequently when
clustering over 300,000,000 users (each user is represented by a feature
vector and the whole data set is around 600,000,000). After digging into
the job log, we found many CancelledKeyExceptions thrown by
ConnectionManager but observed no other exceptions. We suspect that the
frequent CancelledKeyExceptions bring the whole application down, since the
application often fails on the third or fourth iteration for large
datasets. Any suggestions on where to look are welcome.

*Errors in job log*:
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4


Best
Shengzhe







too many CancelledKeyException thrown from ConnectionManager

2014-08-25 Thread yao
Hi Folks,

We are testing our home-made KMeans algorithm using Spark on YARN.
Recently, we've found that the application fails frequently when
clustering over 300,000,000 users (each user is represented by a feature
vector and the whole data set is around 600,000,000). After digging into
the job log, we found many CancelledKeyExceptions thrown by
ConnectionManager but observed no other exceptions. We suspect that the
frequent CancelledKeyExceptions bring the whole application down, since the
application often fails on the third or fourth iteration for large
datasets. Any suggestions on where to look are welcome.

*Errors in job log*:
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
    at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
    at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4


Best
Shengzhe