Hi Shengzhe

I ran into the same situation.

I think Connection and ConnectionManager have some race conditions,
and the error you mentioned may be caused by them.
I'm trying to resolve them in https://github.com/apache/spark/pull/2019.
Please take a look.
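
As a rough illustration of how this kind of exception can arise (this is
not Spark's code; the class and method names below are made up for the
example), a java.nio SelectionKey throws CancelledKeyException as soon as
it is touched after being cancelled. In a real race the cancel happens on
another thread; here it is done inline to keep the sketch deterministic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

// Hypothetical demo class, not part of Spark.
public class CancelledKeyDemo {

    // Returns true if changing interest ops on a cancelled key throws.
    static boolean unguardedAccessThrows() {
        try (Selector selector = Selector.open();
             ServerSocketChannel channel = ServerSocketChannel.open()) {
            channel.configureBlocking(false);
            SelectionKey key = channel.register(selector, SelectionKey.OP_ACCEPT);
            key.cancel(); // stands in for another thread closing the connection
            try {
                key.interestOps(SelectionKey.OP_ACCEPT);
                return false;
            } catch (CancelledKeyException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns true if the guarded access completes without an exception escaping.
    static boolean guardedAccessIsSafe() {
        try (Selector selector = Selector.open();
             ServerSocketChannel channel = ServerSocketChannel.open()) {
            channel.configureBlocking(false);
            SelectionKey key = channel.register(selector, SelectionKey.OP_ACCEPT);
            key.cancel();
            try {
                // isValid() alone is still racy across threads, so the
                // catch below is kept as a second line of defence.
                if (key.isValid()) {
                    key.interestOps(SelectionKey.OP_ACCEPT);
                }
            } catch (CancelledKeyException e) {
                // Key was cancelled between the check and the call; skip it.
            }
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("unguarded access throws: " + unguardedAccessThrows());
        System.out.println("guarded access is safe:  " + guardedAccessIsSafe());
    }
}
```

Checking isValid() only narrows the window; if one thread can cancel keys
while another is using them, the accesses also need to be synchronized or
the exception handled where the key is used.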

- Kousuke

(2014/08/26 8:53), yao wrote:
Hi Folks,

We are testing our home-made KMeans algorithm using Spark on YARN.
Recently, we've found that the application fails frequently when
clustering over 300,000,000 users (each user is represented by a feature
vector and the whole data set is around 600,000,000). After digging into
the job log, we found many CancelledKeyExceptions thrown by
ConnectionManager, but did not observe any other exceptions. We suspect
that the frequent CancelledKeyExceptions bring the whole application down,
since the application often fails on the third or fourth iteration for
large datasets. Any suggestions on where to look would be welcome.

*Errors in job log*:
java.nio.channels.CancelledKeyException
         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
         at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
java.nio.channels.CancelledKeyException
         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
         at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
java.nio.channels.CancelledKeyException
         at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
         at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4


Best
Shengzhe



