Wow, great job. We will take a look and try our application again with your patch.

On Tue, Aug 26, 2014 at 5:31 AM, Kousuke Saruta <saru...@oss.nttdata.co.jp> wrote:

> Hi Shengzhe
>
> I faced the same situation.
>
> I think Connection and ConnectionManager have some race condition issues,
> and the error you mentioned may be caused by them.
> I'm now trying to resolve the issue in
> https://github.com/apache/spark/pull/2019.
> Please check it out.
>
> - Kousuke
>
>
> (2014/08/26 8:53), yao wrote:
>
>> Hi Folks,
>>
>> We are testing our home-made KMeans algorithm using Spark on Yarn.
>> Recently, we've found that the application fails frequently when
>> clustering over 300,000,000 users (each user is represented by a feature
>> vector and the whole data set is around 600,000,000). After digging into
>> the job log, we found many CancelledKeyExceptions thrown by
>> ConnectionManager, but observed no other exceptions. We suspect that the
>> frequent CancelledKeyExceptions bring the whole application down, since
>> the application often fails on the third or fourth iteration for large
>> datasets. Any suggestions on direction are welcome.
>>
>> *Errors in job log*:
>>
>> java.nio.channels.CancelledKeyException
>>     at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
>>     at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
>> 14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,43199)
>> 14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
>> 14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@2570cd62
>> 14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@2570cd62
>> java.nio.channels.CancelledKeyException
>>     at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
>>     at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
>> 14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
>> 14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
>> 14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-289.rfiserve.net,56727)
>> 14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@37c8b85a
>> 14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@37c8b85a
>> java.nio.channels.CancelledKeyException
>>     at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:287)
>>     at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
>> 14/08/25 19:04:32 INFO ConnectionManager: Removing SendingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
>> 14/08/25 19:04:32 INFO ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(lsv-668.rfiserve.net,41913)
>> 14/08/25 19:04:32 INFO ConnectionManager: Key not valid ? sun.nio.ch.SelectionKeyImpl@fcea3a4
>> 14/08/25 19:04:32 ERROR ConnectionManager: Corresponding SendingConnectionManagerId not found
>> 14/08/25 19:04:32 INFO ConnectionManager: key already cancelled ? sun.nio.ch.SelectionKeyImpl@fcea3a4
>>
>>
>> Best
>> Shengzhe