[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824848#comment-15824848 ] Hong Shen commented on SPARK-18893: --- +1, "alter table add/replace columns" is a very important feature. [~yhuai] [~rxin] > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.1 > Reporter: zuotingbing > > When we upgraded Spark from version 1.5.2 to 2.0.1, every job of ours that changes a table with "alter table add columns" failed, even though the official documentation says "All Hive DDL Functions, including: alter table" are supported: http://spark.apache.org/docs/latest/sql-programming-guide.html. > Is there any plan to support "alter table .. add/replace columns" in SQL? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-18910: -- Description: When I create a UDF that jar file in hdfs, I can't use the UDF. {code} spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 {code} The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" {code} def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme {color:red}uri.toURL{color} } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } {code} I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: {code} static { URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } {code} was: When I create a UDF that jar file in hdfs, I can't use the UDF. {code} spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 {code} The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" {code} def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme {color:red}uri.toURL{color} } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } {code} I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: {code} static { // This method can be called at most once in a given JVM. URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } {code} > Can't use UDF that jar file in hdfs > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code} > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > {code} > The reason is when > org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, > the uri.toURL throw exception with " failed unknown protocol: hdfs" > {code} > def addJar(path: String): Unit = { > sparkSession.sparkContext.addJar(path) > val uri = new Path(path).toUri > val jarURL = if (uri.getScheme == null) { > // `path` is a local file path without a URL scheme > new File(path).toURI.toURL
[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-18910: -- Description: When I create a UDF that jar file in hdfs, I can't use the UDF. {code} spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 {code} The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" {code} def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme {color:red}uri.toURL{color} } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } {code} I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: {code} static { // This method can be called at most once in a given JVM. URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } {code} was: When I create a UDF that jar file in hdfs, I can't use the UDF. spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme {color:red}uri.toURL{color} } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: static { // This method can be called at most once in a given JVM. URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } > Can't use UDF that jar file in hdfs > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar file in hdfs, I can't use the UDF. 
> {code} > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > {code} > The reason is when > org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, > the uri.toURL throw exception with " failed unknown protocol: hdfs" > {code} > def addJar(path: String): Unit = { > sparkSession.sparkContext.addJar(path) > val uri = new Path(path).toUri > val jarURL = if (uri.getScheme == null) { > // `path` is a local file path without a URL scheme >
[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-18910: -- Description: When I create a UDF that jar file in hdfs, I can't use the UDF. spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme {color:red}uri.toURL{color} } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: static { // This method can be called at most once in a given JVM. URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } was: When I create a UDF that jar file in hdfs, I can't use the UDF. spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme uri.toURL } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: static { // This method can be called at most once in a given JVM. URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } > Can't use UDF that jar file in hdfs > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar file in hdfs, I can't use the UDF. 
> > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > > The reason is when > org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, > the uri.toURL throw exception with " failed unknown protocol: hdfs" > > def addJar(path: String): Unit = { > sparkSession.sparkContext.addJar(path) > val uri = new Path(path).toUri > val jarURL = if (uri.getScheme == null) { > // `path` is a local file path without a URL scheme > new File(path).toURI.toURL > } else { > // `path` is a URL
[jira] [Commented] (SPARK-18910) Can't use UDF that jar file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15756385#comment-15756385 ] Hong Shen commented on SPARK-18910: --- Should I add a pull request to resolve this problem? > Can't use UDF that jar file in hdfs > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar file in hdfs, I can't use the UDF. > > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > > The reason is when > org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, > the uri.toURL throw exception with " failed unknown protocol: hdfs" > > def addJar(path: String): Unit = { > sparkSession.sparkContext.addJar(path) > val uri = new Path(path).toUri > val jarURL = if (uri.getScheme == null) { > // `path` is a local file path without a URL scheme > new File(path).toURI.toURL > } else { > // `path` is a URL with a scheme > uri.toURL > } > jarClassLoader.addURL(jarURL) > Thread.currentThread().setContextClassLoader(jarClassLoader) > } > > I think we should setURLStreamHandlerFactory method on URL with an instance > of FsUrlStreamHandlerFactory, just like: > > static { > // This method can be called at most once in a given JVM. > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); > } > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
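A minimal sketch of the registration proposed above, assuming Hadoop's org.apache.hadoop.fs.FsUrlStreamHandlerFactory is on the classpath. Because URL.setURLStreamHandlerFactory can be called at most once per JVM, the call is guarded; the object name HdfsUrlSupport and its register() helper are made up for illustration and are not part of Spark.

{code}
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// Hypothetical helper (not part of Spark): installs Hadoop's URL stream
// handler so that `new Path("hdfs://...").toUri.toURL` stops failing with
// "unknown protocol: hdfs".
object HdfsUrlSupport {
  @volatile private var attempted = false

  def register(conf: Configuration = new Configuration()): Unit = synchronized {
    if (!attempted) {
      attempted = true
      try {
        // May be called at most once per JVM; a second call throws java.lang.Error.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(conf))
      } catch {
        case _: Error => // some other component already installed a factory
      }
    }
  }
}

// Usage sketch: call once early in driver startup, before any hdfs:// jar
// path is turned into a java.net.URL.
// HdfsUrlSupport.register()
{code}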
[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-18910: -- Description: When I create a UDF that jar file in hdfs, I can't use the UDF. spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 The reason is when org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, the uri.toURL throw exception with " failed unknown protocol: hdfs" def addJar(path: String): Unit = { sparkSession.sparkContext.addJar(path) val uri = new Path(path).toUri val jarURL = if (uri.getScheme == null) { // `path` is a local file path without a URL scheme new File(path).toURI.toURL } else { // `path` is a URL with a scheme uri.toURL } jarClassLoader.addURL(jarURL) Thread.currentThread().setContextClassLoader(jarClassLoader) } I think we should setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory, just like: static { // This method can be called at most once in a given JVM. URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); } was: When I create a UDF that jar spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 > Can't use UDF that jar file in hdfs > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar file in hdfs, I can't use the UDF. > > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. 
This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > > The reason is when > org.apache.spark.sql.internal.SessionState.FunctionResourceLoader.loadResource, > the uri.toURL throw exception with " failed unknown protocol: hdfs" > > def addJar(path: String): Unit = { > sparkSession.sparkContext.addJar(path) > val uri = new Path(path).toUri > val jarURL = if (uri.getScheme == null) { > // `path` is a local file path without a URL scheme > new File(path).toURI.toURL > } else { > // `path` is a URL with a scheme > uri.toURL > } > jarClassLoader.addURL(jarURL) > Thread.currentThread().setContextClassLoader(jarClassLoader) > } > > I think we should setURLStreamHandlerFactory method on URL with an instance > of FsUrlStreamHandlerFactory, just like: > > static { > // This method can be called at most once in a given JVM. > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); > } > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18910) Can't use UDF that jar file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-18910: -- Summary: Can't use UDF that jar file in hdfs (was: Can't use UDF that source file in hdfs) > Can't use UDF that jar file in hdfs > --- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar > > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18910) Can't use UDF that source file in hdfs
[ https://issues.apache.org/jira/browse/SPARK-18910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-18910: -- Description: When I create a UDF that jar spark-sql> create function trans_array as 'com.test.udf.TransArray' using jar 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; spark-sql> describe function trans_array; Function: test_db.trans_array Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray Usage: N/A. Time taken: 0.127 seconds, Fetched 3 row(s) spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) from test_spark limit 10; Error in query: Undefined function: 'trans_array'. This function is neither a registered temporary function nor a permanent function registered in the database 'test_db'.; line 1 pos 7 > Can't use UDF that source file in hdfs > -- > > Key: SPARK-18910 > URL: https://issues.apache.org/jira/browse/SPARK-18910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Hong Shen > > When I create a UDF that jar > > spark-sql> create function trans_array as 'com.test.udf.TransArray' using > jar > 'hdfs://host1:9000/spark/dev/share/libs/spark-proxy-server-biz-service-impl-1.0.0.jar'; > spark-sql> describe function trans_array; > Function: test_db.trans_array > Class: com.alipay.spark.proxy.server.biz.service.impl.udf.TransArray > Usage: N/A. > Time taken: 0.127 seconds, Fetched 3 row(s) > spark-sql> select trans_array(1, '\\|', id, position) as (id0, position0) > from test_spark limit 10; > Error in query: Undefined function: 'trans_array'. This function is neither a > registered temporary function nor a permanent function registered in the > database 'test_db'.; line 1 pos 7 > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18910) Can't use UDF that source file in hdfs
Hong Shen created SPARK-18910: - Summary: Can't use UDF that source file in hdfs Key: SPARK-18910 URL: https://issues.apache.org/jira/browse/SPARK-18910 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16989) fileSystem.getFileStatus throw exception in EventLoggingListener
[ https://issues.apache.org/jira/browse/SPARK-16989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416320#comment-15416320 ] Hong Shen commented on SPARK-16989: --- Thanks for your comment. In our cluster the log directory is a unified config. The clusters have been running for a long time, and machines are frequently added and removed; we don't want to restart the clusters just to add a log directory, so we want to add a config option that creates the log directory. > fileSystem.getFileStatus throw exception in EventLoggingListener > > > Key: SPARK-16989 > URL: https://issues.apache.org/jira/browse/SPARK-16989 > Project: Spark > Issue Type: Improvement >Affects Versions: 2.0.0 >Reporter: Hong Shen >Priority: Minor > > If log directory does not exist, it will throw exception as the follow log. > {code} > 16/05/02 22:24:22 ERROR spark.SparkContext: Error initializing SparkContext. > java.io.FileNotFoundException: File > file:/data/tdwadmin/tdwenv/tdwgaia/logs/sparkhistory/intermediate-done-dir > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) > at > org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:109) > at org.apache.spark.SparkContext.(SparkContext.scala:612) > at > org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:78) > at > org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:46) > at > org.apache.spark.deploy.yarn.SQLApplicationMaster$$anonfun$main$1.apply$mcV$sp(SQLApplicationMaster.scala:311) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:69) > at > org.apache.spark.deploy.yarn.SQLApplicationMaster$.main(SQLApplicationMaster.scala:310) > at > org.apache.spark.deploy.yarn.SQLApplicationMaster.main(SQLApplicationMaster.scala) > {code} > {code} > if (!fileSystem.getFileStatus(new Path(logBaseDir)).isDirectory) { > throw new IllegalArgumentException(s"Log directory $logBaseDir does not > exist.") > } > {code} > There are two problems. > 1 The judgment should add if fileSystem.exists(new Path(logBaseDir)), to > prevent throw exception at getFileStatus. > 2 I think we should add a choose let the applicaitonmaster to create log > directory. Like this: > {code} > if (!fileSystem.exists(lp) && > sparkConf.getBoolean("spark.eventLog.create.if.baseDir.not.exist", > true)) { > fileSystem.mkdirs(lp) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
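A rough sketch of the check proposed above — test for existence (and optionally create the directory) before calling getFileStatus — assuming the Hadoop FileSystem API. The helper name is hypothetical, and the config key spark.eventLog.create.if.baseDir.not.exist is the one suggested in the issue, not an existing Spark setting.

{code}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf

// Hypothetical helper, not the actual EventLoggingListener code.
object EventLogDirCheck {
  def ensureLogDir(sparkConf: SparkConf, hadoopConf: Configuration, logBaseDir: String): Unit = {
    val fs = FileSystem.get(new URI(logBaseDir), hadoopConf)
    val dir = new Path(logBaseDir)
    if (!fs.exists(dir)) {
      // Config key suggested in the issue; the default of true mirrors the proposal above.
      if (sparkConf.getBoolean("spark.eventLog.create.if.baseDir.not.exist", true)) {
        fs.mkdirs(dir) // create the directory instead of aborting SparkContext startup
      } else {
        throw new IllegalArgumentException(s"Log directory $logBaseDir does not exist.")
      }
    } else if (!fs.getFileStatus(dir).isDirectory) {
      throw new IllegalArgumentException(s"Log directory $logBaseDir is not a directory.")
    }
  }
}
{code}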
[jira] [Updated] (SPARK-16989) fileSystem.getFileStatus throw exception in EventLoggingListener
[ https://issues.apache.org/jira/browse/SPARK-16989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16989: -- Description: If log directory does not exist, it will throw exception as the follow log. {code} 16/05/02 22:24:22 ERROR spark.SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/data/tdwadmin/tdwenv/tdwgaia/logs/sparkhistory/intermediate-done-dir does not exist at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:109) at org.apache.spark.SparkContext.(SparkContext.scala:612) at org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:78) at org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:46) at org.apache.spark.deploy.yarn.SQLApplicationMaster$$anonfun$main$1.apply$mcV$sp(SQLApplicationMaster.scala:311) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:69) at org.apache.spark.deploy.yarn.SQLApplicationMaster$.main(SQLApplicationMaster.scala:310) at org.apache.spark.deploy.yarn.SQLApplicationMaster.main(SQLApplicationMaster.scala) {code} {code} if (!fileSystem.getFileStatus(new Path(logBaseDir)).isDirectory) { throw new IllegalArgumentException(s"Log directory $logBaseDir does not exist.") } {code} There are two problems. 1 The judgment should add if fileSystem.exists(new Path(logBaseDir)), to prevent throw exception at getFileStatus. 2 I think we should add a choose let the applicaitonmaster to create log directory. Like this: {code} if (!fileSystem.exists(lp) && sparkConf.getBoolean("spark.eventLog.create.if.baseDir.not.exist", true)) { fileSystem.mkdirs(lp) } {code} was: 16/05/02 22:24:22 ERROR spark.SparkContext: Error initializing SparkContext. 
java.io.FileNotFoundException: File file:/data/tdwadmin/tdwenv/tdwgaia/logs/sparkhistory/intermediate-done-dir does not exist at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:109) at org.apache.spark.SparkContext.(SparkContext.scala:612) at org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:78) at org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:46) at org.apache.spark.deploy.yarn.SQLApplicationMaster$$anonfun$main$1.apply$mcV$sp(SQLApplicationMaster.scala:311) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:69) at org.apache.spark.deploy.yarn.SQLApplicationMaster$.main(SQLApplicationMaster.scala:310) at org.apache.spark.deploy.yarn.SQLApplicationMaster.main(SQLApplicationMaster.scala) > fileSystem.getFileStatus throw exception in EventLoggingListener > > > Key: SPARK-16989 > URL: https://issues.apache.org/jira/browse/SPARK-16989 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Hong Shen > > If log directory does not exist, it will throw exception as the follow log. > {code} > 16/05/02 22:24:22 ERROR spark.SparkContext: Error initializing SparkContext. > java.io.FileNotFoundException: File > file:/data/tdwadmin/tdwenv/tdwgaia/logs/sparkhistory/intermediate-done-dir > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) > at > org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:109) > at
[jira] [Created] (SPARK-16989) fileSystem.getFileStatus throw exception in EventLoggingListener
Hong Shen created SPARK-16989: - Summary: fileSystem.getFileStatus throw exception in EventLoggingListener Key: SPARK-16989 URL: https://issues.apache.org/jira/browse/SPARK-16989 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Hong Shen 16/05/02 22:24:22 ERROR spark.SparkContext: Error initializing SparkContext. java.io.FileNotFoundException: File file:/data/tdwadmin/tdwenv/tdwgaia/logs/sparkhistory/intermediate-done-dir does not exist at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398) at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:109) at org.apache.spark.SparkContext.(SparkContext.scala:612) at org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:78) at org.apache.spark.deploy.yarn.SQLApplicationMaster.(SQLApplicationMaster.scala:46) at org.apache.spark.deploy.yarn.SQLApplicationMaster$$anonfun$main$1.apply$mcV$sp(SQLApplicationMaster.scala:311) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:70) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:69) at org.apache.spark.deploy.yarn.SQLApplicationMaster$.main(SQLApplicationMaster.scala:310) at org.apache.spark.deploy.yarn.SQLApplicationMaster.main(SQLApplicationMaster.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16985) SQL Output maybe overrided
[ https://issues.apache.org/jira/browse/SPARK-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414644#comment-15414644 ] Hong Shen commented on SPARK-16985: --- The reason is that the output path uses SimpleDateFormat("MMddHHmm"), so if two SQL jobs insert into the same table within the same minute, the later output overwrites the earlier one. I think we should change the dateFormat to "MMddHHmmss"; in our cluster no SQL job finishes within one second. > SQL Output maybe overrided > -- > > Key: SPARK-16985 > URL: https://issues.apache.org/jira/browse/SPARK-16985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hong Shen > > In our cluster, sometimes the sql output maybe overrided. When I submit some > sql, all insert into the same table, and the sql will cost less one minute, > here is the detail, > 1 sql1, 11:03 insert into table. > 2 sql2, 11:04:11 insert into table. > 3 sql3, 11:04:48 insert into table. > 4 sql4, 11:05 insert into table. > 5 sql5, 11:06 insert into table. > The sql3's output file will override the sql2's output file. here is the log: > {code} > 16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: > XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 > 16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: > XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
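To make the collision concrete, here is a small self-contained sketch (not Spark's writer code): with minute-level resolution, the two jobs in the log above — started at 11:04:11 and 11:04:48 — map to the same directory name, while second-level resolution plus a unique suffix keeps them apart. The format strings and the jobId suffix are illustrative assumptions.

{code}
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

object OutputNameCollision {
  // Hypothetical naming scheme: second resolution plus a caller-supplied unique id.
  def outputDirName(startMillis: Long, jobId: Int): String = {
    val secondFmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
    s"${secondFmt.format(new Date(startMillis))}_$jobId"
  }

  def main(args: Array[String]): Unit = {
    // Epoch millis for 2016-05-04 11:04:11 and 11:04:48 (UTC+8), matching the log above.
    val start1 = 1462331051000L
    val start2 = 1462331088000L

    val minuteFmt = new SimpleDateFormat("yyyyMMddHHmm", Locale.US)
    // Minute resolution: both jobs get the same name, so the second overwrites the first.
    assert(minuteFmt.format(new Date(start1)) == minuteFmt.format(new Date(start2)))

    // Second resolution with a unique id keeps the two outputs apart.
    assert(outputDirName(start1, jobId = 1) != outputDirName(start2, jobId = 2))
  }
}
{code}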
[jira] [Updated] (SPARK-16985) SQL Output maybe overrided
[ https://issues.apache.org/jira/browse/SPARK-16985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16985: -- Description: In our cluster, sometimes the sql output maybe overrided. When I submit some sql, all insert into the same table, and the sql will cost less one minute, here is the detail, 1 sql1, 11:03 insert into table. 2 sql2, 11:04:11 insert into table. 3 sql3, 11:04:48 insert into table. 4 sql4, 11:05 insert into table. 5 sql5, 11:06 insert into table. The sql3's output file will override the sql2's output file. here is the log: {code} 16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 {code} > SQL Output maybe overrided > -- > > Key: SPARK-16985 > URL: https://issues.apache.org/jira/browse/SPARK-16985 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hong Shen > > In our cluster, sometimes the sql output maybe overrided. When I submit some > sql, all insert into the same table, and the sql will cost less one minute, > here is the detail, > 1 sql1, 11:03 insert into table. > 2 sql2, 11:04:11 insert into table. > 3 sql3, 11:04:48 insert into table. > 4 sql4, 11:05 insert into table. > 5 sql5, 11:06 insert into table. > The sql3's output file will override the sql2's output file. here is the log: > {code} > 16/05/04 11:04:11 INFO hive.SparkHiveHadoopWriter: > XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_1204544348/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 > 16/05/04 11:04:48 INFO hive.SparkHiveHadoopWriter: > XXfinalPath=hdfs://tl-sng-gdt-nn-tdw.tencent-distribute.com:54310/tmp/assorz/tdw-tdwadmin/20160504/04559505496526517_-1_212180468/1/_tmp.p_20160428/attempt_201605041104_0001_m_00_1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16985) SQL Output maybe overrided
Hong Shen created SPARK-16985: - Summary: SQL Output maybe overrided Key: SPARK-16985 URL: https://issues.apache.org/jira/browse/SPARK-16985 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411620#comment-15411620 ] Hong Shen commented on SPARK-16709: --- If you agree it's a bug, then I will reopen the jira and add a patch to fix it. > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411616#comment-15411616 ] Hong Shen commented on SPARK-16709: --- Sorry for the late reply. This is different from SPARK-14915; in fact, we have already applied the change that SPARK-14915 made. {code} sched.dagScheduler.taskEnded(tasks(index), reason, null, null, info, taskMetrics) // If speculation is enabled, when one task attempt succeeds, the other attempts in state RUNNING will be killed. // The killed tasks will statusUpdate with state KILLED/FAILED. // In this case, the task should not be re-added. if (!successful(index)) { addPendingTask(index) } if (!isZombie && state != TaskState.KILLED {code} Here is the reason; the log is: {code} 16/07/28 05:22:15 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.0 (TID 175, 10.215.146.81, partition 1,PROCESS_LOCAL, 1930 bytes) 16/07/28 05:28:35 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 1.1 (TID 207, 10.196.147.232, partition 1,PROCESS_LOCAL, 1930 bytes) 16/07/28 05:28:48 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 1.0 (TID 175) in 393261 ms on 10.215.146.81 (3/50) 16/07/28 05:34:11 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.1 (TID 207, 10.196.147.232): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 1, attemptNumber: 207 {code} 1. Task 1.0 in stage 1.0 starts. 2. Stage 1.0 fails and stage 1.1 starts. 3. Task 1.0 in stage 1.1 starts. 4. Task 1.0 in stage 1.0 finishes. 5. Task 1.0 in stage 1.1 fails with a TaskCommitDenied exception, then retries forever. > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
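The sequence above suggests one possible guard — a sketch under my own assumptions, not the actual Spark fix: when an attempt fails with TaskCommitDenied but the partition has already been committed by another attempt, re-queueing it can never succeed, so the failure could be treated as terminal rather than retried.

{code}
import org.apache.spark.TaskCommitDenied

// Hypothetical helper a scheduler could consult before re-adding a failed task.
object CommitDeniedRetryPolicy {
  def shouldRequeue(reason: Any, partitionAlreadyCommitted: Boolean): Boolean =
    reason match {
      case _: TaskCommitDenied if partitionAlreadyCommitted =>
        false // another attempt owns the commit; retrying this attempt is pointless
      case _ =>
        true // keep the normal retry behaviour for ordinary failures
    }
}
{code}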
[jira] [Comment Edited] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393062#comment-15393062 ] Hong Shen edited comment on SPARK-16709 at 7/26/16 2:10 AM: It different, the task has no attempt succeed. This happen when a task attempt failed when performCommit(), the following attempts can't commit any more, because just one task can performCommit(), even though the first attempt failed at performCommit(). was (Author: shenhong): It different, the task has no attempt succeed. this happen when a task attempt failed when performCommit(), the following attempts can commit any more, because just one task can performCommit(), even though the first attempt failed at performCommit(). > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393062#comment-15393062 ] Hong Shen commented on SPARK-16709: --- It's different; the task has no successful attempt. This happens when a task attempt fails in performCommit(): the following attempts can't commit any more, because only one task attempt is allowed to performCommit(), even though the first attempt failed at performCommit(). > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16708) ExecutorAllocationManager.numRunningTasks can be negative when stage retry
[ https://issues.apache.org/jira/browse/SPARK-16708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393054#comment-15393054 ] Hong Shen commented on SPARK-16708: --- It's different: a negative ExecutorAllocationManager.numRunningTasks can make ExecutorAllocationManager.maxNumExecutorsNeeded negative, and that can leave many tasks stuck pending while the ExecutorAllocationManager does not allocate any executors. > ExecutorAllocationManager.numRunningTasks can be negative when stage retry > -- > > Key: SPARK-16708 > URL: https://issues.apache.org/jira/browse/SPARK-16708 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When a task fetch fails, the stage completes and is retried. When the stage > completes, ExecutorAllocationManager.numRunningTasks is set to 0; here is > the code: > {code} > override def onStageCompleted(stageCompleted: > SparkListenerStageCompleted): Unit = { > val stageId = stageCompleted.stageInfo.stageId > allocationManager.synchronized { > stageIdToNumTasks -= stageId > stageIdToTaskIndices -= stageId > stageIdToExecutorPlacementHints -= stageId > // Update the executor placement hints > updateExecutorPlacementHints() > // If this is the last stage with pending tasks, mark the scheduler > queue as empty > // This is needed in case the stage is aborted for any reason > if (stageIdToNumTasks.isEmpty) { > allocationManager.onSchedulerQueueEmpty() > if (numRunningTasks != 0) { > logWarning("No stages are running, but numRunningTasks != 0") > numRunningTasks = 0 > } > } > } > } > {code} > But when one of the stage's still-running tasks finishes, numRunningTasks is > decremented by 1, so numRunningTasks becomes negative, which can make maxNeeded negative. > {code} > override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { > val executorId = taskEnd.taskInfo.executorId > val taskId = taskEnd.taskInfo.taskId > val taskIndex = taskEnd.taskInfo.index > val stageId = taskEnd.stageId > allocationManager.synchronized { > numRunningTasks -= 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
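A toy reproduction of the accounting problem described above — my own simplification, not ExecutorAllocationManager itself: resetting the counter on stage completion and then decrementing it when a leftover task from the retried stage finishes drives it below zero; clamping at zero, or tracking running tasks per stage, is one way to keep the executor estimate non-negative.

{code}
// Simplified stand-in for the listener-side bookkeeping, for illustration only.
object RunningTaskCounter {
  private var numRunningTasks = 0

  def onTaskStart(): Unit = synchronized { numRunningTasks += 1 }

  def onStageCompleted(noStagesPending: Boolean): Unit = synchronized {
    if (noStagesPending) numRunningTasks = 0 // mirrors the reset shown in the issue above
  }

  def onTaskEnd(): Unit = synchronized {
    // Clamping avoids the negative count; a per-stage map of running tasks
    // would avoid the miscount altogether.
    numRunningTasks = math.max(0, numRunningTasks - 1)
  }

  // With a negative running-task count, this estimate also goes negative and
  // dynamic allocation stops requesting executors.
  def maxExecutorsNeeded(pendingTasks: Int, tasksPerExecutor: Int): Int =
    (numRunningTasks + pendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
}
{code}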
[jira] [Commented] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392987#comment-15392987 ] Hong Shen commented on SPARK-16707: --- It happened many times in our cluster(with thousand of machines), but it's too hard to reproduce this, I can just retry the way stackoverflow.com describe. > TransportClientFactory.createClient may throw NPE > - > > Key: SPARK-16707 > URL: https://issues.apache.org/jira/browse/SPARK-16707 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > I have encounter some NullPointerException when > TransportClientFactory.createClient in my cluster, here is the following > stack trace. > {code} > org.apache.spark.shuffle.FetchFailedException > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at 
java.lang.Thread.run(Thread.java:744) > Caused by: java.lang.NullPointerException > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) > ... 32 more > {code} > The code is at >
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Attachment: commit failed.png > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Attachment: (was: `N0)80~9A(AMT~2}%Q`M[RY.png) > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Attachment: `N0)80~9A(AMT~2}%Q`M[RY.png > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Description: In our cluster, we set spark.speculation=true, but when a task throw exception at SparkHadoopMapRedUtil.performCommit(), this task can retry infinite. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 was: In our cluster, we set spark.speculation=true, but when a task throw exception at SparkHadoopMapRedUtil.performCommit(), this task can retry infinite. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 Index â–´ID Attempt Status Locality Level Executor ID / Host Launch Time DurationGC Time Output Size / Records Shuffle Read Size / Records Errors 0 30910 SUCCESS PROCESS_LOCAL 1554 / 10.215.134.227 2016/07/25 02:33:39 19 min 1.2 min 0.0 B / 149525281121.5 MB / 15286949 0 37940 FAILED PROCESS_LOCAL 2027 / 10.215.146.29 2016/07/25 02:47:04 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 3794 0 40941 FAILED PROCESS_LOCAL 2546 / 10.196.150.233 2016/07/25 03:08:58 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4094 0 43842 FAILED PROCESS_LOCAL 2823 / 10.215.155.155 2016/07/25 03:29:50 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4384 0 45733 FAILED PROCESS_LOCAL 3011 / 10.215.139.24 2016/07/25 03:46:50 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4573 0 48054 SUCCESS PROCESS_LOCAL 3246 / 10.196.138.215 2016/07/25 04:06:12 0 ms1.3 min 0.0 B / 149524481121.5 MB / 15286949 1 30920 SUCCESS PROCESS_LOCAL 1505 / 10.196.130.102 2016/07/25 02:33:39 22 min 4.9 min 0.0 B / 149536921121.9 MB / 15288628 1 37950 FAILED PROCESS_LOCAL 2253 / 10.196.145.33 2016/07/25 02:48:28 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 1, attemptNumber: 3795 1 40741 FAILED PROCESS_LOCAL 2493 / 10.196.148.109 2016/07/25 03:08:49 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 1, attemptNumber: 4074 1 42632 FAILED PROCESS_LOCAL 2705 / 10.196.149.21 2016/07/25 03:25:05 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 1, attemptNumber: 4263 > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Description: In our cluster, we set spark.speculation=true, but when a task throw exception at SparkHadoopMapRedUtil.performCommit(), this task can retry infinite. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 Index â–´ID Attempt Status Locality Level Executor ID / Host Launch Time DurationGC Time Output Size / Records Shuffle Read Size / Records Errors 0 30910 SUCCESS PROCESS_LOCAL 1554 / 10.215.134.227 2016/07/25 02:33:39 19 min 1.2 min 0.0 B / 149525281121.5 MB / 15286949 0 37940 FAILED PROCESS_LOCAL 2027 / 10.215.146.29 2016/07/25 02:47:04 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 3794 0 40941 FAILED PROCESS_LOCAL 2546 / 10.196.150.233 2016/07/25 03:08:58 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4094 0 43842 FAILED PROCESS_LOCAL 2823 / 10.215.155.155 2016/07/25 03:29:50 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4384 0 45733 FAILED PROCESS_LOCAL 3011 / 10.215.139.24 2016/07/25 03:46:50 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4573 0 48054 SUCCESS PROCESS_LOCAL 3246 / 10.196.138.215 2016/07/25 04:06:12 0 ms1.3 min 0.0 B / 149524481121.5 MB / 15286949 1 30920 SUCCESS PROCESS_LOCAL 1505 / 10.196.130.102 2016/07/25 02:33:39 22 min 4.9 min 0.0 B / 149536921121.9 MB / 15288628 1 37950 FAILED PROCESS_LOCAL 2253 / 10.196.145.33 2016/07/25 02:48:28 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 1, attemptNumber: 3795 1 40741 FAILED PROCESS_LOCAL 2493 / 10.196.148.109 2016/07/25 03:08:49 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 1, attemptNumber: 4074 1 42632 FAILED PROCESS_LOCAL 2705 / 10.196.149.21 2016/07/25 03:25:05 / / TaskCommitDenied (Driver denied task commit) for job: 2, partition: 1, attemptNumber: 4263 was: In our cluster, we set spark.speculation=true, but when a task throw exception at SparkHadoopMapRedUtil.performCommit(), this task can retry infinite. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, we set spark.speculation=true, but when a task throw > exception at SparkHadoopMapRedUtil.performCommit(), this task can retry > infinite. 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 > Index â–´ ID Attempt Status Locality Level Executor ID / Host > Launch Time DurationGC Time Output Size / Records Shuffle Read > Size / Records Errors > 0 30910 SUCCESS PROCESS_LOCAL 1554 / 10.215.134.227 > 2016/07/25 02:33:39 19 min 1.2 min 0.0 B / 149525281121.5 MB / > 15286949 > 0 37940 FAILED PROCESS_LOCAL 2027 / 10.215.146.29 > 2016/07/25 02:47:04 / / TaskCommitDenied > (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 3794 > 0 40941 FAILED PROCESS_LOCAL 2546 / 10.196.150.233 > 2016/07/25 03:08:58 / / TaskCommitDenied > (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4094 > 0 43842 FAILED PROCESS_LOCAL 2823 / 10.215.155.155 > 2016/07/25 03:29:50 / / TaskCommitDenied > (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4384 > 0 45733 FAILED PROCESS_LOCAL 3011 / 10.215.139.24 > 2016/07/25 03:46:50 / / TaskCommitDenied > (Driver denied task commit) for job: 2, partition: 0, attemptNumber: 4573 > 0 48054 SUCCESS PROCESS_LOCAL 3246 / 10.196.138.215 > 2016/07/25
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Description: In our cluster, we set spark. When a task throw exception at SparkHadoopMapRedUtil.performCommit(), this task can retry infinite https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, we set spark. When a task throw exception at > SparkHadoopMapRedUtil.performCommit(), this task can retry infinite > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Description: In our cluster, we set spark.speculation=true, but when a task throws an exception at SparkHadoopMapRedUtil.performCommit(), the task can retry infinitely. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 was: In our cluster, we set spark. When a task throw exception at SparkHadoopMapRedUtil.performCommit(), this task can retry infinite https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, we set spark.speculation=true, but when a task throws an > exception at SparkHadoopMapRedUtil.performCommit(), the task can retry > infinitely. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
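For context, the retry loop reported above can be modelled in a few lines of Scala. If I read the linked 1.6-era code correctly, SparkHadoopMapRedUtil asks the driver whether the attempt may commit and throws a CommitDeniedException when it may not; that failure comes back as TaskCommitDenied (the reason visible in the task table earlier in this thread), which does not count towards spark.task.maxFailures. The sketch below is an illustration only, not Spark source: the names mirror those concepts, and the iteration cap is an artificial addition so the demo terminates.

{code}
object CommitDeniedRetryModel {

  val maxTaskFailures = 4        // spark.task.maxFailures
  val committedAttempt = 0       // the attempt the driver already authorized (the winner)

  // Stand-in for the driver-side commit coordination: only the authorized attempt may commit.
  def canCommit(attemptNumber: Int): Boolean = attemptNumber == committedAttempt

  def main(args: Array[String]): Unit = {
    var failures = 0             // only failures that "count" increment this
    var attemptNumber = 1        // a speculative copy that lost the commit race
    var retries = 0
    var done = false

    // In the real scheduler this loop has no bound, because a commit-denied failure does
    // not increment the failure count; the cap of 10 is only there to keep the demo finite.
    while (!done && failures < maxTaskFailures && retries < 10) {
      if (canCommit(attemptNumber)) {
        println(s"attempt $attemptNumber committed")
        done = true
      } else {
        println(s"attempt $attemptNumber denied commit, rescheduled; failures still $failures")
        attemptNumber += 1       // resubmitted with a new attempt number, failures unchanged
        retries += 1
      }
    }
  }
}
{code}

The point of the model is only that the failure counter never moves, so nothing in the ordinary spark.task.maxFailures machinery ever stops the resubmission.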
[jira] [Updated] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16709: -- Affects Version/s: 1.6.0 > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
Hong Shen created SPARK-16709: - Summary: Task with commit failed will retry infinite when speculation set to true Key: SPARK-16709 URL: https://issues.apache.org/jira/browse/SPARK-16709 Project: Spark Issue Type: Bug Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16708) ExecutorAllocationManager.numRunningTasks can be negative when stage retry
[ https://issues.apache.org/jira/browse/SPARK-16708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16708: -- Description: When a task fetch failed, the stage will complete and retry, when the stage complete, ExecutorAllocationManager.numRunningTasks will be set 0, here is the code: {code} override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = { val stageId = stageCompleted.stageInfo.stageId allocationManager.synchronized { stageIdToNumTasks -= stageId stageIdToTaskIndices -= stageId stageIdToExecutorPlacementHints -= stageId // Update the executor placement hints updateExecutorPlacementHints() // If this is the last stage with pending tasks, mark the scheduler queue as empty // This is needed in case the stage is aborted for any reason if (stageIdToNumTasks.isEmpty) { allocationManager.onSchedulerQueueEmpty() if (numRunningTasks != 0) { logWarning("No stages are running, but numRunningTasks != 0") numRunningTasks = 0 } } } } {code} But when the stage's running task finished, numRunningTasks will minus 1, so numRunningTasks be negative, it can cause maxNeeded be negative. {code} override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { val executorId = taskEnd.taskInfo.executorId val taskId = taskEnd.taskInfo.taskId val taskIndex = taskEnd.taskInfo.index val stageId = taskEnd.stageId allocationManager.synchronized { numRunningTasks -= 1 {code} was: When a task fetch failed, the stage will complete and retry, when the stage complete, if (stageIdToNumTasks.isEmpty) { allocationManager.onSchedulerQueueEmpty() if (numRunningTasks != 0) { logWarning("No stages are running, but numRunningTasks != 0") numRunningTasks = 0 } } > ExecutorAllocationManager.numRunningTasks can be negative when stage retry > -- > > Key: SPARK-16708 > URL: https://issues.apache.org/jira/browse/SPARK-16708 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When a task fetch failed, the stage will complete and retry, when the stage > complete, ExecutorAllocationManager.numRunningTasks will be set 0, here is > the code: > {code} > override def onStageCompleted(stageCompleted: > SparkListenerStageCompleted): Unit = { > val stageId = stageCompleted.stageInfo.stageId > allocationManager.synchronized { > stageIdToNumTasks -= stageId > stageIdToTaskIndices -= stageId > stageIdToExecutorPlacementHints -= stageId > // Update the executor placement hints > updateExecutorPlacementHints() > // If this is the last stage with pending tasks, mark the scheduler > queue as empty > // This is needed in case the stage is aborted for any reason > if (stageIdToNumTasks.isEmpty) { > allocationManager.onSchedulerQueueEmpty() > if (numRunningTasks != 0) { > logWarning("No stages are running, but numRunningTasks != 0") > numRunningTasks = 0 > } > } > } > } > {code} > But when the stage's running task finished, numRunningTasks will minus 1, so > numRunningTasks be negative, it can cause maxNeeded be negative. > {code} > override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { > val executorId = taskEnd.taskInfo.executorId > val taskId = taskEnd.taskInfo.taskId > val taskIndex = taskEnd.taskInfo.index > val stageId = taskEnd.stageId > allocationManager.synchronized { > numRunningTasks -= 1 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
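The race described above is easy to reproduce outside Spark. The sketch below is a self-contained Scala toy, not the listener itself: it keeps only the counter and the two callbacks quoted in the description, shows how the counter goes negative, and notes one defensive option (clamping at zero). Whether clamping is the right fix, rather than ignoring late task-end events from a stage that already completed, is left open here.

{code}
object RunningTaskCounterModel {
  private var numRunningTasks = 0

  def onTaskStart(): Unit = synchronized { numRunningTasks += 1 }

  // Stage completes (e.g. aborted after a fetch failure) while its tasks are still running:
  // as in the quoted onStageCompleted, the counter is reset to 0.
  def onStageCompleted(): Unit = synchronized {
    if (numRunningTasks != 0) {
      println(s"No stages are running, but numRunningTasks = $numRunningTasks; resetting to 0")
      numRunningTasks = 0
    }
  }

  // Task-end events for the old stage still arrive afterwards, as in the quoted onTaskEnd.
  def onTaskEnd(): Unit = synchronized {
    numRunningTasks -= 1
    // defensive alternative: numRunningTasks = math.max(numRunningTasks - 1, 0)
  }

  def main(args: Array[String]): Unit = {
    onTaskStart(); onTaskStart()   // two tasks of the stage are running
    onStageCompleted()             // fetch failure -> stage completes and will be retried
    onTaskEnd(); onTaskEnd()       // their task-end events arrive late
    println(s"numRunningTasks = $numRunningTasks")   // -2, which can drag maxNeeded below zero
  }
}
{code}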
[jira] [Updated] (SPARK-16708) ExecutorAllocationManager.numRunningTasks can be negative when stage retry
[ https://issues.apache.org/jira/browse/SPARK-16708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16708: -- Description: When a task fetch failed, the stage will complete and retry, when the stage complete, if (stageIdToNumTasks.isEmpty) { allocationManager.onSchedulerQueueEmpty() if (numRunningTasks != 0) { logWarning("No stages are running, but numRunningTasks != 0") numRunningTasks = 0 } } > ExecutorAllocationManager.numRunningTasks can be negative when stage retry > -- > > Key: SPARK-16708 > URL: https://issues.apache.org/jira/browse/SPARK-16708 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When a task fetch failed, the stage will complete and retry, when the stage > complete, > if (stageIdToNumTasks.isEmpty) { > allocationManager.onSchedulerQueueEmpty() > if (numRunningTasks != 0) { > logWarning("No stages are running, but numRunningTasks != 0") > numRunningTasks = 0 > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16708) ExecutorAllocationManager.numRunningTasks can be negative when stage retry
[ https://issues.apache.org/jira/browse/SPARK-16708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16708: -- Affects Version/s: 1.6.0 > ExecutorAllocationManager.numRunningTasks can be negative when stage retry > -- > > Key: SPARK-16708 > URL: https://issues.apache.org/jira/browse/SPARK-16708 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16708) ExecutorAllocationManager.numRunningTasks can be negative when stage retry
Hong Shen created SPARK-16708: - Summary: ExecutorAllocationManager.numRunningTasks can be negative when stage retry Key: SPARK-16708 URL: https://issues.apache.org/jira/browse/SPARK-16708 Project: Spark Issue Type: Bug Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15391596#comment-15391596 ] Hong Shen commented on SPARK-16707: --- There is a same error here: http://stackoverflow.com/questions/26390968/why-does-the-channel-pipeline-get-returns-a-null-handler-in-netty4 > TransportClientFactory.createClient may throw NPE > - > > Key: SPARK-16707 > URL: https://issues.apache.org/jira/browse/SPARK-16707 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > I have encounter some NullPointerException when > TransportClientFactory.createClient in my cluster, here is the following > stack trace. > {code} > org.apache.spark.shuffle.FetchFailedException > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: 
java.lang.NullPointerException > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) > ... 32 more > {code} > The code is at >
[jira] [Updated] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16707: -- Description: I have encounter some NullPointerException when TransportClientFactory.createClient in my cluster, here is the following stack trace. {code} org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) at 
org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) ... 32 more {code} The code is at https://github.com/apache/spark/blob/branch-1.6/network/common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L144 {code} TransportChannelHandler handler = cachedClient.getChannel().pipeline() .get(TransportChannelHandler.class); synchronized (handler) { {code} synchronized (handler) throw NullPointerException, handler is null. was: I have encounter some NullPointerException when TransportClientFactory.createClient in my cluster, here is the following stack trace. {code} org.apache.spark.shuffle.FetchFailedException at
[jira] [Updated] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16707: -- Description: I have encounter some NullPointerException when TransportClientFactory.createClient in my cluster, here is the following stack trace. {code} org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) at 
org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) ... 32 more {code} The code is at https://github.com/apache/spark/blob/branch-1.6/network/common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L144 {code} if (cachedClient != null && cachedClient.isActive()) { // Make sure that the channel will not timeout by updating the last use time of the // handler. Then check that the client is still alive, in case it timed out before // this code was able to update things. TransportChannelHandler handler = cachedClient.getChannel().pipeline() .get(TransportChannelHandler.class); {color:red} synchronized (handler) { {color}
[jira] [Updated] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16707: -- Description: I have encounter some NullPointerException when TransportClientFactory.createClient in my cluster, here is the following stack trace. {code} org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) at 
org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) ... 32 more {code} The code is at https://github.com/apache/spark/blob/branch-1.6/network/common/src/main/java/org/apache/spark/network/client/TransportClientFactory.java#L144 was: org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
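For readers without the file open, line 144 of the linked TransportClientFactory is the synchronized (handler) block highlighted in the update above; the NPE means pipeline().get(TransportChannelHandler.class) returned null, presumably because the cached channel had already been closed or torn down concurrently (the Stack Overflow thread linked in the comment describes the same symptom). The Scala sketch below is a toy model of one defensive option, not the real netty/Spark types and not the actual fix: treat a cached client whose handler is gone exactly like an inactive client and build a new one, instead of synchronizing on a possibly-null handler.

{code}
final case class FakeHandler(id: Int)
final class FakeClient(val handler: Option[FakeHandler], val active: Boolean)

object ClientFactoryModel {
  // Return the cached client only if it is active AND its handler is still present;
  // otherwise create a fresh one. Synchronization happens only on a non-null handler.
  def getOrCreate(cached: FakeClient)(createNew: () => FakeClient): FakeClient = {
    cached.handler match {
      case Some(h) if cached.active =>
        h.synchronized {
          // the real code updates the handler's "last used" time here so the channel
          // does not idle out (see the comment in the snippet quoted above)
        }
        cached
      case _ =>
        createNew()   // handler already removed from the pipeline, or client inactive
    }
  }

  def main(args: Array[String]): Unit = {
    val stale = new FakeClient(handler = None, active = true)   // the NPE scenario
    val got = getOrCreate(stale)(() => new FakeClient(Some(FakeHandler(1)), active = true))
    println(s"handler of returned client: ${got.handler}")      // Some(FakeHandler(1))
  }
}
{code}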
[jira] [Updated] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16707: -- Affects Version/s: 1.6.0 > TransportClientFactory.createClient may throw NPE > - > > Key: SPARK-16707 > URL: https://issues.apache.org/jira/browse/SPARK-16707 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > org.apache.spark.shuffle.FetchFailedException > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: java.lang.NullPointerException > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) > ... 32 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-16707: -- Description: org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.lang.NullPointerException at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) at org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) at org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) ... 32 more > TransportClientFactory.createClient may throw NPE > - > > Key: SPARK-16707 > URL: https://issues.apache.org/jira/browse/SPARK-16707 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > org.apache.spark.shuffle.FetchFailedException > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) > at
[jira] [Created] (SPARK-16707) TransportClientFactory.createClient may throw NPE
Hong Shen created SPARK-16707: - Summary: TransportClientFactory.createClient may throw NPE Key: SPARK-16707 URL: https://issues.apache.org/jira/browse/SPARK-16707 Project: Spark Issue Type: Bug Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13510: -- Attachment: spark-13510.diff Add FetchStream for shuffle, if the fetch file is bigger than spark.shuffle.max.block.size.inmemory, use FetchStream to fetch the file to disk, it will resolve direct buffer memory exception. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: spark-13510.diff > > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". 
> If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In MapReduce shuffle, it first judges whether the block can be cached in > memory, but Spark doesn't. > If the block is more than we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
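The attached spark-13510.diff is described above as fetching a block to disk once it exceeds spark.shuffle.max.block.size.inmemory, instead of materializing it in one direct buffer. The Scala sketch below only illustrates that decision; the object and method names, the 512 MB default and the chunked-copy helper are assumptions of this note, not the contents of the patch.

{code}
import java.io.{ByteArrayOutputStream, File, FileOutputStream, InputStream, OutputStream}

object BigBlockFetchSketch {

  // stand-in for the spark.shuffle.max.block.size.inmemory threshold mentioned above
  val maxInMemoryBlockSize: Long = 512L * 1024 * 1024

  // Copy in small chunks so no single buffer the size of the whole block is ever allocated.
  private def copy(in: InputStream, out: OutputStream): Unit = {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  }

  /** Small blocks are buffered in memory as before; big ones are streamed to a local file. */
  def fetchBlock(blockSize: Long, open: () => InputStream, localDir: File): Either[Array[Byte], File] = {
    val in = open()
    try {
      if (blockSize <= maxInMemoryBlockSize) {
        val out = new ByteArrayOutputStream(blockSize.toInt)
        copy(in, out)
        Left(out.toByteArray)
      } else {
        val target = File.createTempFile("shuffle-block-", ".data", localDir)
        val out = new FileOutputStream(target)
        try copy(in, out) finally out.close()
        Right(target)
      }
    } finally in.close()
  }
}
{code}

The comment further down in this thread ("if reduce shuffle a block more than 5GB, it shouldn't restore in memory") is the same decision with a different threshold.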
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173675#comment-15173675 ] Hong Shen commented on SPARK-13510: --- I have resolve in our own edition, I will add a pull request this weekend. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". 
> If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In MapReduce shuffle, it first judges whether the block can be cached in > memory, but Spark doesn't. > If the block is more than we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13510: -- Description: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. {code} 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122) 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337 java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:744) 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. If the block is more than we can cache in memory, we should write to disk. was: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. 
{code} 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122) 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337 java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at
[jira] [Updated] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13510: -- Description: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. {code} 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122) 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337 java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:744) 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. If the block is more was: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. 
{code} 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122) 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337 java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173658#comment-15173658 ] Hong Shen commented on SPARK-13510: --- We can't resolve all the shuffle OOM by allocate more memory, If reduce shuffle a block more than 5GB, it shouldn't restore in memory. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". 
> If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In mapreduce shuffle, it will firstly judge whether the block can cache in > memery, but spark doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
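A rough illustration of the check proposed in this thread: before materializing a fetched block, look at its size and stream anything above a threshold to local disk instead of into one huge in-memory buffer, similar to what MapReduce shuffle does. This is only a standalone Scala sketch; LargeBlockFetch, MaxInMemoryFetchBytes and FetchedBlock are made-up names, not Spark classes, and the sketch ignores how ShuffleBlockFetcherIterator or the Netty transport would actually wire this in.
{code}
import java.io.{ByteArrayOutputStream, File, FileOutputStream, InputStream, OutputStream}
import java.nio.ByteBuffer

object LargeBlockFetch {
  // Hypothetical threshold: anything bigger than this is never buffered in memory.
  val MaxInMemoryFetchBytes: Long = 64L * 1024 * 1024

  sealed trait FetchedBlock
  final case class InMemoryBlock(data: ByteBuffer) extends FetchedBlock
  final case class OnDiskBlock(file: File) extends FetchedBlock

  private def copy(in: InputStream, out: OutputStream): Unit = {
    val buf = new Array[Byte](64 * 1024)
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  }

  /** Decide up front whether a block fits in memory; otherwise stream it to a temp file. */
  def fetchBlock(blockId: String, sizeHint: Long, in: InputStream): FetchedBlock = {
    if (sizeHint <= MaxInMemoryFetchBytes) {
      // Small block: materializing it in memory is safe.
      val bytes = new ByteArrayOutputStream(sizeHint.toInt max 32)
      copy(in, bytes)
      InMemoryBlock(ByteBuffer.wrap(bytes.toByteArray))
    } else {
      // Big block: copy it to local disk in small chunks, never one huge allocation.
      val file = File.createTempFile(s"shuffle-$blockId-", ".data")
      val out = new FileOutputStream(file)
      try copy(in, out) finally out.close()
      OnDiskBlock(file)
    }
  }
}
{code}
A real change along these lines would also have to clean up the temporary files and read the spilled block back when the reducer consumes it.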
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173648#comment-15173648 ] Hong Shen commented on SPARK-13510: --- In our cluster, we have lot of sql run on hive, I want to use spark sql to replace hive. But there is a lot of sql's input are more the 10TB, shuffe block could be more than 5GB, When I run some sql on spark sql, some sql failed because shuffle OOM, I can't allocate such more memory to resolve all the failed sql. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". 
> If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In mapreduce shuffle, it will firstly judge whether the block can cache in > memery, but spark doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen reopened SPARK-13510: --- > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". > If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In mapreduce shuffle, it will firstly judge whether the block can cache in > memery, but spark doesn't. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171330#comment-15171330 ] Hong Shen commented on SPARK-13510: --- Add more log, fetch a block of 915.4 MB. 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337 > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". 
> If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In mapreduce shuffle, it will firstly judge whether the block can cache in > memery, but spark doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
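For completeness, the configuration-level stopgaps touched on in this thread can be written as a SparkConf sketch. The values are illustrative only, and as the comments above point out, raising limits does not remove the underlying problem of buffering one very large block at once.
{code}
import org.apache.spark.SparkConf

object FetchMemoryWorkarounds {
  // Illustrative values only; these settings raise or shrink ceilings, they do not
  // remove the underlying problem that one very large block is buffered at once.
  val conf = new SparkConf()
    // More direct-memory headroom for Netty's pooled buffers on the executors.
    .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
    // Caps how much is requested concurrently, but a single 915.4 MB block still
    // arrives as one chunk, so it does not prevent the failure reported here.
    .set("spark.reducer.maxSizeInFlight", "48m")
}
{code}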
[jira] [Updated] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13510: -- Description: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. {code} 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request for 1 blocks (915.4 MB) from 10.196.134.220:7337 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch from 10.196.134.220:7337 (executor id 122) 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 to /10.196.134.220:7337 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in connection from /10.196.134.220:7337 java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:744) 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. was: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. 
{code} org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168693#comment-15168693 ] Hong Shen commented on SPARK-13510: --- I will add the logic in my edition. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our cluster, when I test spark-1.6.0 with a sql, it throw exception and > failed. > {code} > org.apache.spark.shuffle.FetchFailedException: Direct buffer memory > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) > {code} > The reason is that when shuffle a big block(like 1G), task will allocate > the same memory, it will easily throw "FetchFailedException: Direct buffer > memory". > If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In mapreduce shuffle, it will firstly judge whether the block can cache in > memery, but spark doesn't. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13510: -- Description: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. {code} org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. was: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. 
{code} org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. I will add the logic in my edition. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > > In our
[jira] [Created] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
Hong Shen created SPARK-13510: - Summary: Shuffle may throw FetchFailedException: Direct buffer memory Key: SPARK-13510 URL: https://issues.apache.org/jira/browse/SPARK-13510 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Hong Shen In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. {code} org.apache.spark.shuffle.FetchFailedException: Direct buffer memory+details org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. I will add the logic in my edition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13510: -- Description: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. {code} org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. I will add the logic in my edition. was: In our cluster, when I test spark-1.6.0 with a sql, it throw exception and failed. 
{code} org.apache.spark.shuffle.FetchFailedException: Direct buffer memory+details org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:167) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:759) {code} The reason is that when shuffle a big block(like 1G), task will allocate the same memory, it will easily throw "FetchFailedException: Direct buffer memory". If I add -Dio.netty.noUnsafe=true spark.executor.extraJavaOptions, it will throw {code} java.lang.OutOfMemoryError: Java heap space at io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) {code} In mapreduce shuffle, it will firstly judge whether the block can cache in memery, but spark doesn't. I will add the logic in my edition. > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug
[jira] [Commented] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162858#comment-15162858 ] Hong Shen commented on SPARK-13450: --- Thanks, I will add in my own branch first. > SortMergeJoin will OOM when join rows have lot of same keys > --- > > Key: SPARK-13450 > URL: https://issues.apache.org/jira/browse/SPARK-13450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When I run a sql with join, task throw java.lang.OutOfMemoryError and sql > failed. I have set spark.executor.memory 4096m. > SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if > the join rows have a lot of same key, it will throw OutOfMemoryError. > {code} > /** Buffered rows from the buffered side of the join. This is empty if > there are no matches. */ > private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new > ArrayBuffer[InternalRow] > {code} > Here is the stackTrace: > {code} > org.xerial.snappy.SnappyNative.arrayCopy(Native Method) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:89) > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160271#comment-15160271 ] Hong Shen commented on SPARK-13450: --- A join has a lot of rows with the same key. > SortMergeJoin will OOM when join rows have lot of same keys > --- > > Key: SPARK-13450 > URL: https://issues.apache.org/jira/browse/SPARK-13450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When I run a sql with join, task throw java.lang.OutOfMemoryError and sql > failed. I have set spark.executor.memory 4096m. > SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if > the join rows have a lot of same key, it will throw OutOfMemoryError. > {code} > /** Buffered rows from the buffered side of the join. This is empty if > there are no matches. */ > private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new > ArrayBuffer[InternalRow] > {code} > Here is the stackTrace: > {code} > org.xerial.snappy.SnappyNative.arrayCopy(Native Method) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > org.apache.spark.scheduler.Task.run(Task.scala:89) > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158566#comment-15158566 ] Hong Shen commented on SPARK-13450: --- I think we should add a ExternalAppendOnlyArrayBuffer to replace ArrayBuffer. I will add in my own branch first. > SortMergeJoin will OOM when join rows have lot of same keys > --- > > Key: SPARK-13450 > URL: https://issues.apache.org/jira/browse/SPARK-13450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When I run a sql with join, task throw java.lang.OutOfMemoryError and sql > failed. I have set spark.executor.memory 4096m. > SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if > the join rows have a lot of same key, it will throw OutOfMemoryError. > {code} > /** Buffered rows from the buffered side of the join. This is empty if > there are no matches. */ > private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new > ArrayBuffer[InternalRow] > {code} > Here is the stackTrace: > {code} > org.xerial.snappy.SnappyNative.arrayCopy(Native Method) > org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) > java.io.DataInputStream.readFully(DataInputStream.java:195) > java.io.DataInputStream.readLong(DataInputStream.java:416) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) > org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) > org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) > org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > 
org.apache.spark.scheduler.Task.run(Task.scala:89) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:744) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
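A minimal sketch of the ExternalAppendOnlyArrayBuffer idea from the comment above: append rows into a heap buffer and spill full batches to temporary files, so that later iteration replays the spilled batches followed by whatever is still in memory. The class name, the threshold and the use of plain Java serialization are assumptions for illustration; real Spark code would use its own serializers and memory accounting, and would still need to copy mutable rows before buffering them.
{code}
import java.io.{File, FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import scala.collection.mutable.ArrayBuffer

/** Spill-capable replacement for an unbounded ArrayBuffer of matched rows (sketch only). */
class SpillableAppendOnlyBuffer[T <: java.io.Serializable](spillThreshold: Int = 10000)
  extends Iterable[T] {

  private val inMemory = new ArrayBuffer[T]
  private val spillFiles = new ArrayBuffer[File]

  def append(row: T): Unit = {
    inMemory += row
    if (inMemory.size >= spillThreshold) spill()
  }

  private def spill(): Unit = {
    val file = File.createTempFile("buffered-matches-", ".spill")
    file.deleteOnExit()
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try {
      out.writeInt(inMemory.size)
      inMemory.foreach(r => out.writeObject(r))
    } finally out.close()
    spillFiles += file
    inMemory.clear()
  }

  /** Replays spilled batches in order, then whatever is still in memory. */
  override def iterator: Iterator[T] = {
    val spilled = spillFiles.iterator.flatMap { file =>
      val in = new ObjectInputStream(new FileInputStream(file))
      try (0 until in.readInt()).map(_ => in.readObject().asInstanceOf[T]).iterator
      finally in.close()
    }
    spilled ++ inMemory.iterator
  }
}
{code}
In SortMergeJoinScanner, bufferedMatches could then be backed by a structure like this instead of an unbounded ArrayBuffer[InternalRow].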
[jira] [Updated] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
[ https://issues.apache.org/jira/browse/SPARK-13450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-13450: -- Description: When I run a sql with join, task throw java.lang.OutOfMemoryError and sql failed. I have set spark.executor.memory 4096m. SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if the join rows have a lot of same key, it will throw OutOfMemoryError. {code} /** Buffered rows from the buffered side of the join. This is empty if there are no matches. */ private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new ArrayBuffer[InternalRow] {code} Here is the stackTrace: {code} org.xerial.snappy.SnappyNative.arrayCopy(Native Method) org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readLong(DataInputStream.java:416) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) org.apache.spark.rdd.RDD.iterator(RDD.scala:301) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) org.apache.spark.rdd.RDD.iterator(RDD.scala:301) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:89) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) {code} was: When I run a sql with join, task throw java.lang.OutOfMemoryError and sql failed. I have set spark.executor.memory 4096m. 
SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if the join rows have a lot of same key, it will throw OutOfMemoryError. {code:title=Bar.java|borderStyle=solid} /** Buffered rows from the buffered side of the join. This is empty if there are no matches. */ private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new ArrayBuffer[InternalRow] {code} Here is the stackTrace: org.xerial.snappy.SnappyNative.arrayCopy(Native Method) org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readLong(DataInputStream.java:416) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84)
[jira] [Created] (SPARK-13450) SortMergeJoin will OOM when join rows have lot of same keys
Hong Shen created SPARK-13450: - Summary: SortMergeJoin will OOM when join rows have lot of same keys Key: SPARK-13450 URL: https://issues.apache.org/jira/browse/SPARK-13450 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Hong Shen When I run a sql with join, task throw java.lang.OutOfMemoryError and sql failed. I have set spark.executor.memory 4096m. SortMergeJoin use a ArrayBuffer[InternalRow] to store bufferedMatches, if the join rows have a lot of same key, it will throw OutOfMemoryError. {code:title=Bar.java|borderStyle=solid} /** Buffered rows from the buffered side of the join. This is empty if there are no matches. */ private[this] val bufferedMatches: ArrayBuffer[InternalRow] = new ArrayBuffer[InternalRow] {code} Here is the stackTrace: org.xerial.snappy.SnappyNative.arrayCopy(Native Method) org.xerial.snappy.Snappy.arrayCopy(Snappy.java:84) org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:190) org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163) java.io.DataInputStream.readFully(DataInputStream.java:195) java.io.DataInputStream.readLong(DataInputStream.java:416) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.loadNext(UnsafeSorterSpillReader.java:71) org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillMerger$2.loadNext(UnsafeSorterSpillMerger.java:79) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:136) org.apache.spark.sql.execution.UnsafeExternalRowSorter$1.next(UnsafeExternalRowSorter.java:123) org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:84) org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedBufferedToRowWithNullFreeJoinKey(SortMergeJoin.scala:300) org.apache.spark.sql.execution.joins.SortMergeJoinScanner.bufferMatchingRows(SortMergeJoin.scala:329) org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoin.scala:229) org.apache.spark.sql.execution.joins.SortMergeJoin$$anonfun$doExecute$1$$anon$1.advanceNext(SortMergeJoin.scala:105) org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) org.apache.spark.rdd.RDD.iterator(RDD.scala:301) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) org.apache.spark.rdd.RDD.iterator(RDD.scala:301) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:89) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:744) -- This message was sent by 
Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
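To make the failure mode concrete, here is a minimal sketch in plain Scala (not Spark code; the 64-byte row size and the 50 million duplicate keys are made-up numbers) of how buffering every matched row for one hot join key in an in-memory ArrayBuffer grows without bound, since nothing on this path spills to disk.
{code}
// Illustrative only: mimics bufferedMatches growth for one heavily skewed join key.
// Rows for a single key are kept in an in-memory ArrayBuffer and never spilled,
// so memory use is proportional to the number of duplicate keys.
import scala.collection.mutable.ArrayBuffer

object BufferedMatchesSketch {
  def main(args: Array[String]): Unit = {
    val duplicatesForOneKey = 50000000          // illustrative skew: 50M rows share one key
    val bufferedMatches = new ArrayBuffer[Array[Byte]]()
    var i = 0
    while (i < duplicatesForOneKey) {
      bufferedMatches += new Array[Byte](64)    // stand-in for one buffered InternalRow
      i += 1                                    // eventually: java.lang.OutOfMemoryError
    }
    println(bufferedMatches.size)
  }
}
{code}
Until the buffer can spill, the practical options are giving executors more memory or reducing the skew on the join key.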
[jira] [Commented] (SPARK-7271) Redesign shuffle interface for binary processing
[ https://issues.apache.org/jira/browse/SPARK-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14960363#comment-14960363 ] Hong Shen commented on SPARK-7271: -- Hi, I have a question: do you plan to redesign the shuffle reader to implement binary processing? If so, when will you complete it? > Redesign shuffle interface for binary processing > > > Key: SPARK-7271 > URL: https://issues.apache.org/jira/browse/SPARK-7271 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Josh Rosen > > Current shuffle interface is not exactly ideal for binary processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10918) Task failed because executor killed by driver
Hong Shen created SPARK-10918: - Summary: Task failed because executor killed by driver Key: SPARK-10918 URL: https://issues.apache.org/jira/browse/SPARK-10918 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.1 Reporter: Hong Shen When dynamicAllocation is enabled and an executor hits the idle timeout, it will be killed by the driver; if a task is offered to that executor at the same moment, the task fails due to executor lost. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
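A minimal sketch of the kind of guard that would close this race, using hypothetical class and method names rather than the actual Spark scheduler code: the driver marks an executor as pending-kill before asking the cluster manager to kill it, and the scheduler declines to offer tasks to executors in that set.
{code}
// Hypothetical names, not Spark internals: kill requests and task offers are ordered
// so that no task can be launched on an executor that is about to be killed.
import scala.collection.mutable

class IdleExecutorReaper {
  private val pendingKill = mutable.Set[String]()

  def requestKill(executorId: String, doKill: String => Unit): Unit = synchronized {
    pendingKill += executorId        // stop scheduling on it first ...
    doKill(executorId)               // ... then ask the cluster manager to kill it
  }

  /** Called by the scheduler before launching a task on an executor. */
  def canOffer(executorId: String): Boolean = synchronized {
    !pendingKill.contains(executorId)
  }
}
{code}
Raising spark.dynamicAllocation.executorIdleTimeout also narrows the window, at the cost of holding idle executors longer.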
[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-8403: - Description: When partition field and fieldSchema exist in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} Table t_dw_qqlive_209026 is partition by imp_date, itimestamp is a fieldSchema in t_dw_qqlive_209026. When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. was: When partition field and fieldSchema exist in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. Pruner partition won't effective when partition field and fieldSchema exist in sql predicate Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When partition field and fieldSchema exist in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} Table t_dw_qqlive_209026 is partition by imp_date, itimestamp is a fieldSchema in t_dw_qqlive_209026. When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
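The predicate quoted in the report appears to have lost its comparison operator in transit (hour(r.itimestamp)16), so the sketch below only shows the general workaround pattern rather than the exact query: keep the partition-column condition as a standalone top-level conjunct so the pruner can match it, assuming that rewrite is semantically valid for the real query. hiveContext is an assumed Spark 1.x HiveContext, and the hour() threshold is illustrative.
{code}
// Hedged workaround sketch: isolate the partition predicate so partition pruning can apply.
// Table and column names come from the report; the hour() comparison is illustrative because
// the original operator was lost. `hiveContext` is assumed to be an existing HiveContext.
val prunedQuery =
  """SELECT r.uin, r.vid, r.ctype, r.bakstr2, r.cmd
    |FROM t_dw_qqlive_209026 r
    |WHERE r.imp_date = 20150615
    |  AND r.cmd = 2
    |  AND hour(r.itimestamp) > 16
    |""".stripMargin

// With imp_date alone in a top-level AND, only partition 20150615 should need to be scanned.
val rows = hiveContext.sql(prunedQuery)
{code}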
[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exit in sql predicate
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-8403: - Description: When partition field and fieldSchema exist in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. was: When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. Pruner partition won't effective when partition field and fieldSchema exit in sql predicate --- Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When partition field and fieldSchema exist in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-8403: - Summary: Pruner partition won't effective when partition field and fieldSchema exist in sql predicate (was: Pruner partition won't effective when partition field and fieldSchema exit in sql predicate) Pruner partition won't effective when partition field and fieldSchema exist in sql predicate Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When partition field and fieldSchema exist in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8403) Pruner partition won't effective when partition field and fieldSchema exit in sql predicate
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-8403: - Summary: Pruner partition won't effective when partition field and fieldSchema exit in sql predicate (was: Pruner partition won't effective when udf exit in sql predicates) Pruner partition won't effective when partition field and fieldSchema exit in sql predicate --- Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8403) Pruner partition won't effective when udf exit in sql predicates
Hong Shen created SPARK-8403: Summary: Pruner partition won't effective when udf exit in sql predicates Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table fromt_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8403) Pruner partition won't effective when udf exit in sql predicates
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-8403: - Description: When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. was: When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table from t_dw_qqlive_209026. Pruner partition won't effective when udf exit in sql predicates Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table t_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8403) Pruner partition won't effective when udf exit in sql predicates
[ https://issues.apache.org/jira/browse/SPARK-8403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-8403: - Description: When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table from t_dw_qqlive_209026. was: When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table fromt_dw_qqlive_209026. Pruner partition won't effective when udf exit in sql predicates Key: SPARK-8403 URL: https://issues.apache.org/jira/browse/SPARK-8403 Project: Spark Issue Type: Bug Components: SQL Reporter: Hong Shen When udf exit in sql predicates, pruner partition won't effective. Here is the sql, {code} select r.uin,r.vid,r.ctype,r.bakstr2,r.cmd from t_dw_qqlive_209026 r where r.cmd = 2 and (r.imp_date = 20150615 or and hour(r.itimestamp)16) {code} When run on hive, it will only scan data in partition 20150615, but if run on spark sql, it will scan the whole table from t_dw_qqlive_209026. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512244#comment-14512244 ] Hong Shen commented on SPARK-5529: -- Version 1.4.0 will be released in June. BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its BlockManager is removed by the driver, but it takes another half an hour before the executor itself is removed by the driver. Here is the log:
{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms
15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
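A minimal sketch, with hypothetical wiring, of the behaviour this issue asks for: when the BlockManager heartbeat times out, the driver also kills the executor instead of only dropping its BlockManager, so the TaskScheduler re-queues its tasks immediately rather than waiting for the Akka disassociation half an hour later. SparkContext.killExecutor is assumed to be available in the running version; the surrounding method and its parameters are made up.
{code}
// Hedged sketch only: the shape of the handling, not the fix that actually shipped.
// `sc` is an existing SparkContext; killExecutor is assumed to exist in this version.
import org.apache.spark.SparkContext

def onBlockManagerHeartbeatExpired(sc: SparkContext,
                                   executorId: String,
                                   lastSeenMs: Long,
                                   timeoutMs: Long): Unit = {
  val silentForMs = System.currentTimeMillis() - lastSeenMs
  if (silentForMs > timeoutMs) {
    // Previously only the BlockManager entry was dropped here; also killing the executor
    // makes the scheduler notice the loss right away and re-queue its tasks.
    sc.killExecutor(executorId)
  }
}
{code}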
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size, there is a bug in was: ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. 
{code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size, there is a bug in -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
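Part of a gap like this is expected even without any estimation bug: the spilled file holds the same records after serialization and Snappy compression, while estimateSize measures live JVM objects including their headers and pointers. A hedged sketch for comparing the two on a small sample follows; the record shape is made up, and SizeEstimator is private[spark] in some releases, so the snippet may need to live under an org.apache.spark package.
{code}
// Hedged sketch: compare the in-memory size estimate with what a Snappy-compressed,
// Java-serialized spill of the same records would occupy. Differences of 10x or more
// are normal and do not by themselves prove an estimation bug.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import org.apache.spark.util.SizeEstimator
import org.xerial.snappy.SnappyOutputStream

val records: Array[(String, Long)] = (1 to 100000).map(i => (s"key_$i", i.toLong)).toArray

val estimatedInMemory = SizeEstimator.estimate(records)   // in-memory estimate, with object overhead

val buffer = new ByteArrayOutputStream()
val out = new ObjectOutputStream(new SnappyOutputStream(buffer))
records.foreach(r => out.writeObject(r))                   // roughly what a spill writer does
out.close()

println(s"estimated in-memory: $estimatedInMemory bytes, compressed on disk: ${buffer.size} bytes")
{code}
The GC log quoted above suggests the estimate itself is also inflated, which points at the estimator rather than only at compression.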
[jira] [Reopened] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen reopened SPARK-6738: -- There is a in SizeEstimator EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size, there is a bug in SizeEstimator.visitArray. was: ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. 
{code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size, there is a bug in EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size, there is a bug in SizeEstimator.visitArray. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
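The updated description blames SizeEstimator.visitArray. For arrays past a sampling threshold, that method estimates a sample of elements and scales the result up to the full array length, so an array whose elements mostly reference the same underlying object can be reported as far larger than it really is. A hedged illustration follows; the sizes are illustrative, the exact behaviour depends on the Spark release, and SizeEstimator may be private[spark].
{code}
// Hedged illustration of the sampling extrapolation in SizeEstimator.visitArray:
// every element of this array is a reference to the SAME 1 MB buffer, so the real
// retained size is roughly 1 MB plus the reference array, yet a sampled estimate that
// multiplies the sample back up by the array length can come out orders of magnitude
// larger. Exact numbers depend on the Spark version.
import org.apache.spark.util.SizeEstimator

val shared = new Array[Byte](1 << 20)                       // a single 1 MB buffer
val manyRefs: Array[AnyRef] = Array.fill[AnyRef](100000)(shared) // 100,000 references to it

println(s"estimated: ${SizeEstimator.estimate(manyRefs)} bytes")
{code}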
[jira] [Comment Edited] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14504202#comment-14504202 ] Hong Shen edited comment on SPARK-6738 at 4/21/15 2:54 AM: --- There is a bug in SizeEstimator was (Author: shenhong): There is a in SizeEstimator EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6738) EstimateSize is difference with spill file size
Hong Shen created SPARK-6738: Summary: EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 1100M data to disk: 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) But the file size is only 1.1M. [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b The estimateSize is hugh difference with spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here is the other spilled file. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483011#comment-14483011 ] Hong Shen commented on SPARK-6738: -- Yes, it spill lots of files, but each one has only 1.1M. EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here is the other spilled file. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
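To compare the logged spill sizes with what actually lands on disk across all the spill directories, a small helper like the following can sum the temp_local_* files; the directory-walking code is a generic sketch and the example path is illustrative.
{code}
// Hedged helper sketch: total on-disk size of spill files under the executor's local dirs,
// for comparison with the sizes that ExternalAppendOnlyMap logs when it spills.
import java.io.File

def totalSpillBytes(localDirs: Seq[String]): Long = {
  def walk(f: File): Seq[File] =
    if (f.isDirectory) Option(f.listFiles()).toSeq.flatten.flatMap(walk) else Seq(f)

  localDirs.map(new File(_))
    .flatMap(walk)
    .filter(_.getName.startsWith("temp_local_"))
    .map(_.length())
    .sum
}

// Example (path illustrative):
// println(totalSpillBytes(Seq("/data3/yarnenv/local/usercache/spark/appcache/<app-id>")))
{code}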
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here are the other spilled files. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here is the other spilled files. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/24: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_7a3619cf-20e1-4815-8b1f-faeefde40d73 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/2e: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_1e366a65-ced7-4b17-a085-5f873ff6dc43 -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_6772b815-11ee-413a-bc6c-d0dc9cbffc51 {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far)
[jira] [Updated] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6738: - Description: ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size was: ExternalAppendOnlyMap spill 1100M data to disk: {code} 15/04/07 16:39:48 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.5 MB to disk (12 times so far) /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-994b/30/temp_local_e4347165-6263-4678-9f1d-67ad4bcd8fb5 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.3 MB to disk (13 times so far) /data6/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-1e29/26/temp_local_76f9900b-1b3d-4cef-b3a2-6afcde14bbd9 15/04/07 16:39:49 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1105.8 MB to disk (14 times so far) /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b 15/04/07 16:39:50 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 1106.8 MB to disk (15 times so far) {code} But the file size is only 1.1M. {code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b -rw-r- 1 spark users 1.1M Apr 7 16:39 /data7/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-f883/26/temp_local_3ade0aec-ac1d-469d-bc99-b6fa87cb649b {code} Here are the other spilled files. 
{code} [tdwadmin@tdw-10-215-149-231 ~/tdwenv/tdwgaia/logs/container-logs/application_1423737010718_40308573/container_1423737010718_40308573_01_08]$ ll -h /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/* /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/09: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_3a568e10-3997-4d13-adf1-e8dfe4ba4727 /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/18: total 2.2M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_66c0df48-5d79-448b-8989-84ce1a5507d0 -rw-r- 1 spark users 1.1M Apr 7 16:39 temp_local_f6870214-bfd5-47b2-b0b9-37194b55761b /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1a: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_ba1712d2-0eb8-4833-9fa6-a87ee670826c /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1b: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_1d1df5b7-846c-4bcd-a9de-c328e50e62db /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/1f: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:41 temp_local_38c6a144-b588-49b1-b0f0-b91b31c2e85f /data3/yarnenv/local/usercache/spark/appcache/application_1423737010718_40308573/spark-local-20150407163931-fe54/20: total 1.1M -rw-r- 1 spark users 1.1M Apr 7 16:40 temp_local_d816a301-7520-4b6e-8866-bc445f04f0c9
[jira] [Commented] (SPARK-6738) EstimateSize is difference with spill file size
[ https://issues.apache.org/jira/browse/SPARK-6738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14483104#comment-14483104 ] Hong Shen commented on SPARK-6738: -- I don't think serialization is the cause of the problem. The input data is a Hive table, and the job is a Spark SQL query. In fact, when the log shows an in-memory map of 2.2 GB spilling to disk, the file is only 2.2M, and the GC log shows the JVM heap is under 1GB. The estimateSize also deviates from the actual JVM memory. EstimateSize is difference with spill file size Key: SPARK-6738 URL: https://issues.apache.org/jira/browse/SPARK-6738 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen ExternalAppendOnlyMap spill 2.2 GB data to disk: {code} 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: Thread 54 spilling in-memory map of 2.2 GB to disk (61 times so far) 15/04/07 20:27:37 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} But the file size is only 2.2M. {code} ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1423737010718_40455651/spark-local-20150407202613-4e80/11/ total 2.2M -rw-r- 1 spark users 2.2M Apr 7 20:27 temp_local_fdb4a583-5d13-4394-bccb-e1217d5db812 {code} The GC log show that the jvm memory is less than 1GB. {code} 2015-04-07T20:27:08.023+0800: [GC 981981K-55363K(3961344K), 0.0341720 secs] 2015-04-07T20:27:14.483+0800: [GC 987523K-53737K(3961344K), 0.0252660 secs] 2015-04-07T20:27:20.793+0800: [GC 985897K-56370K(3961344K), 0.0606460 secs] 2015-04-07T20:27:27.553+0800: [GC 988530K-59089K(3961344K), 0.0651840 secs] 2015-04-07T20:27:34.067+0800: [GC 991249K-62153K(3961344K), 0.0288460 secs] 2015-04-07T20:27:40.180+0800: [GC 994313K-61344K(3961344K), 0.0388970 secs] 2015-04-07T20:27:46.490+0800: [GC 993504K-59915K(3961344K), 0.0235150 secs] {code} The estimateSize is hugh difference with spill file size -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6104) spark SQL shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14350323#comment-14350323 ] Hong Shen commented on SPARK-6104: -- [~rxin] Can you link to the duplicate issue? spark SQL shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, the Spark shuffle can use ExternalAppendOnlyMap to combine the data it has fetched, but that depends on the ShuffleDependency's aggregator; Spark SQL's shuffle does not define an aggregator, so it easily causes OOM on large-scale data. In our cluster we use Spark SQL heavily, and it often hits OOM; it is hard to reset spark.default.parallelism every day. I think we should add an aggregator for the Spark SQL shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
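To make the role of the aggregator concrete, here is a small self-contained sketch (illustrative only, not Spark internals): with the three combine functions an aggregator provides, fetched records can be folded into one combiner per key as they arrive, so memory grows with the number of distinct keys rather than the number of records, which is exactly what a shuffle without an aggregator gives up.
{code}
// Illustrative sketch only: combine-by-key with an aggregator keeps one
// combiner per key instead of buffering every fetched value.
object AggregatorSketch {
  // The three functions a ShuffleDependency-style aggregator provides.
  final case class Aggregator[V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C) // used to merge partial results from map outputs

  def combineByKey[K, V, C](records: Iterator[(K, V)], agg: Aggregator[V, C]): Map[K, C] = {
    val combiners = scala.collection.mutable.HashMap.empty[K, C]
    for ((k, v) <- records) {
      combiners(k) = combiners.get(k) match {
        case Some(c) => agg.mergeValue(c, v)   // fold into the existing combiner
        case None    => agg.createCombiner(v)  // first value seen for this key
      }
    }
    combiners.toMap
  }

  def main(args: Array[String]): Unit = {
    // 1 million fetched records but only 10 distinct keys:
    // only 10 combiners are ever held in memory.
    val records = Iterator.tabulate(1000000)(i => (i % 10, 1L))
    val sums = combineByKey(records, Aggregator[Long, Long](identity, _ + _, _ + _))
    println(sums)
  }
}
{code}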
[jira] [Updated] (SPARK-6104) spark sql shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6104: - Affects Version/s: 1.2.0 spark sql shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6104) spark sql shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6104: - Description: Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset was:Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. spark sql shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6104) spark sql shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6104: - Description: Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark sql shuffle. was: Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset spark sql shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark sql shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6104) spark SQL shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6104: - Summary: spark SQL shuffle OOM (was: spark sql shuffle OOM) spark SQL shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark sql shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6104) spark SQL shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342863#comment-14342863 ] Hong Shen commented on SPARK-6104: -- [~rxin] [~sandyryza], can you pay attention to this issue? spark SQL shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark SQL's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark SQL, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark SQL shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6104) spark SQL shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-6104: - Description: Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark SQL's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark SQL, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark SQL shuffle. was: Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark sql, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark sql shuffle. spark SQL shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark SQL's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark SQL, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark SQL shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6104) spark SQL shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342872#comment-14342872 ] Hong Shen commented on SPARK-6104: -- Can you link to the Duplicate issue? spark SQL shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark SQL's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark SQL, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark SQL shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6104) spark SQL shuffle OOM
[ https://issues.apache.org/jira/browse/SPARK-6104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14342872#comment-14342872 ] Hong Shen edited comment on SPARK-6104 at 3/2/15 7:25 AM: -- Can you link to the duplicate issue? was (Author: shenhong): Can you link to the Duplicate issue? spark SQL shuffle OOM - Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark SQL's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. In our cluster, we have used spark SQL, but it often appear OOM, it's hard to reset spark.default.parallelism every day. I think we should add aggregator for spark SQL shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6104) spark sql shuffle OOM
Hong Shen created SPARK-6104: Summary: spark sql shuffle OOM Key: SPARK-6104 URL: https://issues.apache.org/jira/browse/SPARK-6104 Project: Spark Issue Type: Improvement Reporter: Hong Shen Currently, spark shuffle can use ExternalAppendOnlyMap to combine data that have fetched, but it depend on ShuffleDependency's aggregator, while spark sql's shuffle have not define aggregator, so it will easily cause OOM in large scale of data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5736) Add executor log url
Hong Shen created SPARK-5736: Summary: Add executor log url Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Bug Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5736) Add executor log url to Executors page on Yarn
[ https://issues.apache.org/jira/browse/SPARK-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5736: - Summary: Add executor log url to Executors page on Yarn (was: Add executor log url to Executors page) Add executor log url to Executors page on Yarn -- Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Hong Shen Currently, there is no executor log URL in the Spark UI (on YARN); we have to read executor logs by logging in to the machine the executor runs on. I think we should add an executor log URL to the Executors page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5736) Add executor log url to Executors page
[ https://issues.apache.org/jira/browse/SPARK-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5736: - Component/s: Web UI Add executor log url to Executors page -- Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5736) Add executor log url to Executors page
[ https://issues.apache.org/jira/browse/SPARK-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5736: - Affects Version/s: 1.2.0 Add executor log url to Executors page -- Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5736) Add executor log url to Executors page
[ https://issues.apache.org/jira/browse/SPARK-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5736: - Summary: Add executor log url to Executors page (was: Add executor log url) Add executor log url to Executors page -- Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5736) Add executor log url to Executors page
[ https://issues.apache.org/jira/browse/SPARK-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5736: - Issue Type: Improvement (was: Bug) Add executor log url to Executors page -- Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Hong Shen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5736) Add executor log url to Executors page
[ https://issues.apache.org/jira/browse/SPARK-5736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5736: - Description: Currently, there is no executor log URL in the Spark UI (on YARN); we have to read executor logs by logging in to the machine the executor runs on. I think we should add an executor log URL to the Executors page. Add executor log url to Executors page -- Key: SPARK-5736 URL: https://issues.apache.org/jira/browse/SPARK-5736 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Hong Shen Currently, there is no executor log URL in the Spark UI (on YARN); we have to read executor logs by logging in to the machine the executor runs on. I think we should add an executor log URL to the Executors page. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
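For reference, on YARN the NodeManager web UI already serves container logs at a predictable path, so an Executors-page link mainly needs each executor's container id, NodeManager HTTP address and user. A hypothetical sketch (the object, method and example values below are illustrative, not Spark's actual API):
{code}
// Hypothetical sketch: assemble an executor-log link from the NodeManager
// web UI, which serves logs at /node/containerlogs/<containerId>/<user>.
object ExecutorLogUrlSketch {
  def executorLogUrl(nodeHttpAddress: String, containerId: String, user: String): String =
    s"http://$nodeHttpAddress/node/containerlogs/$containerId/$user"

  def main(args: Array[String]): Unit = {
    // Example values are made up for illustration.
    println(executorLogUrl("nodemanager-host:8042", "container_1423737010718_0001_01_000002", "spark"))
  }
}
{code}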
[jira] [Commented] (SPARK-5529) Executor is still hold while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304819#comment-14304819 ] Hong Shen commented on SPARK-5529: -- I have already changed it in our own branch. Executor is still hold while BlockManager has been removed -- Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its BlockManager is removed by the driver, but it takes another half an hour before the executor itself is removed by the driver. Here is the log:
{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms
15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5529) Executor is still hold while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong Shen updated SPARK-5529: - Attachment: SPARK-5529.patch Executor is still hold while BlockManager has been removed -- Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) Executor is still hold while BlockManager has been removed
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304433#comment-14304433 ] Hong Shen commented on SPARK-5529: -- The executor is only removed when Akka throws a DisassociatedEvent. Executor is still hold while BlockManager has been removed -- Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen When I run a Spark job, one executor hangs; after 120s its BlockManager is removed by the driver, but it takes another half an hour before the executor itself is removed by the driver. Here is the log:
{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms
15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
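The idea behind the change mentioned above, in simplified and hypothetical form (class and method names are invented, not Spark's): when the heartbeat timeout expires a BlockManager, notify the scheduler to drop the executor at the same time, instead of keeping it until an Akka DisassociatedEvent arrives, which in the log above took roughly half an hour.
{code}
// Hypothetical sketch of the proposed behaviour: expiring a BlockManager for
// missed heartbeats also marks the executor as lost, rather than waiting for
// the Akka disassociation to be reported.
object HeartbeatExpirySketch {
  trait SchedulerBackend { def removeExecutor(execId: String, reason: String): Unit }

  final class BlockManagerHeartbeats(timeoutMs: Long, scheduler: SchedulerBackend) {
    private val lastSeen = scala.collection.mutable.HashMap.empty[String, Long]

    def heartbeat(execId: String, now: Long): Unit = lastSeen(execId) = now

    def expireDeadHosts(now: Long): Unit = {
      val expired = lastSeen.collect { case (id, seen) if now - seen > timeoutMs => id }.toList
      expired.foreach { execId =>
        lastSeen -= execId                 // remove the block manager, as today
        scheduler.removeExecutor(execId,   // proposed addition: don't wait for the
          s"no recent heartbeats (timeout $timeoutMs ms)") // DisassociatedEvent
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val scheduler = new SchedulerBackend {
      def removeExecutor(execId: String, reason: String): Unit =
        println(s"lost executor $execId: $reason")
    }
    val hb = new BlockManagerHeartbeats(timeoutMs = 120000, scheduler)
    hb.heartbeat("1", now = 0L)
    hb.expireDeadHosts(now = 150000L) // 150s with no heartbeat: executor removed promptly
  }
}
{code}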