[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272287#comment-16272287 ]

liyunzhang commented on SPARK-22660:
------------------------------------

JDK9 is not supported now; I am working on it.

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> ------------------------------------------------------------------------
>
>                 Key: SPARK-22660
>                 URL: https://issues.apache.org/jira/browse/SPARK-22660
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.2.0
>            Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
> ambiguous reference to overloaded definition,
> both method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> and method limit in class Buffer of type ()Int
> match expected type ?
>     val resultSize = serializedDirectResult.limit
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
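For context on the ambiguity above: JDK 9 gave java.nio.ByteBuffer covariant overrides of Buffer methods such as limit(int), so Scala 2.12 sees two applicable overloads for a parameterless `.limit`. In Scala sources the fix is an explicit empty argument list (e.g. {{serializedDirectResult.limit()}}). The sketch below shows the related Java compatibility idiom of casting to Buffer; the class and method names are hypothetical illustrations, not Spark code:

```java
import java.nio.Buffer;
import java.nio.ByteBuffer;

public class BufferCompat {
    // Hypothetical helper, not Spark code: casting to Buffer makes the call
    // bind to Buffer.limit(), whose descriptor is identical on JDK 8 and 9,
    // sidestepping the covariant ByteBuffer.limit(int) override added in JDK 9.
    static int limitOf(ByteBuffer buf) {
        return ((Buffer) buf).limit();
    }

    public static void main(String[] args) {
        // A freshly allocated buffer's limit equals its capacity.
        System.out.println(limitOf(ByteBuffer.allocate(64)));
    }
}
```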
[jira] [Comment Edited] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272249#comment-16272249 ]

liyunzhang edited comment on SPARK-22660 at 11/30/17 7:39 AM:
--------------------------------------------------------------

Some new errors:
{code}
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]       properties.putAll(propsMap.asJava)
[error]                  ^
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]       props.putAll(outputSerdeProps.toMap.asJava)
[error]             ^
{code}
The key type is Object instead of String, which is unsafe.
was (Author: kellyzly): Some new errors: [same compiler output as above, without the closing remark]
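The ambiguity reported above exists because JDK 9 added a putAll(Map) override to java.util.Properties alongside the one inherited from Hashtable. One hedged workaround, sketched here in plain Java with hypothetical names (not Spark's actual ScriptTransformationExec code), is to copy entries one at a time via setProperty, which also keeps keys and values typed as String:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PropsCopy {
    // Hypothetical helper, not Spark code: copying entry by entry avoids
    // calling the ambiguous putAll overloads, and setProperty keeps both
    // keys and values typed as String rather than Object.
    static Properties fromMap(Map<String, String> m) {
        Properties props = new Properties();
        for (Map.Entry<String, String> e : m.entrySet()) {
            props.setProperty(e.getKey(), e.getValue());
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("serialization.format", "1");
        System.out.println(fromMap(m).getProperty("serialization.format"));
    }
}
```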
[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272283#comment-16272283 ]

Liang-Chi Hsieh commented on SPARK-22660:
-----------------------------------------

For the error you pinged me about: judging from the error message, it looks like you can try adding {{import scala.language.reflectiveCalls}}. Btw, are we supporting JDK9?
[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272281#comment-16272281 ]

liyunzhang commented on SPARK-22660:
------------------------------------

[~viirya]: the error mentioned above no longer occurs after I rebuilt. Sorry if you have already spent time on it.
[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272249#comment-16272249 ]

liyunzhang commented on SPARK-22660:
------------------------------------

Some new errors:
{code}
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]       properties.putAll(propsMap.asJava)
[error]                  ^
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]       props.putAll(outputSerdeProps.toMap.asJava)
[error]             ^
{code}
[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272225#comment-16272225 ]

liyunzhang commented on SPARK-22660:
------------------------------------

Besides the above error, there are other errors with scala-2.12:
{code}
/home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:151: reflective access of structural type member method getData should be enabled
by making the implicit value scala.language.reflectiveCalls visible.
This can be achieved by adding the import clause 'import scala.language.reflectiveCalls'
or by setting the compiler option -language:reflectiveCalls.
See the Scaladoc for value scala.language.reflectiveCalls for a discussion
why the feature should be explicitly enabled.
    val rdd = sc.parallelize(1 to 1).map(concreteObject.getData)
                                                        ^
/home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:175: reflective access of structural type member value innerObject2 should be enabled
by making the implicit value scala.language.reflectiveCalls visible.
    val rdd = sc.parallelize(1 to 1).map(concreteObject.innerObject2.getData)
                                                        ^
/home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:175: reflective access of structural type member method getData should be enabled
by making the implicit value scala.language.reflectiveCalls visible.
    val rdd = sc.parallelize(1 to 1).map(concreteObject.innerObject2.getData)
{code}
[~viirya]: as you are familiar with SPARK-22328, do you know how to fix this?
[jira] [Created] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
liyunzhang created SPARK-22660:
-----------------------------------

             Summary: Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
                 Key: SPARK-22660
                 URL: https://issues.apache.org/jira/browse/SPARK-22660
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 2.2.0
            Reporter: liyunzhang

Based on SPARK-22659
1. compile with -Pscala-2.12 and get the error
{code}
spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
ambiguous reference to overloaded definition,
both method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
and method limit in class Buffer of type ()Int
match expected type ?
    val resultSize = serializedDirectResult.limit
{code}
[jira] [Assigned] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22659:
------------------------------------

    Assignee: (was: Apache Spark)

> remove sun.misc.Cleaner references
> ----------------------------------
>
>                 Key: SPARK-22659
>                 URL: https://issues.apache.org/jira/browse/SPARK-22659
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.2.0
>            Reporter: liyunzhang
>
> Build with scala-2.12 with the following steps:
> 1. change the pom.xml to scala-2.12
> {code}
> ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> and get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.
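For context on the "cannot find symbol" error above: JDK 9 removed sun.misc.Cleaner from the default compile class path, and java.lang.ref.Cleaner is the supported replacement. The sketch below illustrates that JDK 9+ API only; it is not the change Spark's Platform.java actually made (the class and method names here are hypothetical), and it requires JDK 9 or later to compile:

```java
import java.lang.ref.Cleaner;

public class CleanerDemo {
    // Hypothetical illustration of java.lang.ref.Cleaner (JDK 9+), the
    // supported replacement for the removed sun.misc.Cleaner.
    static String cleanExplicitly() {
        Cleaner cleaner = Cleaner.create();
        StringBuilder log = new StringBuilder();
        Object resource = new Object();
        // The cleaning action must not capture `resource` itself, or the
        // object would never become phantom reachable.
        Cleaner.Cleanable cleanable =
            cleaner.register(resource, () -> log.append("freed"));
        cleanable.clean(); // runs the action at most once, here explicitly
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(cleanExplicitly());
    }
}
```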
[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272187#comment-16272187 ]

Apache Spark commented on SPARK-22659:
--------------------------------------

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/19853
[jira] [Assigned] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22659:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272177#comment-16272177 ]

liyunzhang commented on SPARK-22659:
------------------------------------

I am very confused about why this issue exists when java.version is 1.8 in pom.xml.
{code}
#grep -C2 java.version pom.xml
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>1.8</java.version>
    <maven.compiler.source>${java.version}</maven.compiler.source>
    <maven.compiler.target>${java.version}</maven.compiler.target>
{code}
[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272123#comment-16272123 ]

liyunzhang commented on SPARK-22659:
------------------------------------

So when you compile with -Pscala-2.12 and JDK8, this issue does not occur?
[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272117#comment-16272117 ]

Sean Owen commented on SPARK-22659:
-----------------------------------

This isn't related to Scala 2.12 now. JDK 9 isn't supported, and this isn't the only reason. I therefore don't think this is a valid issue.
[jira] [Updated] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated SPARK-22659:
-------------------------------

    Description:
Build with scala-2.12 with the following steps:
1. change the pom.xml to scala-2.12
{code}
./dev/change-scala-version.sh 2.12
{code}
2. build with -Pscala-2.12
{code}
./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
{code}
and get the error
{code}
/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
{code}
This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.

  was:
the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes failures when building with -Pscala-2.12.
Suggest using {{scala.binary.version}} to solve this.
[jira] [Updated] (SPARK-22659) remove sun.misc.Cleaner references
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang updated SPARK-22659:
-------------------------------

    Summary: remove sun.misc.Cleaner references  (was: Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming)
[jira] [Comment Edited] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272048#comment-16272048 ]

liyunzhang edited comment on SPARK-22659 at 11/30/17 2:31 AM:
--------------------------------------------------------------

I saw the script in d...@spark.apache.org and will try it, thanks!

was (Author: kellyzly): where is the script?
[jira] [Commented] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272048#comment-16272048 ]

liyunzhang commented on SPARK-22659:
------------------------------------

where is the script?
[jira] [Commented] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming
[ https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272045#comment-16272045 ]

Sean Owen commented on SPARK-22659:
-----------------------------------

You can't put vars in the artifact names in Maven. Believe me, if it were that easy we would have done it that way. The script I mentioned is the hack workaround.
[jira] [Created] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming
liyunzhang created SPARK-22659:
-----------------------------------

             Summary: Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming
                 Key: SPARK-22659
                 URL: https://issues.apache.org/jira/browse/SPARK-22659
             Project: Spark
          Issue Type: Improvement
          Components: Build
    Affects Versions: 2.2.0
            Reporter: liyunzhang

the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes failures when building with -Pscala-2.12.
Suggest using {{scala.binary.version}} to solve this.
[jira] [Commented] (SPARK-22630) Consolidate all configuration properties into one page
[ https://issues.apache.org/jira/browse/SPARK-22630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272033#comment-16272033 ]

Hyukjin Kwon commented on SPARK-22630:
--------------------------------------

+1 for ^.

> Consolidate all configuration properties into one page
> ------------------------------------------------------
>
>                 Key: SPARK-22630
>                 URL: https://issues.apache.org/jira/browse/SPARK-22630
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.2.0
>            Reporter: Andreas Maier
>
> The page https://spark.apache.org/docs/2.2.0/configuration.html gives the impression that all configuration properties of Spark are described on that page. Unfortunately this is not true. The descriptions of important properties are spread throughout the documentation. The following pages list properties which are not described on the configuration page:
> https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#performance-tuning
> https://spark.apache.org/docs/2.2.0/monitoring.html#spark-configuration-options
> https://spark.apache.org/docs/2.2.0/security.html#ssl-configuration
> https://spark.apache.org/docs/2.2.0/sparkr.html#starting-up-from-rstudio
> https://spark.apache.org/docs/2.2.0/running-on-yarn.html#spark-properties
> https://spark.apache.org/docs/2.2.0/running-on-mesos.html#configuration
> https://spark.apache.org/docs/2.2.0/spark-standalone.html#cluster-launch-scripts
> As a reader of the documentation I would like a single central webpage describing all Spark configuration properties. Alternatively, it would be nice to at least add links from the configuration page to the other pages of the documentation where configuration properties are described.
[jira] [Commented] (SPARK-22656) Upgrade Arrow to 0.8.0
[ https://issues.apache.org/jira/browse/SPARK-22656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272026#comment-16272026 ]

Hyukjin Kwon commented on SPARK-22656:
--------------------------------------

Hi [~zsxwing], this seems to be a duplicate of SPARK-22324.

> Upgrade Arrow to 0.8.0
> ----------------------
>
>                 Key: SPARK-22656
>                 URL: https://issues.apache.org/jira/browse/SPARK-22656
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 2.2.0
>            Reporter: Shixiong Zhu
>
> Arrow 0.8.0 will upgrade Netty to 4.1.x and unblock SPARK-19552
[jira] [Assigned] (SPARK-22585) Url encoding of jar path expected?
[ https://issues.apache.org/jira/browse/SPARK-22585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-22585: Assignee: Jakub Dubovsky > Url encoding of jar path expected? > -- > > Key: SPARK-22585 > URL: https://issues.apache.org/jira/browse/SPARK-22585 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky >Assignee: Jakub Dubovsky > Fix For: 2.3.0 > > > I am calling {code}sparkContext.addJar{code} method with path to a local jar > I want to add. Example: > {code}/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar{code}. > As a result I get an exception saying > {code} > Failed to add > /home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar to Spark > environment. Stacktrace: > java.io.FileNotFoundException: Jar > /home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar not found > {code} > Important part to notice here is that colon character is url encoded in path > I want to use but exception is complaining about path in decoded form. This > is caused by this line of code from implementation ([see > here|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L1833]): > {code} > case null | "file" => addJarFile(new File(uri.getPath)) > {code} > It uses > [getPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getPath()] > method of > [java.net.URI|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html] > which url decodes the path. I believe method > [getRawPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getRawPath()] > should be used here which keeps path string in original form. > I tend to see this as a bug since I want to use my dependencies resolved from > artifactory with port directly. Is there some specific reason for this or can > we fix this? 
> Thanks
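The getPath() vs getRawPath() difference described in SPARK-22585 can be verified with a small stand-alone snippet (the artifactory.com path is just the example from the report):

```java
import java.net.URI;

public class RawPathDemo {
    public static void main(String[] args) {
        // Percent-encoded colon in the path, as produced by coursier's cache layout.
        URI uri = URI.create(
            "file:///home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar");

        // getPath() decodes %3A back to ':' -- the form the FileNotFoundException shows.
        System.out.println(uri.getPath());
        // -> /home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar

        // getRawPath() keeps the path exactly as given, which is what the report
        // argues addJarFile should receive.
        System.out.println(uri.getRawPath());
        // -> /home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar
    }
}
```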
[jira] [Resolved] (SPARK-22585) Url encoding of jar path expected?
[ https://issues.apache.org/jira/browse/SPARK-22585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-22585. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19834 [https://github.com/apache/spark/pull/19834] > Url encoding of jar path expected? > -- > > Key: SPARK-22585 > URL: https://issues.apache.org/jira/browse/SPARK-22585 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Jakub Dubovsky > Fix For: 2.3.0 > > > I am calling {code}sparkContext.addJar{code} method with path to a local jar > I want to add. Example: > {code}/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar{code}. > As a result I get an exception saying > {code} > Failed to add > /home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar to Spark > environment. Stacktrace: > java.io.FileNotFoundException: Jar > /home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar not found > {code} > Important part to notice here is that colon character is url encoded in path > I want to use but exception is complaining about path in decoded form. This > is caused by this line of code from implementation ([see > here|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L1833]): > {code} > case null | "file" => addJarFile(new File(uri.getPath)) > {code} > It uses > [getPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getPath()] > method of > [java.net.URI|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html] > which url decodes the path. I believe method > [getRawPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getRawPath()] > should be used here which keeps path string in original form. > I tend to see this as a bug since I want to use my dependencies resolved from > artifactory with port directly. Is there some specific reason for this or can > we fix this? 
> Thanks
[jira] [Commented] (SPARK-22373) Intermittent NullPointerException in org.codehaus.janino.IClass.isAssignableFrom
[ https://issues.apache.org/jira/browse/SPARK-22373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271975#comment-16271975 ] Leigh Klotz commented on SPARK-22373: -- [~mshen] Thank you. I've hand-upgraded janino and commons-compiler to 3.0.7, and did no other dependencies. The NPE has not occurred, and I'm running further tests to make sure there are no other ill effects. > Intermittent NullPointerException in > org.codehaus.janino.IClass.isAssignableFrom > > > Key: SPARK-22373 > URL: https://issues.apache.org/jira/browse/SPARK-22373 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.1 > Environment: Hortonworks distribution: HDP 2.6.2.0-205 , > /usr/hdp/current/spark2-client/jars/spark-core_2.11-2.1.1.2.6.2.0-205.jar >Reporter: Dan Meany >Priority: Minor > Attachments: CodeGeneratorTester.scala, generated.java > > > Very occasional and retry works. > Full stack: > 17/10/27 21:06:15 ERROR Executor: Exception in task 29.0 in stage 12.0 (TID > 758) > java.lang.NullPointerException > at org.codehaus.janino.IClass.isAssignableFrom(IClass.java:569) > at > org.codehaus.janino.UnitCompiler.isWideningReferenceConvertible(UnitCompiler.java:10347) > at > org.codehaus.janino.UnitCompiler.isMethodInvocationConvertible(UnitCompiler.java:8636) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8427) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8285) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8169) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8071) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4421) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at 
org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:550) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at >
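The workaround the commenter applied (hand-upgrading janino and commons-compiler to 3.0.7) would look roughly like the following in a Maven build. This is a sketch, not the commenter's exact change; the two artifacts must be kept at the same version, and the coordinates shown are the standard `org.codehaus.janino` ones:

```xml
<!-- Hypothetical dependencyManagement override pinning the janino version
     the commenter reports as avoiding the NPE in IClass.isAssignableFrom. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.codehaus.janino</groupId>
      <artifactId>janino</artifactId>
      <version>3.0.7</version>
    </dependency>
    <dependency>
      <groupId>org.codehaus.janino</groupId>
      <artifactId>commons-compiler</artifactId>
      <version>3.0.7</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```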
[jira] [Commented] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271944#comment-16271944 ] Sean Owen commented on SPARK-22657: --- I wouldn't have expected that to work. The user app classloader wouldn't be usable to Spark code. Are you saying there's an easy workaround though? sure, if so, but I suspect there are other reasons this wouldn't work. > Hadoop fs implementation classes are not loaded if they are part of the app > jar or other jar when --packages flag is used > -- > > Key: SPARK-22657 > URL: https://issues.apache.org/jira/browse/SPARK-22657 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Stavros Kontopoulos > > To reproduce this issue run: > ./bin/spark-submit --master mesos://leader.mesos:5050 \ > --packages com.github.scopt:scopt_2.11:3.5.0 \ > --conf spark.cores.max=8 \ > --conf > spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 > \ > --conf spark.mesos.executor.docker.forcePullImage=true \ > --class S3Job > http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar > \ > --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl > s3n://arand-sandbox-mesosphere/linecount.out > within a container created with > mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image > You will get: "Exception in thread "main" java.io.IOException: No FileSystem > for scheme: s3n" > This can be run reproduced with local[*] as well, no need to use mesos, this > is not mesos bug. > The specific spark job used above can be found here: > https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala > > Can be built with sbt assembly in that dir. > Using this code : > https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cbat the > beginning of the main method... 
> you get the following output : > https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4 > (Use > http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar > to get the modified job) > The job works fine if --packages is not used. > The commit that introduced this issue is (before that things work as > expected): > 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add > glob support for resources adding to Spark (5 months ago) > Thu, 6 Jul 2017 15:32:49 +0800 > The exception comes from here: > https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 > https://github.com/apache/spark/pull/18235/files, check line 950, this is > where a filesystem is first created. > The Filesystem class is initialized there, before the main of the spark job > is launched... the reason is --packages logic uses hadoop libraries to > download files > Maven resolution happens before the app jar and the resolved jars are added > to the classpath. So at that moment there is no s3n to add to the static map > when the Filesystem static members are first initialized and also filled due > to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). > Later in the spark job main where we try to access the s3n filesystem (create > a second filesystem) we get the exception (at this point the app jar has the > s3n implementation in it and its on the class path but that scheme is not > loaded in the static map of the Filesystem class)... > hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the > problem is with the static map which is filled once and only once. > That's why we see two prints of the map contents in the output (gist) above > when --packages is used. The first print is before creating the s3n > filesystem. We use reflection there to get the static map's entries. 
When --packages is not used, that map is empty before creating the s3n filesystem, since up to that point the FileSystem class has not yet been loaded by the classloader.
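The once-only static initialization described above can be illustrated with a self-contained analogy. `Registry` below is a stand-in for Hadoop's FileSystem and its SERVICE_FILE_SYSTEMS map (the names are illustrative, not the real class): the map is filled when the class is first loaded, so implementations that appear on the classpath afterwards, such as the s3n one inside the app jar, are never registered. The reflection in main mirrors the trick the reporter used to print the map's entries:

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

public class StaticMapDemo {
    // Stand-in for FileSystem.SERVICE_FILE_SYSTEMS: populated exactly once,
    // in a static initializer, from whatever is visible at class-load time.
    static class Registry {
        private static final Map<String, String> SCHEMES = new HashMap<>();
        static {
            // Only what was on the classpath when the class was initialized.
            SCHEMES.put("file", "LocalFileSystem");
        }
        static String lookup(String scheme) {
            return SCHEMES.get(scheme);
        }
    }

    public static void main(String[] args) throws Exception {
        // Same reflection approach as the gist: read the private static map.
        Field f = Registry.class.getDeclaredField("SCHEMES");
        f.setAccessible(true);
        @SuppressWarnings("unchecked")
        Map<String, String> m = (Map<String, String>) f.get(null);

        // A jar providing "s3n" added after class init cannot change SCHEMES,
        // which is why disabling the per-instance cache has no effect either.
        System.out.println("s3n registered: " + m.containsKey("s3n"));
        // -> s3n registered: false
    }
}
```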
[jira] [Resolved] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()
[ https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22608. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19821 [https://github.com/apache/spark/pull/19821] > Avoid code duplication regarding CodeGeneration.splitExpressions() > -- > > Key: SPARK-22608 > URL: https://issues.apache.org/jira/browse/SPARK-22608 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.3.0 > > > Since {{CodeGenerator.splitExpressions}} is used in several places with > {{ctx.INPUT_ROW}}, it would be good to prepare APIs for this to avoid code > duplication.
[jira] [Assigned] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()
[ https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22608: --- Assignee: Kazuaki Ishizaki > Avoid code duplication regarding CodeGeneration.splitExpressions() > -- > > Key: SPARK-22608 > URL: https://issues.apache.org/jira/browse/SPARK-22608 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.3.0 > > > Since {{CodeGenerator.splitExpressions}} is used in several places with > {{ctx.INPUT_ROW}}, it would be good to prepare APIs for this to avoid code > duplication.
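For context, splitExpressions exists because a generated Java method must stay within JVM method-size limits, so long runs of generated statements are chopped into helper methods that the caller invokes in order. The following is a toy, self-contained sketch of that idea, not Spark's actual implementation (all names here are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Toy splitExpressions: group generated statements into helper methods
    // so that no single generated method grows past `perMethod` statements.
    static String splitExpressions(List<String> statements, int perMethod) {
        StringBuilder helpers = new StringBuilder();
        StringBuilder calls = new StringBuilder();
        for (int i = 0; i < statements.size(); i += perMethod) {
            String name = "apply_" + (i / perMethod);
            // Emit one helper method holding a slice of the statements.
            helpers.append("private void ").append(name).append("(InternalRow i) {\n");
            for (String stmt : statements.subList(i, Math.min(i + perMethod, statements.size()))) {
                helpers.append("  ").append(stmt).append("\n");
            }
            helpers.append("}\n");
            // The caller just invokes the helpers in order.
            calls.append(name).append("(i);\n");
        }
        return calls + helpers.toString();
    }

    public static void main(String[] args) {
        List<String> stmts = new ArrayList<>();
        for (int k = 0; k < 5; k++) {
            stmts.add("value" + k + " = compute" + k + "(i);");
        }
        System.out.println(splitExpressions(stmts, 2));
    }
}
```

The SPARK-22608 change is then about not repeating this call pattern at each codegen site that uses ctx.INPUT_ROW, but wrapping it in a shared API.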
[jira] [Commented] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271930#comment-16271930 ] Sean Owen commented on SPARK-22658: --- I don't see a strong reason this needs to be part of Spark. It shifts maintenance to the core project for not much of any gain. It also tends to bless a single deep-learning-on-Spark project among several. I would say 'no' to this, but instead focus on whatever changes in the core help support libraries like this (like the image representation SPIP recently) > SPIP: TeansorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark > > > Key: SPARK-22658 > URL: https://issues.apache.org/jira/browse/SPARK-22658 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Andy Feng > Attachments: SPIP_ TensorFlowOnSpark.pdf > > Original Estimate: 336h > Remaining Estimate: 336h > > TensorFlowOnSpark (TFoS) was released at github for distributed TensorFlow > training and inference on Apache Spark clusters. TFoS is designed to: > * Easily migrate all existing TensorFlow programs with minimum code change; > * Support all TensorFlow functionalities: synchronous/asynchronous training, > model/data parallelism, inference and TensorBoard; > * Easily integrate with your existing data processing pipelines (ex. Spark > SQL) and machine learning algorithms (ex. MLlib); > * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and > Infiniband. > We propose to merge TFoS into Apache Spark as a scalable deep learning > library to: > * Make deep learning easy for Apache Spark community: Familiar pipeline API > for training and inference; Enable TensorFlow training/inference on existing > Spark clusters. > * Further simplify data scientist experience: Ensure compatibility b/w Apache > Spark and TFoS; Reduce steps for installation. > * Help Apache Spark evolutions on deep learning: Establish a design pattern > for additional frameworks (ex. 
Caffe, CNTK); Structured streaming for DL > training/inference.
[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Feng updated SPARK-22658: -- Attachment: SPIP_ TensorFlowOnSpark.pdf > SPIP: TeansorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark > > > Key: SPARK-22658 > URL: https://issues.apache.org/jira/browse/SPARK-22658 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Andy Feng > Attachments: SPIP_ TensorFlowOnSpark.pdf > > Original Estimate: 336h > Remaining Estimate: 336h > > TensorFlowOnSpark (TFoS) was released at github for distributed TensorFlow > training and inference on Apache Spark clusters. TFoS is designed to: > * Easily migrate all existing TensorFlow programs with minimum code change; > * Support all TensorFlow functionalities: synchronous/asynchronous training, > model/data parallelism, inference and TensorBoard; > * Easily integrate with your existing data processing pipelines (ex. Spark > SQL) and machine learning algorithms (ex. MLlib); > * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and > Infiniband. > We propose to merge TFoS into Apache Spark as a scalable deep learning > library to: > * Make deep learning easy for Apache Spark community: Familiar pipeline API > for training and inference; Enable TensorFlow training/inference on existing > Spark clusters. > * Further simplify data scientist experience: Ensure compatibility b/w Apache > Spark and TFoS; Reduce steps for installation. > * Help Apache Spark evolutions on deep learning: Establish a design pattern > for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL > training/inference. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Feng updated SPARK-22658: -- Description: TensorFlowOnSpark (TFoS) was released at github for distributed TensorFlow training and inference on Apache Spark clusters. TFoS is designed to: * Easily migrate all existing TensorFlow programs with minimum code change; * Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inference and TensorBoard; * Easily integrate with your existing data processing pipelines (ex. Spark SQL) and machine learning algorithms (ex. MLlib); * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband. We propose to merge TFoS into Apache Spark as a scalable deep learning library to: * Make deep learning easy for Apache Spark community: Familiar pipeline API for training and inference; Enable TensorFlow training/inference on existing Spark clusters. * Further simplify data scientist experience: Ensure compatibility b/w Apache Spark and TFoS; Reduce steps for installation. * Help Apache Spark evolutions on deep learning: Establish a design pattern for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL training/inference. was: SPIP: TeansorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark Authors: Lee Yang (Yahoo/Oath), Andrew Feng (Yahoo/Oath) Background and Motivation Deep learning has evolved significantly in recent years, and is often considered a desired mechanism to gain insight from massive amounts of data. TensorFlow is currently the most popular deep learning library, and has been adopted by many organizations to solve a variety of use cases. After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In Feburary 2017, TensorFlowOnSpark (TFoS) was released for distributed TensorFlow training and inference on Apache Spark clusters. 
TFoS is designed to: Easily migrate all existing TensorFlow programs with minimum code change; Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inference and TensorBoard; Easily integrate with your existing data processing pipelines (ex. Spark SQL) and machine learning algorithms (ex. MLlib); Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband. At Yahoo/Oath, TFoS has become the most popular deep learning framework for many types of mission critical use cases, many which use 10’s servers of CPU or GPU. Outside Yahoo, TFoS has generated interest from LinkedIn, Paytm Labs, Hops Hadoop, Cloudera, MapR and Google. TFoS has become a popular choice for distributed TensorFlow applications on Spark clusters. We propose to merge TFoS into Apache Spark as a scalable deep learning library to: Make deep learning easy for Apache Spark community Familiar pipeline API for training and inference Enable TensorFlow training/inference on existing Spark clusters Further simplify data scientist experience Ensure compatibility b/w Apache Spark and TFoS Reduce steps for installation Help Apache Spark evolutions on deep learning Establish a design pattern for additional frameworks (ex. Caffe, BigDL, CNTK) Structured streaming for DL training/inference Target Personas Data scientists Data engineers Library developers Goals Spark ML style API for distributed TensorFlow training and inference Support all types of TensorFlow applications (ex. asynchronous learning, model parallelism) and functionalities (ex. 
TensorBoard) Support all TensorFlow trained models to be used for scalable inference and transfer learning with ZERO custom code Support all Spark schedulers, including standalone, YARN, and Mesos Support TensorFlow 1.0 and later Initially Python API only Scala and Java API could be added for inference later Non-Goals Deep learning frameworks beyond TensorFlow Non-distributed TensorFlow applications on Apache Spark (ex. single node, or parallel execution for hyper-parameter search) Proposed API Changes Pipeline API: TFEstimator model = TFEstimator(train_fn, tf_args) .setInputMapping({“image”: “placeholder_X”, “label”: “placeholder_Y”}) .setModelDir(“my_model_checkpoints”) .setSteps(1) .setEpochs(10) .fit(training_data_frame) TFEstimator is a Spark ML estimator which launches a TensorFlowOnSpark cluster for distributed training. Its constructor TFEstimator(train_fn, tf_args, export_fn) accepts the following arguments: train_fn ... TensorFlow "main" function for training. tf_args ... Dictionary of arguments specific to TensorFlow "main" function. export_fn ... TensorFlow function for exporting a saved_model. TFEstimator has a collection of parameters including InputMapping … Mapping of input DataFrame column to input tensor ModelDir … Path to save/load model checkpoints ExportDir … Directory
[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Feng updated SPARK-22658: -- Description: SPIP: TeansorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark Authors: Lee Yang (Yahoo/Oath), Andrew Feng (Yahoo/Oath) Background and Motivation Deep learning has evolved significantly in recent years, and is often considered a desired mechanism to gain insight from massive amounts of data. TensorFlow is currently the most popular deep learning library, and has been adopted by many organizations to solve a variety of use cases. After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In Feburary 2017, TensorFlowOnSpark (TFoS) was released for distributed TensorFlow training and inference on Apache Spark clusters. TFoS is designed to: Easily migrate all existing TensorFlow programs with minimum code change; Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inference and TensorBoard; Easily integrate with your existing data processing pipelines (ex. Spark SQL) and machine learning algorithms (ex. MLlib); Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband. At Yahoo/Oath, TFoS has become the most popular deep learning framework for many types of mission critical use cases, many which use 10’s servers of CPU or GPU. Outside Yahoo, TFoS has generated interest from LinkedIn, Paytm Labs, Hops Hadoop, Cloudera, MapR and Google. TFoS has become a popular choice for distributed TensorFlow applications on Spark clusters. 
We propose to merge TFoS into Apache Spark as a scalable deep learning library to: Make deep learning easy for Apache Spark community Familiar pipeline API for training and inference Enable TensorFlow training/inference on existing Spark clusters Further simplify data scientist experience Ensure compatibility b/w Apache Spark and TFoS Reduce steps for installation Help Apache Spark evolutions on deep learning Establish a design pattern for additional frameworks (ex. Caffe, BigDL, CNTK) Structured streaming for DL training/inference Target Personas Data scientists Data engineers Library developers Goals Spark ML style API for distributed TensorFlow training and inference Support all types of TensorFlow applications (ex. asynchronous learning, model parallelism) and functionalities (ex. TensorBoard) Support all TensorFlow trained models to be used for scalable inference and transfer learning with ZERO custom code Support all Spark schedulers, including standalone, YARN, and Mesos Support TensorFlow 1.0 and later Initially Python API only Scala and Java API could be added for inference later Non-Goals Deep learning frameworks beyond TensorFlow Non-distributed TensorFlow applications on Apache Spark (ex. single node, or parallel execution for hyper-parameter search) Proposed API Changes Pipeline API: TFEstimator model = TFEstimator(train_fn, tf_args) .setInputMapping({“image”: “placeholder_X”, “label”: “placeholder_Y”}) .setModelDir(“my_model_checkpoints”) .setSteps(1) .setEpochs(10) .fit(training_data_frame) TFEstimator is a Spark ML estimator which launches a TensorFlowOnSpark cluster for distributed training. Its constructor TFEstimator(train_fn, tf_args, export_fn) accepts the following arguments: train_fn ... TensorFlow "main" function for training. tf_args ... Dictionary of arguments specific to TensorFlow "main" function. export_fn ... TensorFlow function for exporting a saved_model. 
TFEstimator has a collection of parameters including InputMapping … Mapping of input DataFrame column to input tensor ModelDir … Path to save/load model checkpoints ExportDir … Directory to export saved_model BatchSize … Number of records per batch (default: 100) ClusterSize … Number of nodes in the cluster (default: 1) NumPS … Number of PS nodes in cluster (default: 0) Readers … Number of reader/enqueue threads (default: 1) Tensorboard … Boolean flag indicating tensorboard launch or not (default: false) Steps … Maximum number of steps to train (default: 1000) Epochs … Number of epochs to train (default: 1) Protocol … Network protocol for Tensorflow (grpc|rdma) (default: grpc) InputMode … Input data feeding mode (TENSORFLOW, SPARK) (default: SPARK) TFEstimator.fit(dataset) trains a TensorFlow model based on the given training dataset. The training dataset is a Spark DataFrame with columns that will be mapped to TensorFlow tensors as specified by InputMapping parameter. TFEstimator.fit() returns a TFModel instance representing the trained model, backed on disk by a TensorFlow checkpoint or saved_model. TensorFlow Training Application: train_fun(tf_args, TFContext) The 1st argument for TFEstimator, train_fun, allows custom TensorFlow applications to be easily plugged into the Spark
[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andy Feng updated SPARK-22658: -- Description: In Feburary 2017, TensorFlowOnSpark (TFoS) was released for distributed TensorFlow training and inference on Apache Spark clusters. TFoS is designed to: * Easily migrate all existing TensorFlow programs with minimum code change; * Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inference and TensorBoard; * Easily integrate with your existing data processing pipelines (ex. Spark SQL) and machine learning algorithms (ex. MLlib); * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband. We propose to merge TFoS into Apache Spark as a scalable deep learning library to: * Make deep learning easy for Apache Spark community: Familiar pipeline API for training and inference; Enable TensorFlow training/inference on existing Spark clusters. * Further simplify data scientist experience: Ensure compatibility b/w Apache Spark and TFoS; Reduce steps for installation. * Help Apache Spark evolution on deep learning: Establish a design pattern for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL training/inference. was: In Feburary 2017, TensorFlowOnSpark (TFoS) was released for distributed TensorFlow training and inference on Apache Spark clusters. TFoS is designed to: * Easily migrate all existing TensorFlow programs with minimum code change; * Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inference and TensorBoard; * Easily integrate with your existing data processing pipelines (ex. Spark SQL) and machine learning algorithms (ex. MLlib); * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband. 
We propose to merge TFoS into Apache Spark as a scalable deep learning library to: * Make deep learning easy for Apache Spark community: Familiar pipeline API for training and inference; Enable TensorFlow training/inference on existing Spark clusters. * Further simplify data scientist experience: Ensure compatibility b/w Apache Spark and TFoS; Reduce steps for installation. * Help Apache Spark evolution on deep learning: Establish a design pattern for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL training/inference. > SPIP: TeansorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark > > > Key: SPARK-22658 > URL: https://issues.apache.org/jira/browse/SPARK-22658 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.2.0 >Reporter: Andy Feng > Original Estimate: 336h > Remaining Estimate: 336h > > In Feburary 2017, TensorFlowOnSpark (TFoS) was released for distributed > TensorFlow training and inference on Apache Spark clusters. TFoS is designed > to: >* Easily migrate all existing TensorFlow programs with minimum code change; >* Support all TensorFlow functionalities: synchronous/asynchronous > training, model/data parallelism, inference and TensorBoard; >* Easily integrate with your existing data processing pipelines (ex. Spark > SQL) and machine learning algorithms (ex. MLlib); >* Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and > Infiniband. > We propose to merge TFoS into Apache Spark as a scalable deep learning > library to: > * Make deep learning easy for Apache Spark community: Familiar pipeline API > for training and inference; Enable TensorFlow training/inference on existing > Spark clusters. > * Further simplify data scientist experience: Ensure compatibility b/w Apache > Spark and TFoS; Reduce steps for installation. > * Help Apache Spark evolution on deep learning: Establish a design pattern > for additional frameworks (ex. 
Caffe, CNTK); Structured streaming for DL > training/inference.
[jira] [Created] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
Andy Feng created SPARK-22658: - Summary: SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark Key: SPARK-22658 URL: https://issues.apache.org/jira/browse/SPARK-22658 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: Andy Feng In February 2017, TensorFlowOnSpark (TFoS) was released for distributed TensorFlow training and inference on Apache Spark clusters. TFoS is designed to: * Easily migrate all existing TensorFlow programs with minimum code change; * Support all TensorFlow functionalities: synchronous/asynchronous training, model/data parallelism, inference and TensorBoard; * Easily integrate with your existing data processing pipelines (ex. Spark SQL) and machine learning algorithms (ex. MLlib); * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband. We propose to merge TFoS into Apache Spark as a scalable deep learning library to: * Make deep learning easy for the Apache Spark community: Familiar pipeline API for training and inference; Enable TensorFlow training/inference on existing Spark clusters. * Further simplify the data scientist experience: Ensure compatibility b/w Apache Spark and TFoS; Reduce steps for installation. * Help Apache Spark evolution on deep learning: Establish a design pattern for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL training/inference.
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or another jar when the --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue, run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
 --packages com.github.scopt:scopt_2.11:3.5.0 \
 --conf spark.cores.max=8 \
 --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \
 --conf spark.mesos.executor.docker.forcePullImage=true \
 --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
 --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out

within a container created from the mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image. You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n". This can be reproduced with local[*] as well; there is no need to use Mesos, as this is not a Mesos bug. The specific Spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala It can be built with sbt assembly in that directory. Adding this code at the beginning of the main method: https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb you get the following output: https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4 (use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to get the modified job). The job works fine if --packages is not used.
The commit that introduced this issue (before it, things work as expected) is 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark (Thu, 6 Jul 2017 15:32:49 +0800). The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 In https://github.com/apache/spark/pull/18235/files, check line 950: this is where a FileSystem is first created, so the FileSystem class is initialized there, before the main method of the Spark job is launched. The reason is that the --packages logic uses Hadoop libraries to download files, and that Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n implementation to add to the static map (SERVICE_FILE_SYSTEMS) when the FileSystem static members are first initialized and filled, as a side effect of the first FileSystem instance being created. Later, in the Spark job's main method, when we try to access the s3n filesystem (creating a second FileSystem), we get the exception: at this point the app jar contains the s3n implementation and it is on the classpath, but that scheme was never loaded into the FileSystem class's static map. Setting hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the problem is the static map, which is filled once and only once. That is why we see two prints of the map contents in the output (gist) above when --packages is used; the first print is before creating the s3n filesystem, and we use reflection there to get the static map's entries. When --packages is not used, that map is empty before the s3n filesystem is created, since up to that point the FileSystem class has not yet been loaded by the classloader.
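The fill-once behavior described above can be sketched with a self-contained simulation. This is not Hadoop code: the registry class, field names, and dictionaries below are hypothetical stand-ins for FileSystem's static SERVICE_FILE_SYSTEMS map, used only to show why a scheme that appears on the classpath after the first FileSystem is created can never be found, and why a cache-disable flag cannot help.

```python
# Stand-in for "implementations visible on the classpath right now".
available_impls = {"hdfs": "HdfsFileSystem", "file": "LocalFileSystem"}

class FileSystemRegistry:
    _schemes = None  # analogous to FileSystem's static SERVICE_FILE_SYSTEMS map

    @classmethod
    def get(cls, scheme):
        if cls._schemes is None:
            # Filled exactly once, like the static initializer triggered by the
            # first FileSystem instance that --packages creates for Maven
            # resolution -- before the app jar is on the classpath.
            cls._schemes = dict(available_impls)
        try:
            return cls._schemes[scheme]
        except KeyError:
            raise IOError(f"No FileSystem for scheme: {scheme}")

# 1. The --packages download logic touches the registry early:
FileSystemRegistry.get("file")

# 2. The app jar later "adds" an s3n implementation to the classpath:
available_impls["s3n"] = "NativeS3FileSystem"

# 3. The job's main method still fails, because the registry never re-scans:
try:
    FileSystemRegistry.get("s3n")
except IOError as e:
    print(e)  # No FileSystem for scheme: s3n
```

A per-scheme setting like fs.s3n.impl.disable.cache only bypasses the FileSystem instance cache; it does nothing about the scheme registry itself, which is exactly the map that was frozen too early.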
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. 
The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 https://github.com/apache/spark/pull/18235/files, check line 950, this is where a filesystem is first created. The Filesystem class is initialized there, before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled due to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem (create a second filesystem) we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output(gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used that map is empty before creating the s3n filesystem since up to that point the Filesystem class is not yet loaded by the classloader. 
was: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. The commit that introduced
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. 
The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 https://github.com/apache/spark/pull/18235/files, check line 950, this is where a filesystem is first created. The Filesystem class is initialized there, before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled due to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem (create a second filesystem) we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output(gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used that map is empty since the Filesystem class is not yet loaded by the classloader. 
was: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. The commit that introduced this issue is (before that things work as
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found [here] https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. 
The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 https://github.com/apache/spark/pull/18235/files, check line 950, this is where a filesystem is first created. The Filesystem class is initialized there, before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled due to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem (create a second filesystem) we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output(gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used that map is empty since the Filesystem class is not yet loaded by the classloader. 
was: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above is[[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. The commit that introduced this issue is (before that things work as
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found here: https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. 
The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 https://github.com/apache/spark/pull/18235/files, check line 950, this is where a filesystem is first created. The Filesystem class is initialized there, before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled due to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem (create a second filesystem) we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output(gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used that map is empty since the Filesystem class is not yet loaded by the classloader. 
was: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above can be found [here] https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. The commit that introduced this issue is (before that things work as
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well, no need to use mesos, this is not mesos bug. The specific spark job used above is[[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. 
The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 https://github.com/apache/spark/pull/18235/files, check line 950, this is where a filesystem is first created. The Filesystem class is initialized there, before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled due to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem (create a second filesystem) we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output(gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used that map is empty since the Filesystem class is not yet loaded by the classloader. 
was: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well. The specific spark job used above is[[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image You will get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well. The specific spark job used above is[[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. Can be built with sbt assembly in that dir. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the beginning of the main method... you get the following output : https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 (Use http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar to to get the modified job) The job works fine if --packages is not used. 
The commit that introduced this issue is (before that things work as expected): 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 The exception comes from here: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311 https://github.com/apache/spark/pull/18235/files, check line 950, this is where a filesystem is first created. The Filesystem class is initialized there, before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled due to the first FileSystem instance created (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem (create a second filesystem) we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output(gist) above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries. When --packages is not used that map is empty since the Filesystem class is not yet loaded by the classloader. 
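The failure mode just described, a static registry filled exactly once at class-initialization time, plus the reflection trick the gist uses to inspect it, can be modelled with a small self-contained toy. This is an illustration only, not Hadoop's actual FileSystem code; the class and field names merely mirror the ones mentioned above:

```java
import java.io.IOException;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy model of the behaviour described above (not Hadoop's real code).
// A private static scheme -> implementation map is filled exactly once,
// when the class is first initialized. Anything that lands on the
// classpath after that moment never makes it into the map, so a later
// lookup for "s3n" fails even though the implementation jar is present.
public class ToyFileSystem {
    private static final Map<String, String> SERVICE_FILE_SYSTEMS = new HashMap<>();

    static {
        // Runs once, and only sees what was discoverable at class-init time,
        // i.e. before --packages resolution added the app jar to the path.
        SERVICE_FILE_SYSTEMS.put("file", "LocalFileSystem");
        SERVICE_FILE_SYSTEMS.put("hdfs", "DistributedFileSystem");
    }

    public static String getFileSystemClass(String scheme) throws IOException {
        String impl = SERVICE_FILE_SYSTEMS.get(scheme);
        if (impl == null) {
            throw new IOException("No FileSystem for scheme: " + scheme);
        }
        return impl;
    }

    // The reflection trick mentioned in the description: read the private
    // static map's keys to see which schemes were actually registered.
    public static Set<String> loadedSchemes() {
        try {
            Field f = ToyFileSystem.class.getDeclaredField("SERVICE_FILE_SYSTEMS");
            f.setAccessible(true);
            @SuppressWarnings("unchecked")
            Map<String, String> m = (Map<String, String>) f.get(null);
            return m.keySet();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Nothing short of re-running the static initializer changes the map's contents, which is why a per-job setting such as fs.s3n.impl.disable.cache makes no difference in this model either.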
[jira] [Commented] (SPARK-22646) Spark on Kubernetes - basic submission client
[ https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271818#comment-16271818 ] Yinan Li commented on SPARK-22646: -- The "Component/s" field should be updated. > Spark on Kubernetes - basic submission client > - > > Key: SPARK-22646 > URL: https://issues.apache.org/jira/browse/SPARK-22646 > Project: Spark > Issue Type: Sub-task > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > The submission client is responsible for creating the Kubernetes pod that > runs the Spark driver. It is a set of client-side changes to enable the > scheduler backend. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
[ https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stavros Kontopoulos updated SPARK-22657: Description: To reproduce this issue run: ``` ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out ``` within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 You get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well. The specific spark job used is[[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df You get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 The commit that introduced this is: 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 https://github.com/apache/spark/pull/18235/files check line 950 The Filesystem class is initialized already before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled (SERVICE_FILE_SYSTEMS). 
Later in the spark job main where we try to access the s3n filesystem we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints of the map contents in the output above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries btw. When --packages is not used that map is empty since the Filesystem class is not yet loaded by the classloader. was: Reproduce run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 You get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be run reproduced with local[*] as well. The specific spark job used is[[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. 
Using this code : https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df You get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 The commit that introduced this is: 5800144a54f5c0180ccf67392f32c3e8a51119b1[m -[33m[m [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark [32m(5 months ago) [1;34m[m Thu, 6 Jul 2017 15:32:49 +0800 https://github.com/apache/spark/pull/18235/files check line 950 The Filesystem class is initialized already before the main of the spark job is launched... the reason is --packages logic uses hadoop libraries to download files Maven resolution happens before the app jar and the resolved jars are added to the classpath. So at that moment there is no s3n to add to the static map when the Filesystem static members are first initialized and also filled (SERVICE_FILE_SYSTEMS). Later in the spark job main where we try to access the s3n filesystem we get the exception (at this point the app jar has the s3n implementation in it and its on the class path but that scheme is not loaded in the static map of the Filesystem class)... hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the problem is with the static map which is filled once and only once. That's why we see two prints
[jira] [Created] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used
Stavros Kontopoulos created SPARK-22657: --- Summary: Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used Key: SPARK-22657 URL: https://issues.apache.org/jira/browse/SPARK-22657 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Stavros Kontopoulos To reproduce, run: ./bin/spark-submit --master mesos://leader.mesos:5050 \ --packages com.github.scopt:scopt_2.11:3.5.0 \ --conf spark.cores.max=8 \ --conf spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 \ --conf spark.mesos.executor.docker.forcePullImage=true \ --class S3Job http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \ --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 You get: "Exception in thread "main" java.io.IOException: No FileSystem for scheme: s3n" This can be reproduced with local[*] as well. The specific Spark job used is [[ https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala | here ]]. Using this code: https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df you get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5 The commit that introduced this is: 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark (Thu, 6 Jul 2017 15:32:49 +0800) https://github.com/apache/spark/pull/18235/files check line 950. The FileSystem class is already initialized before the main of the Spark job is launched; the reason is that the --packages logic uses Hadoop libraries to download files, and Maven resolution happens before the app jar and the resolved jars are added to the classpath.
So at that moment there is no s3n implementation to add to the static map when the FileSystem static members are first initialized and filled (SERVICE_FILE_SYSTEMS). Later, in the Spark job's main, where we try to access the s3n filesystem, we get the exception: at this point the app jar has the s3n implementation in it and it's on the classpath, but that scheme is not loaded in the static map of the FileSystem class. hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the problem is with the static map, which is filled once and only once. That's why we see two prints of the map contents in the output above when --packages is used. The first print is before creating the s3n filesystem. We use reflection there to get the static map's entries, btw. When --packages is not used, that map is empty, since the FileSystem class is not yet loaded by the classloader.
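A possible mitigation (my assumption, not something verified in this issue): Hadoop's FileSystem.getFileSystemClass consults the fs.<scheme>.impl key in the job Configuration before falling back to the static SERVICE_FILE_SYSTEMS map, so naming the implementation class explicitly should sidestep the load-once problem:

```java
// Hypothetical workaround sketch (requires hadoop-common and the s3n
// implementation on the classpath; not part of the issue itself).
// An explicit fs.s3n.impl entry is read directly from the Configuration,
// so the lookup never touches the ServiceLoader-populated static map.
hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
```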
[jira] [Commented] (SPARK-22647) Docker files for image creation
[ https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271813#comment-16271813 ] Yinan Li commented on SPARK-22647: -- Note: Some reference Dockerfiles are included in https://github.com/apache/spark/pull/19717. > Docker files for image creation > --- > > Key: SPARK-22647 > URL: https://issues.apache.org/jira/browse/SPARK-22647 > Project: Spark > Issue Type: Sub-task > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > This covers the dockerfiles that need to be shipped to enable the Kubernetes > backend for Spark.
[jira] [Created] (SPARK-22656) Upgrade Arrow to 0.8.0
Shixiong Zhu created SPARK-22656: Summary: Upgrade Arrow to 0.8.0 Key: SPARK-22656 URL: https://issues.apache.org/jira/browse/SPARK-22656 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 2.2.0 Reporter: Shixiong Zhu Arrow 0.8.0 will upgrade Netty to 4.1.x and unblock SPARK-19552
[jira] [Resolved] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)
[ https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-20650. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19750 [https://github.com/apache/spark/pull/19750] > Remove JobProgressListener (and other unneeded classes) > --- > > Key: SPARK-20650 > URL: https://issues.apache.org/jira/browse/SPARK-20650 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin > Fix For: 2.3.0 > > > See spec in parent issue (SPARK-18085) for more details. > This task tracks removing JobProgressListener and other classes that will be > made obsolete by the other changes in this project, and making adjustments to > parts of the code that still rely on them.
[jira] [Resolved] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark
[ https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-18935. Resolution: Fixed Assignee: Stavros Kontopoulos Fix Version/s: 2.3.0 > Use Mesos "Dynamic Reservation" resource for Spark > -- > > Key: SPARK-18935 > URL: https://issues.apache.org/jira/browse/SPARK-18935 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0, 2.0.1, 2.0.2 >Reporter: jackyoh >Assignee: Stavros Kontopoulos > Fix For: 2.3.0 > > > I'm running Spark on Apache Mesos. > Please follow these steps to reproduce the issue: > 1. First, run a Mesos resource reserve: > curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d > resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]' > -X POST http://192.168.1.118:5050/master/reserve > 2. Then run the spark-submit command: > ./spark-submit --class org.apache.spark.examples.SparkPi --master > mesos://192.168.1.118:5050 --conf spark.mesos.role=spark > ../examples/jars/spark-examples_2.11-2.0.2.jar 1 > And the console will keep logging the same warning message as shown below: > 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any > resources; check your cluster UI to ensure that workers are registered and > have sufficient resources
[jira] [Commented] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown
[ https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271626#comment-16271626 ] Apache Spark commented on SPARK-22655: -- User 'icexelloss' has created a pull request for this issue: https://github.com/apache/spark/pull/19852 > Fail task instead of complete task silently in PythonRunner during shutdown > --- > > Key: SPARK-22655 > URL: https://issues.apache.org/jira/browse/SPARK-22655 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Li Jin > > We have observed in our production environment that during Spark shutdown, if > there are some active tasks, sometimes they will complete with incorrect > results. We've tracked down the issue to a PythonRunner where it is returning > partial result instead of throwing exception during Spark shutdown. > I think the better way to handle this is to have these tasks fail instead of > complete with partial results (complete with partial is always bad IMHO)
[jira] [Assigned] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown
[ https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22655: Assignee: Apache Spark > Fail task instead of complete task silently in PythonRunner during shutdown > --- > > Key: SPARK-22655 > URL: https://issues.apache.org/jira/browse/SPARK-22655 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Li Jin >Assignee: Apache Spark > > We have observed in our production environment that during Spark shutdown, if > there are some active tasks, sometimes they will complete with incorrect > results. We've tracked down the issue to a PythonRunner where it is returning > partial result instead of throwing exception during Spark shutdown. > I think the better way to handle this is to have these tasks fail instead of > complete with partial results (complete with partial is always bad IMHO)
[jira] [Assigned] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown
[ https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22655: Assignee: (was: Apache Spark) > Fail task instead of complete task silently in PythonRunner during shutdown > --- > > Key: SPARK-22655 > URL: https://issues.apache.org/jira/browse/SPARK-22655 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Li Jin > > We have observed in our production environment that during Spark shutdown, if > there are some active tasks, sometimes they will complete with incorrect > results. We've tracked down the issue to a PythonRunner where it is returning > partial result instead of throwing exception during Spark shutdown. > I think the better way to handle this is to have these tasks fail instead of > complete with partial results (complete with partial is always bad IMHO)
[jira] [Commented] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown
[ https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271620#comment-16271620 ] Li Jin commented on SPARK-22655: PR: https://github.com/apache/spark/pull/19852 > Fail task instead of complete task silently in PythonRunner during shutdown > --- > > Key: SPARK-22655 > URL: https://issues.apache.org/jira/browse/SPARK-22655 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Li Jin > > We have observed in our production environment that during Spark shutdown, if > there are some active tasks, sometimes they will complete with incorrect > results. We've tracked down the issue to a PythonRunner where it is returning > partial result instead of throwing exception during Spark shutdown. > I think the better way to handle this is to have these tasks fail instead of > complete with partial results (complete with partial is always bad IMHO)
[jira] [Created] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown
Li Jin created SPARK-22655: -- Summary: Fail task instead of complete task silently in PythonRunner during shutdown Key: SPARK-22655 URL: https://issues.apache.org/jira/browse/SPARK-22655 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0, 2.1.0, 2.0.2 Reporter: Li Jin We have observed in our production environment that during Spark shutdown, if there are some active tasks, sometimes they will complete with incorrect results. We've tracked down the issue to a PythonRunner where it is returning partial result instead of throwing exception during Spark shutdown. I think the better way to handle this is to have these tasks fail instead of complete with partial results (complete with partial is always bad IMHO)
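The fix direction proposed in the report ("have these tasks fail instead of complete with partial results") can be sketched as a toy with hypothetical names; this is not Spark's actual PythonRunner code:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy illustration (not Spark's actual PythonRunner code) of the principle
// in the report: if the producer shut down mid-stream, throw rather than
// silently returning the partial result collected so far.
public class PartialResultGuard {
    public static List<Integer> drain(Iterator<Integer> source, boolean shuttingDown) {
        List<Integer> out = new ArrayList<>();
        while (source.hasNext()) {
            out.add(source.next());
        }
        if (shuttingDown) {
            // Fail loudly: a partial result returned as if complete is worse
            // than a failed task, which the scheduler can retry.
            throw new IllegalStateException("shutdown in progress; result may be partial");
        }
        return out;
    }
}
```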
[jira] [Commented] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-22654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271365#comment-16271365 ] Apache Spark commented on SPARK-22654: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/19851 > Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite > --- > > Key: SPARK-22654 > URL: https://issues.apache.org/jira/browse/SPARK-22654 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Minor > > HiveExternalCatalogVersionsSuite has failed a few times apparently after > failing to download Spark tarballs from a particular mirror. This could be > mitigated with some retry logic, at least.
[jira] [Assigned] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-22654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22654:
------------------------------------

Assignee: Apache Spark (was: Sean Owen)

> Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
[jira] [Assigned] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
[ https://issues.apache.org/jira/browse/SPARK-22654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22654:
------------------------------------

Assignee: Sean Owen (was: Apache Spark)

> Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
[jira] [Created] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
Sean Owen created SPARK-22654:
------------------------------

Summary: Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
Key: SPARK-22654
URL: https://issues.apache.org/jira/browse/SPARK-22654
Project: Spark
Issue Type: Bug
Components: SQL, Tests
Affects Versions: 2.3.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor

HiveExternalCatalogVersionsSuite has failed a few times, apparently after failing to download Spark tarballs from a particular mirror. This could be mitigated with some retry logic, at least.
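The retry logic suggested above can be sketched as follows. This is an assumption-level sketch: `withRetries` and the operation it wraps are hypothetical names, not the suite's actual download code:

```scala
import scala.util.{Failure, Success, Try}

// Retry an operation up to `attempts` times, rethrowing the last failure
// once attempts are exhausted.
def withRetries[T](attempts: Int)(op: () => T): T = {
  require(attempts > 0, "need at least one attempt")
  Try(op()) match {
    case Success(v)                 => v
    case Failure(_) if attempts > 1 => withRetries(attempts - 1)(op)
    case Failure(e)                 => throw e
  }
}
```

In the suite this would wrap the tarball download step, ideally rotating to a different ASF mirror on each failed attempt.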
[jira] [Commented] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[ https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271254#comment-16271254 ]

Apache Spark commented on SPARK-22653:
--------------------------------------

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/19850

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
> -----------------------------------------------------------------------------------
>
> Key: SPARK-22653
> URL: https://issues.apache.org/jira/browse/SPARK-22653
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 2.2.0
> Reporter: Thomas Graves
>
> In CoarseGrainedSchedulerBackend.RegisterExecutor, the executor data address (executorRef.address) can be null:
> {code}
> val data = new ExecutorData(executorRef, executorRef.address, hostname, cores, cores, logUrls)
> {code}
> There is actually code above it that handles this case:
> {code}
> // If the executor's rpc env is not listening for incoming connections, `hostPort`
> // will be null, and the client connection should be used to contact the executor.
> val executorAddress = if (executorRef.address != null) {
>   executorRef.address
> } else {
>   context.senderAddress
> }
> {code}
> But it doesn't use executorAddress when it creates the ExecutorData. This causes removeExecutor to never properly remove the entry from addressToExecutorId:
> {code}
> addressToExecutorId -= executorInfo.executorAddress
> {code}
> This is also a memory leak, and onDisconnected can call disableExecutor when it shouldn't.
[jira] [Assigned] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[ https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22653:
------------------------------------

Assignee: Apache Spark

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[jira] [Assigned] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[ https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22653:
------------------------------------

Assignee: (was: Apache Spark)

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[jira] [Commented] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[ https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271244#comment-16271244 ]

Apache Spark commented on SPARK-22653:
--------------------------------------

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/19849

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
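The fix described in the report amounts to reusing the null-checked address when building the executor record. A simplified sketch, with stand-in types (the real `ExecutorData` and RPC classes have more fields):

```scala
// Simplified stand-ins for the scheduler's RPC address and executor record.
case class RpcAddress(host: String, port: Int)
case class ExecutorData(executorAddress: RpcAddress)

// Mirror of the registration logic: prefer the executor's own address,
// fall back to the client connection's sender address when it is null,
// and (the fix) use that checked value when constructing ExecutorData.
def register(refAddress: RpcAddress, senderAddress: RpcAddress): ExecutorData = {
  val executorAddress = if (refAddress != null) refAddress else senderAddress
  ExecutorData(executorAddress) // not ExecutorData(refAddress)
}
```

Because the stored address is never null, `addressToExecutorId -= executorInfo.executorAddress` can actually find and remove the entry, closing the leak.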
[jira] [Updated] (SPARK-22599) Avoid extra reading for cached table
[ https://issues.apache.org/jira/browse/SPARK-22599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nan Zhu updated SPARK-22599:
----------------------------

Description:

In the current implementation of Spark, InMemoryTableExec reads all data in a cached table, filters CachedBatches according to stats, and passes the data to the downstream operators. This implementation makes it inefficient to keep the whole table in memory to serve various queries against different partitions of the table, which covers a certain portion of our users' scenarios.

The following is an example of such a use case: store_sales is a 1TB-sized table in cloud storage, partitioned by 'location'. The first query, Q1, outputs several metrics A, B, C for all stores in all locations. After that, a small team of 3 data scientists wants to do some causal analysis of the sales in different locations. To avoid unnecessary I/O and parquet/orc parsing overhead, they cache the whole table in memory in Q1.

With the current implementation, even though each data scientist is only interested in one out of three locations, the queries they submit to the Spark cluster still read the full 1TB of data. The reason for the extra reading is that CachedBatch is implemented as

{code}
case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: InternalRow)
{code}

where "stats" is a part of every CachedBatch, so batches can only be filtered for the output of the InMemoryTableExec operator by reading all data in the in-memory table as input. The extra reading is even more unacceptable when some of the table's data has been evicted to disk.

We propose to introduce a new type of block, a metadata block, for the partitions of the RDD representing data in the cached table. Every metadata block contains stats info for all columns in a partition and is saved to BlockManager when executing the compute() method for the partition, to minimize the number of bytes read.

More details can be found in the design doc: https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing

Performance test results:
Environment: 6 executors, each of which has 16 cores and 90G memory
Dataset: 1T TPCDS data
Queries: tested 4 queries (Q19, Q46, Q34, Q27) in https://github.com/databricks/spark-sql-perf/blob/c2224f37e50628c5c8691be69414ec7f5a3d919a/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala
Results: https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing

> Avoid extra reading for cached table
> ------------------------------------
>
> Key: SPARK-22599
> URL: https://issues.apache.org/jira/browse/SPARK-22599
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Nan Zhu
>
> In the current implementation of Spark, InMemoryTableExec read
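The metadata-block idea above can be illustrated with a minimal sketch. This is hypothetical and heavily simplified: stats are reduced to a min/max over a single Int column, and `PartitionMetadata` is not a real Spark type:

```scala
// Data and stats held separately: pruning consults only the metadata
// blocks, so the buffers of unneeded partitions are never read.
case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]])
case class PartitionMetadata(partitionId: Int, min: Int, max: Int)

// Return the ids of partitions whose stats overlap the query range [lo, hi];
// only these partitions' cached data would then be fetched and scanned.
def partitionsToRead(meta: Seq[PartitionMetadata], lo: Int, hi: Int): Seq[Int] =
  meta.filter(m => m.max >= lo && m.min <= hi).map(_.partitionId)
```

In the store_sales example, a data scientist querying one location would touch only the metadata blocks plus that location's partitions, instead of the full 1TB.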
[jira] [Commented] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
[ https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271154#comment-16271154 ]

Apache Spark commented on SPARK-22162:
--------------------------------------

User 'rezasafi' has created a pull request for this issue:
https://github.com/apache/spark/pull/19848

> Executors and the driver use inconsistent Job IDs during the new RDD commit protocol
> ------------------------------------------------------------------------------------
>
> Key: SPARK-22162
> URL: https://issues.apache.org/jira/browse/SPARK-22162
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0, 2.3.0
> Reporter: Reza Safi
>
> After the SPARK-18191 commit in pull request 15769, with the new commit protocol it is possible that the driver and executors use different job IDs during an RDD commit.
> In the old code, the variable stageId is part of the closure used to define the task, as you can see here:
> https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098
> As a result, a TaskAttemptId is constructed on the executors using the same "stageId" as the driver, since it is a value serialized on the driver. The value of stageId is actually the rdd.id, which is assigned here:
> https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084
> However, after the change in pull request 15769, the value is no longer part of the task closure that gets serialized by the driver. Instead, it is pulled from the taskContext, as you can see here: https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103
> and that value is then used to construct the TaskAttemptId on the executors: https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134
> taskContext has a stageId value which is set in DAGScheduler. So after the change, unlike the old code where rdd.id was used, an actual stage.id is used, which can differ between executors and the driver since it is no longer serialized.
> In summary, the old code consistently used rddId and just incorrectly named it "stageId". The new code uses a mix of rddId and stageId. There should be a consistent ID between executors and the driver.
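The consistency point can be illustrated with a toy sketch (hypothetical types; not Spark's actual TaskAttemptID construction): an ID captured in the serialized closure is identical on driver and executors, while an ID read from a per-side context need not be.

```scala
case class TaskAttemptId(jobId: Int, taskId: Int)

// Old behavior: the id (rdd.id, misnamed "stageId") is captured in the
// closure on the driver, so every executor reconstructs the same id.
def makeTaskClosure(capturedRddId: Int): Int => TaskAttemptId =
  (taskId: Int) => TaskAttemptId(capturedRddId, taskId)

// New behavior (the bug): the id is read from a per-side context, which
// can hold a different value on an executor than on the driver.
def makeTaskFromContext(contextStageId: () => Int): Int => TaskAttemptId =
  (taskId: Int) => TaskAttemptId(contextStageId(), taskId)
```

With the closure-captured id, driver and executors agree; with the context-read id, a mismatch between the two sides produces different TaskAttemptIds for the same task.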
[jira] [Resolved] (SPARK-22615) Handle more cases in PropagateEmptyRelation
[ https://issues.apache.org/jira/browse/SPARK-22615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li resolved SPARK-22615.
-----------------------------

Resolution: Fixed
Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> Handle more cases in PropagateEmptyRelation
> -------------------------------------------
>
> Key: SPARK-22615
> URL: https://issues.apache.org/jira/browse/SPARK-22615
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Fix For: 2.3.0
>
> Currently, the optimizer rule `PropagateEmptyRelation` does not handle the following cases:
> 1. empty relation as the right child in a left outer join
> 2. empty relation as the left child in a right outer join
> 3. empty relation as the right child in a left semi join
> 4. empty relation as the right child in a left anti join
> Cases #1 and #2 can be treated as a Cartesian product and cause an exception.
[jira] [Commented] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct
[ https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271085#comment-16271085 ] Apache Spark commented on SPARK-22641: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/19680 > Pyspark UDF relying on column added with withColumn after distinct > -- > > Key: SPARK-22641 > URL: https://issues.apache.org/jira/browse/SPARK-22641 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Andrew Duffy > > We seem to have found an issue with PySpark UDFs interacting with > {{withColumn}} when the UDF depends on the column added in {{withColumn}}, > but _only_ if {{withColumn}} is performed after a {{distinct()}}. > Simplest repro in a local PySpark shell: > {code} > import pyspark.sql.functions as F > @F.udf > def ident(x): > return x > spark.createDataFrame([{'a': '1'}]) \ > .distinct() \ > .withColumn('b', F.lit('qq')) \ > .withColumn('fails_here', ident('b')) \ > .collect() > {code} > This fails with the following exception: > {code} > : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: pythonUDF0#13 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374) > at > org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422) > at >
[jira] [Assigned] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct
[ https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22641:
------------------------------------

Assignee: Apache Spark

> Pyspark UDF relying on column added with withColumn after distinct
[jira] [Assigned] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct
[ https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-22641:
------------------------------------

Assignee: (was: Apache Spark)

> Pyspark UDF relying on column added with withColumn after distinct
[jira] [Commented] (SPARK-22625) Properly cleanup inheritable thread-locals
[ https://issues.apache.org/jira/browse/SPARK-22625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271041#comment-16271041 ] Sean Owen commented on SPARK-22625: --- I don't have any additional info for you. You can propose a PR. The issue is in part caused by a third-party library creating threads though. If it's a clean improvement to Spark, OK, but not really something to 'work around'. > Properly cleanup inheritable thread-locals > -- > > Key: SPARK-22625 > URL: https://issues.apache.org/jira/browse/SPARK-22625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Tolstopyatov Vsevolod > Labels: leak > > A memory leak is present due to inherited thread locals; SPARK-20558 didn't > fix it properly. > Our production application has the following logic: one thread is reading > from HDFS and another one creates a spark context, processes HDFS files and > then closes it on a regular schedule. > Depending on which thread started first, the SparkContext thread local may or may > not be inherited by the HDFS daemon (DataStreamer), causing a memory leak when > the streamer is created after the spark context. Memory consumption increases every > time a new spark context is created; related YourKit paths: > https://screencast.com/t/tgFBYMEpW > The problem is more general and is not related to HDFS in particular. > Proper fix: register all cloned properties (in `localProperties#childValue`) > in a ConcurrentHashMap and forcefully clear all of them in `SparkContext#close` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22625) Properly cleanup inheritable thread-locals
[ https://issues.apache.org/jira/browse/SPARK-22625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271038#comment-16271038 ] Tolstopyatov Vsevolod commented on SPARK-22625: --- Ping [~srowen] > Properly cleanup inheritable thread-locals > -- > > Key: SPARK-22625 > URL: https://issues.apache.org/jira/browse/SPARK-22625 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Tolstopyatov Vsevolod > Labels: leak > > A memory leak is present due to inherited thread locals; SPARK-20558 didn't > fix it properly. > Our production application has the following logic: one thread is reading > from HDFS and another one creates a spark context, processes HDFS files and > then closes it on a regular schedule. > Depending on which thread started first, the SparkContext thread local may or may > not be inherited by the HDFS daemon (DataStreamer), causing a memory leak when > the streamer is created after the spark context. Memory consumption increases every > time a new spark context is created; related YourKit paths: > https://screencast.com/t/tgFBYMEpW > The problem is more general and is not related to HDFS in particular. > Proper fix: register all cloned properties (in `localProperties#childValue`) > in a ConcurrentHashMap and forcefully clear all of them in `SparkContext#close` -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
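The inheritance behavior behind the leak described in SPARK-22625 is easy to demonstrate outside Spark. The following is a minimal Java sketch, not Spark code: the class and value names are illustrative stand-ins for SparkContext's `localProperties` and the HDFS DataStreamer thread.

```java
// Demonstrates that an InheritableThreadLocal's value is copied into any
// thread created while it is set -- the mechanism behind the reported leak:
// a long-lived daemon thread started after the context pins its own copy.
// Names are illustrative, not Spark's actual fields.
public class InheritableLeakDemo {
    static final InheritableThreadLocal<String> PROPS = new InheritableThreadLocal<>();

    // Starts a child thread and returns what that thread saw in PROPS.
    static String valueSeenByChild() throws InterruptedException {
        final String[] seen = new String[1];
        Thread child = new Thread(() -> seen[0] = PROPS.get());
        child.start();
        child.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        PROPS.set("context-properties");
        // A thread started while the slot is set inherits a copy of the value.
        String inherited = valueSeenByChild();
        // Clearing the parent's slot stops new threads from inheriting it,
        // but cannot reach copies already held by running threads.
        PROPS.remove();
        String afterRemove = valueSeenByChild();
        System.out.println("inherited by child: " + inherited);   // context-properties
        System.out.println("after remove: " + afterRemove);       // null
    }
}
```

This is why the proposed fix in the description tracks the cloned values explicitly (a ConcurrentHashMap of all copies handed out by `childValue`) so they can be cleared forcefully on shutdown.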
[jira] [Resolved] (SPARK-22622) OutOfMemory thrown by Closure Serializer without proper failure propagation
[ https://issues.apache.org/jira/browse/SPARK-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22622. --- Resolution: Duplicate It looks like, nevertheless, something huge is in your closure. It's probably not the data you think you're broadcasting. This is then at best part of SPARK-6235. This error would kill the driver process, unless you are trying to recover it. You'd have to post more detail about what you see. > OutOfMemory thrown by Closure Serializer without proper failure propagation > --- > > Key: SPARK-22622 > URL: https://issues.apache.org/jira/browse/SPARK-22622 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 > Environment: Spark 2.2.0 > Hadoop 2.9.0 >Reporter: Raghavendra >Priority: Critical > > While moving from one Stage to another, the Closure serializer is trying to > serialize the Closures and throwing OOMs. > This is happening when the RDD size crosses 70 GB. > I set the Driver Memory to 225 GB and yet the error persists. > There are two issues here: > * An OOM is thrown even though almost 3 times more Driver memory is provided than the > last Stage's RDD size. (We even tried caching this to disk before moving it > into the current stage.) > * After the error is thrown, the Spark Job does not exit; it just continues > in the same state without propagating the error to the Spark UI. 
> *Scenario 1* > {color:red}Exception in thread "dag-scheduler-event-loop" > java.lang.OutOfMemoryError: Requested array size exceeds VM limit > at java.util.Arrays.copyOf(Arrays.java:3236) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118) > at > java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) > at > org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {color} > *Scenario 2* > {color:red} >Exception in thread "dag-scheduler-event-loop" > java.lang.OutOfMemoryError > at > java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123) > at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117) > at > 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) > at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153) > at > org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41) > at > java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877) > at > java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189) > at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874) > at >
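Both stack traces above die inside `ByteArrayOutputStream` while the DAGScheduler serializes a task through `JavaSerializerInstance`, which is plain JDK object serialization over a byte buffer. A minimal sketch (not Spark code) shows how to measure what a captured object graph costs on that path, and why "something huge is in your closure" is the usual diagnosis:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Measures how large an object graph becomes under plain Java serialization --
// the same ObjectOutputStream-over-byte-buffer path the stack traces above
// go through when the driver serializes a task closure.
public class ClosureSizeCheck {
    static long serializedSize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);  // writes the entire reachable object graph
        }
        return bytes.size();
    }

    public static void main(String[] args) throws IOException {
        // A closure that accidentally captures a big array grows with the
        // array, because the whole referenced graph is written out.
        byte[] accidentallyCaptured = new byte[1 << 20]; // 1 MiB
        System.out.println("small closure: " + serializedSize("hello"));
        System.out.println("big closure:   " + serializedSize(accidentallyCaptured));
    }
}
```

This also suggests why raising driver memory alone cannot help here: the serializer's backing buffer is a single Java `byte[]`, which is capped near `Integer.MAX_VALUE` elements, so a closure that serializes to roughly 2 GB or more fails with "Requested array size exceeds VM limit" regardless of heap size.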
[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()
[ https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271013#comment-16271013 ] Sergio Lob commented on SPARK-22636: OK > row count not being set correctly (always 0) after Statement.executeUpdate() > > > Key: SPARK-22636 > URL: https://issues.apache.org/jira/browse/SPARK-22636 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.2.0 > Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 > 11:16:57 EDT 2014 x86_64 x > 86_64 x86_64 GNU/Linux >Reporter: Sergio Lob >Priority: Minor > > This is the similar complaint as HIVE-8244 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency
[ https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271012#comment-16271012 ] Sean Owen commented on SPARK-22634: --- Right, I see core depends on jets3t, so it's not just provided by Hadoop. Looks reasonable, but maybe worth bumping jets3t to 0.9.4 as well; it fixes some bugs and also ups the version of bouncy castle it wants, which may avoid problems with bumping bouncy castle yet further. > Update Bouncy castle dependency > --- > > Key: SPARK-22634 > URL: https://issues.apache.org/jira/browse/SPARK-22634 > Project: Spark > Issue Type: Task > Components: Spark Core, SQL, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Lior Regev >Priority: Minor > > Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka > streaming, uses bouncy castle version 1.51 > This is an outdated version, as the latest one is 1.58 > This, in turn, renders packages such as > [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds] > unusable, since these require 1.58 and Spark's distributions come along with > 1.51 > My own attempt was to run on EMR, and since I automatically get all of > Spark's dependencies (bouncy castle 1.51 being one of them) on the > classpath, using the library to parse blockchain data failed due to missing > functionality. > I have also opened an > [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency] > with jets3t to update their dependency as well, but along with that Spark > would have to update its own or at least be packaged with a newer version -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22646) Spark on Kubernetes - basic submission client
[ https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22646: Assignee: (was: Apache Spark) > Spark on Kubernetes - basic submission client > - > > Key: SPARK-22646 > URL: https://issues.apache.org/jira/browse/SPARK-22646 > Project: Spark > Issue Type: Sub-task > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > The submission client is responsible for creating the Kubernetes pod that > runs the Spark driver. It is a set of client-side changes to enable the > scheduler backend. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22646) Spark on Kubernetes - basic submission client
[ https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271011#comment-16271011 ] Apache Spark commented on SPARK-22646: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/19717 > Spark on Kubernetes - basic submission client > - > > Key: SPARK-22646 > URL: https://issues.apache.org/jira/browse/SPARK-22646 > Project: Spark > Issue Type: Sub-task > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > The submission client is responsible for creating the Kubernetes pod that > runs the Spark driver. It is a set of client-side changes to enable the > scheduler backend. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22646) Spark on Kubernetes - basic submission client
[ https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22646: Assignee: Apache Spark > Spark on Kubernetes - basic submission client > - > > Key: SPARK-22646 > URL: https://issues.apache.org/jira/browse/SPARK-22646 > Project: Spark > Issue Type: Sub-task > Components: Scheduler >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Apache Spark > > The submission client is responsible for creating the Kubernetes pod that > runs the Spark driver. It is a set of client-side changes to enable the > scheduler backend. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct
[ https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy updated SPARK-22641: - Description: We seem to have found an issue with PySpark UDFs interacting with {{withColumn}} when the UDF depends on the column added in {{withColumn}}, but _only_ if {{withColumn}} is performed after a {{distinct()}}. Simplest repro in a local PySpark shell: {code} import pyspark.sql.functions as F @F.udf def ident(x): return x spark.createDataFrame([{'a': '1'}]) \ .distinct() \ .withColumn('b', F.lit('qq')) \ .withColumn('fails_here', ident('b')) \ .collect() {code} This fails with the following exception: {code} : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#13 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38) at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422) at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:233) at
[jira] [Commented] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
[ https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270977#comment-16270977 ] Thomas Graves commented on SPARK-22653: --- will have a patch up shortly > executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap > is null > --- > > Key: SPARK-22653 > URL: https://issues.apache.org/jira/browse/SPARK-22653 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Thomas Graves > > In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address > (executorRef.address) can be null. > val data = new ExecutorData(executorRef, executorRef.address, hostname, > cores, cores, logUrls) > At this point the executorRef.address can be null; there is actually code > above it that handles this case: > // If the executor's rpc env is not listening for incoming connections, > `hostPort` > // will be null, and the client connection should be used to > contact the executor. > val executorAddress = if (executorRef.address != null) { > executorRef.address > } else { > context.senderAddress > } > But it doesn't use executorAddress when it creates the ExecutorData. > This causes removeExecutor to never remove it properly from > addressToExecutorId. > addressToExecutorId -= executorInfo.executorAddress > This is also a memory leak, and it can cause onDisconnected to call > disableExecutor when it shouldn't. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null
Thomas Graves created SPARK-22653: - Summary: executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null Key: SPARK-22653 URL: https://issues.apache.org/jira/browse/SPARK-22653 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.2.0 Reporter: Thomas Graves In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address (executorRef.address) can be null. val data = new ExecutorData(executorRef, executorRef.address, hostname, cores, cores, logUrls) At this point the executorRef.address can be null; there is actually code above it that handles this case: // If the executor's rpc env is not listening for incoming connections, `hostPort` // will be null, and the client connection should be used to contact the executor. val executorAddress = if (executorRef.address != null) { executorRef.address } else { context.senderAddress } But it doesn't use executorAddress when it creates the ExecutorData. This causes removeExecutor to never remove it properly from addressToExecutorId. addressToExecutorId -= executorInfo.executorAddress This is also a memory leak, and it can cause onDisconnected to call disableExecutor when it shouldn't. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
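The bookkeeping bug described above, computing a safe fallback address but then building the record from the raw, possibly-null one, reduces to a few lines. The following Java sketch is a simplified stand-in for the Scala code in the report, with class and field names mirroring it but not taken from Spark:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the leak described in SPARK-22653: the fallback address is
// computed at registration time, but the stored ExecutorData keeps the raw
// (possibly null) address, so the reverse-map entry can never be removed.
public class ExecutorRegistry {
    static final class ExecutorData {
        final String executorAddress; // may be null under the buggy registration
        ExecutorData(String executorAddress) { this.executorAddress = executorAddress; }
    }

    final Map<String, ExecutorData> executorDataMap = new HashMap<>();
    final Map<String, String> addressToExecutorId = new HashMap<>();

    void registerBuggy(String executorId, String refAddress, String senderAddress) {
        // The fallback is computed, exactly as in the report...
        String executorAddress = (refAddress != null) ? refAddress : senderAddress;
        addressToExecutorId.put(executorAddress, executorId);
        // ...but the data record is built from the raw refAddress instead.
        executorDataMap.put(executorId, new ExecutorData(refAddress));
    }

    void removeExecutor(String executorId) {
        ExecutorData info = executorDataMap.remove(executorId);
        if (info != null) {
            // When refAddress was null, this removes the (absent) null key
            // instead of the sender address -- the entry leaks forever.
            addressToExecutorId.remove(info.executorAddress);
        }
    }
}
```

Registering with a null `refAddress` and then removing the executor leaves the sender address stranded in `addressToExecutorId`, which is the memory leak and the spurious `onDisconnected`/`disableExecutor` behavior the report describes.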
[jira] [Assigned] (SPARK-22652) remove set methods in ColumnarRow
[ https://issues.apache.org/jira/browse/SPARK-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22652: Assignee: Wenchen Fan (was: Apache Spark) > remove set methods in ColumnarRow > - > > Key: SPARK-22652 > URL: https://issues.apache.org/jira/browse/SPARK-22652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22652) remove set methods in ColumnarRow
[ https://issues.apache.org/jira/browse/SPARK-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270953#comment-16270953 ] Apache Spark commented on SPARK-22652: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19847 > remove set methods in ColumnarRow > - > > Key: SPARK-22652 > URL: https://issues.apache.org/jira/browse/SPARK-22652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()
[ https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270956#comment-16270956 ] Sean Owen commented on SPARK-22636: --- Yes that's my understanding. We can wait a beat here to see if someone with more in-depth knowledge of this has a different opinion, but I believe this is foremost a Hive issue. > row count not being set correctly (always 0) after Statement.executeUpdate() > > > Key: SPARK-22636 > URL: https://issues.apache.org/jira/browse/SPARK-22636 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.2.0 > Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 > 11:16:57 EDT 2014 x86_64 x > 86_64 x86_64 GNU/Linux >Reporter: Sergio Lob >Priority: Minor > > This is the similar complaint as HIVE-8244 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22652) remove set methods in ColumnarRow
[ https://issues.apache.org/jira/browse/SPARK-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22652: Assignee: Apache Spark (was: Wenchen Fan) > remove set methods in ColumnarRow > - > > Key: SPARK-22652 > URL: https://issues.apache.org/jira/browse/SPARK-22652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22652) remove set methods in ColumnarRow
Wenchen Fan created SPARK-22652: --- Summary: remove set methods in ColumnarRow Key: SPARK-22652 URL: https://issues.apache.org/jira/browse/SPARK-22652 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct
[ https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Duffy updated SPARK-22641: - Description: We seem to have found an issue with PySpark UDFs interacting with {{withColumn}} when the UDF depends on the column added in {{withColumn}}, but _only_ if {{withColumn}} is performed after a {{distinct()}}. Simplest repro in a local PySpark shell: {code} import pyspark.sql.functions as F @F.udf def ident(x): return x spark.createDataFrame([{'a': '1'}]) \ .distinct() \ .withColumn('b', F.lit('qq')) \ .withColumn('fails_here', ident('b')) \ .collect() {code} This fails with the following exception: {code} py4j.protocol.Py4JJavaError: An error occurred while calling o263.collectToPython. : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#97 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.immutable.List.foreach(List.scala:381) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.immutable.List.map(List.scala:285) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85) at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38) at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374) at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at
[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()
[ https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270920#comment-16270920 ] Sergio Lob commented on SPARK-22636: Since we are using the Hive JDBC driver to access Spark, I guess that if there would be a JDBC fix, it would be in the Hive JDBC driver. Also, since Spark mimics Hive's functionality, I suppose you're implying that the functionality would have to be implemented in Hive first before being considered for Spark. Does that sound correct? > row count not being set correctly (always 0) after Statement.executeUpdate() > > > Key: SPARK-22636 > URL: https://issues.apache.org/jira/browse/SPARK-22636 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 2.2.0 > Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 > 11:16:57 EDT 2014 x86_64 x > 86_64 x86_64 GNU/Linux >Reporter: Sergio Lob >Priority: Minor > > This is the similar complaint as HIVE-8244 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270912#comment-16270912 ] Mark Petruska commented on SPARK-22393: --- [~srowen], [~rdub], I can now confirm that the original bug fix that was pushed to Scala 2.12 fixes this issue. Succeeded in retrofitting the same changes into Spark-shell, see: https://github.com/apache/spark/pull/19846. The original fix for Scala 2.12 can be found at: https://github.com/scala/scala/pull/5640 The downside is that the code/fix is not the most approachable, could not refactor for better readability (and also making sure it compiles :) ). > spark-shell can't find imported types in class constructors, extends clause > --- > > Key: SPARK-22393 > URL: https://issues.apache.org/jira/browse/SPARK-22393 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.2, 2.1.2, 2.2.0 >Reporter: Ryan Williams >Priority: Minor > > {code} > $ spark-shell > … > scala> import org.apache.spark.Partition > import org.apache.spark.Partition > scala> class P(p: Partition) > :11: error: not found: type Partition >class P(p: Partition) > ^ > scala> class P(val index: Int) extends Partition > :11: error: not found: type Partition >class P(val index: Int) extends Partition >^ > {code} > Any class that I {{import}} gives "not found: type ___" when used as a > parameter to a class, or in an extends clause; this applies to classes I > import from JARs I provide via {{--jars}} as well as core Spark classes as > above. > This worked in 1.6.3 but has been broken since 2.0.0. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22393: Assignee: Apache Spark
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270903#comment-16270903 ] Apache Spark commented on SPARK-22393: -- User 'mpetruska' has created a pull request for this issue: https://github.com/apache/spark/pull/19846
[jira] [Assigned] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22393: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause
[ https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270897#comment-16270897 ] Sean Owen commented on SPARK-22393: --- OK, so this is basically "fixed for Scala 2.12 only"?
[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()
[ https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270895#comment-16270895 ] Sean Owen commented on SPARK-22636: --- No. I'm saying the ticket you linked to does not sound like a bug (though it's marked as such, and you marked this one as such). It's a behavior change. It's also unresolved -- that doesn't mean unresolvable, but it does mean it is not something even Hive does now, and Spark generally matches Hive's semantics and functionality. I'm also not clear where you mean Spark would implement this; it does not implement a JDBC API like Statement.
[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()
[ https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270865#comment-16270865 ] Sergio Lob commented on SPARK-22636: Are you implying that it's not "fixable" in either Hive or Spark?
[jira] [Commented] (SPARK-22633) spark-submit.cmd cannot handle long arguments
[ https://issues.apache.org/jira/browse/SPARK-22633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270753#comment-16270753 ] Hyukjin Kwon commented on SPARK-22633: -- {{spark-submit2.cmd}} is only there for the purpose of isolating environment problems, BTW. Calling the {{2.cmd}} script directly is fine. > spark-submit.cmd cannot handle long arguments > - > > Key: SPARK-22633 > URL: https://issues.apache.org/jira/browse/SPARK-22633 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.1.1 > Environment: Windows 7 x64 >Reporter: Olivier Sannier > Labels: windows > > Hello, > Under Windows, one would use spark-submit.cmd with the parameters required to > submit a program to Spark; it has the following implementation: > {{cmd /V /E /C "%~dp0spark-submit2.cmd" %*}} > This spawns a second shell to ensure changes to the environment are local to > the script and do not leak to the caller. > But this has a major drawback: it hits the 2048-character limit for a > cmd.exe argument: > https://support.microsoft.com/en-us/help/830473/command-prompt-cmd--exe-command-line-string-limitation > One workaround is to call {{spark-submit2.cmd}} directly, but that means a > Windows-specific command. > The other solution is to remove the call to {{cmd}} and replace it with a > call to {{setlocal}} before calling {{spark-submit2.cmd}}, leading to this > code: > {{setlocal}} > {{"%~dp0spark-submit2.cmd" %*}} > Using this here solved the issue altogether, but I'm not sure it can be > applied to older Windows versions.
[jira] [Assigned] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients
[ https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22651: Assignee: Apache Spark > Calling ImageSchema.readImages initiate multiple Hive clients > - > > Key: SPARK-22651 > URL: https://issues.apache.org/jira/browse/SPARK-22651 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > While playing with images, I realised calling {{ImageSchema.readImages}} > multiple times seems to attempt to create multiple Hive clients. > {code} > from pyspark.ml.image import ImageSchema > data_path = 'data/mllib/images/kittens' > _ = ImageSchema.readImages(data_path, recursive=True, > dropImageFailures=True).collect() > _ = ImageSchema.readImages(data_path, recursive=True, > dropImageFailures=True).collect() > {code} > {code} > ... > org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test > connection to the given database. JDBC url = > jdbc:derby:;databaseName=metastore_db;create=true, username = APP. > Terminating connection pool (set lazyInit to true if you expect to start your > database after your app). Original Exception: -- > java.sql.SQLException: Failed to start database 'metastore_db' with class > loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see > the next exception for details. > ... > at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) > ... > at > org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762) > ... > at > org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180) > ... > at > org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348) > at > org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253) > ... 
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class > loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see > the next exception for details. > at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown > Source) > ... 121 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /.../spark/metastore_db. > ... > Traceback (most recent call last): > File "", line 1, in > File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages > dropImageFailures, float(sampleRatio), seed) > File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line > 1160, in __call__ > File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco > raise AnalysisException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: > java.lang.RuntimeException: Unable to instantiate > org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;' > {code}
[jira] [Resolved] (SPARK-22650) spark2.2 on yarn streaming can't connect hbase
[ https://issues.apache.org/jira/browse/SPARK-22650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22650. --- Resolution: Invalid Questions to the mailing list, please. > spark2.2 on yarn streaming can't connect hbase > -- > > Key: SPARK-22650 > URL: https://issues.apache.org/jira/browse/SPARK-22650 > Project: Spark > Issue Type: Bug > Components: SparkR, YARN >Affects Versions: 2.2.0 > Environment: HDFS 2.6.0+cdh5.5.2+992 > HttpFS 2.6.0+cdh5.5.2+992 > YARN 2.6.0+cdh5.5.2+992 > HBase 1.0.0+cdh5.5.2+297 > Spark spark-2.2.0-bin-hadoop2.6 >Reporter: ZHOUBEIHUA > Original Estimate: 96h > Remaining Estimate: 96h > > Hi, > We can't use Spark Streaming to connect to HBase in a Kerberos environment > with the Spark token. Can you give some advice on using Spark's own method > (rather than an HBase UGI) to connect to HBase?
[jira] [Assigned] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients
[ https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22651: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients
[ https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270721#comment-16270721 ] Apache Spark commented on SPARK-22651: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/19845
[jira] [Updated] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients
[ https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22651: - Component/s: ML
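The {{XSDB6}} error in the SPARK-22651 traces arises because embedded Derby permits only one booted instance per database directory, so constructing a second Hive client in the same process fails. A minimal sketch of the reuse pattern a fix would follow (hypothetical names; not Spark's actual code):

```python
import functools

class MetastoreClient:
    """Stand-in for an expensive, exclusive resource -- e.g. a Hive client
    backed by embedded Derby, which allows one instance per database dir."""
    instances = 0

    def __init__(self):
        MetastoreClient.instances += 1

@functools.lru_cache(maxsize=None)
def get_client():
    # Memoizing the constructor guarantees repeated calls share one client,
    # instead of booting a second Derby instance and hitting XSDB6.
    return MetastoreClient()

first, second = get_client(), get_client()
print(first is second, MetastoreClient.instances)  # True 1
```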
[jira] [Updated] (SPARK-22642) the createdTempDir will not be deleted if an exception occurs
[ https://issues.apache.org/jira/browse/SPARK-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-22642: -- Priority: Minor (was: Critical) This is hardly critical. > the createdTempDir will not be deleted if an exception occurs > - > > Key: SPARK-22642 > URL: https://issues.apache.org/jira/browse/SPARK-22642 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zuotingbing >Priority: Minor > > We found that staging directories are sometimes not dropped in our production > environment. > The createdTempDir will not be deleted if an exception occurs; we should > delete createdTempDir in a finally block. > Refer to SPARK-18703.
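The fix SPARK-22642 proposes -- deleting {{createdTempDir}} in a {{finally}} block so the staging directory is removed on both the success and failure paths -- can be sketched as follows (a generic Python illustration, not the actual Scala change):

```python
import os
import shutil
import tempfile

def run_with_staging_dir(work):
    """Create a staging directory and remove it even when `work` raises --
    the finally-block pattern the ticket proposes for createdTempDir."""
    created_temp_dir = tempfile.mkdtemp(prefix="staging-")
    try:
        return work(created_temp_dir)
    finally:
        # Runs on success and on exception alike, so the directory never leaks.
        shutil.rmtree(created_temp_dir, ignore_errors=True)

# Even on the failure path the staging directory is cleaned up:
seen = []
def failing_job(d):
    seen.append(d)
    raise RuntimeError("simulated write failure")

try:
    run_with_staging_dir(failing_job)
except RuntimeError:
    pass
print(os.path.exists(seen[0]))  # False
```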