[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272287#comment-16272287
 ] 

liyunzhang commented on SPARK-22660:


JDK9 is not supported yet; I am working on it.

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> 
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> {code}






[jira] [Comment Edited] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272249#comment-16272249
 ] 

liyunzhang edited comment on SPARK-22660 at 11/30/17 7:39 AM:
--

Some new errors:
{code}
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error] properties.putAll(propsMap.asJava)
[error]            ^
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]   props.putAll(outputSerdeProps.toMap.asJava)
[error]         ^
{code}

The key type is Object instead of String, which is unsafe.
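A hedged Scala sketch of one way to sidestep the ambiguous {{putAll}} overloads while keeping String keys and values (an illustration only, not necessarily the change that gets merged; {{propsMap}} here stands in for the real map):
{code}
import java.util.Properties

val propsMap: Map[String, String] = Map("serialization.format" -> "1")
val properties = new Properties()

// properties.putAll(propsMap.asJava)  // ambiguous under scala-2.12 + JDK9 (see errors above)
propsMap.foreach { case (k, v) => properties.setProperty(k, v) }  // keeps String keys/values
{code}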


was (Author: kellyzly):
Some new errors:
{code}
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error] properties.putAll(propsMap.asJava)
[error]            ^
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]   props.putAll(outputSerdeProps.toMap.asJava)
[error]         ^
{code}

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> 
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> {code}






[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272283#comment-16272283
 ] 

Liang-Chi Hsieh commented on SPARK-22660:
-

For the error you pinged me about, from the error message it looks like you can try adding {{import scala.language.reflectiveCalls}}?

Btw, are we supporting JDK9?

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> 
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> {code}






[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272281#comment-16272281
 ] 

liyunzhang commented on SPARK-22660:


[~viirya]: the error mentioned above no longer appears after I rebuilt. Sorry if you have already spent time on it.

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> 
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> {code}






[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272249#comment-16272249
 ] 

liyunzhang commented on SPARK-22660:


Some new errors:
{code}
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error] properties.putAll(propsMap.asJava)
[error]            ^
[error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition,
[error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
[error] and  method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
[error] match argument types (java.util.Map[String,String])
[error]   props.putAll(outputSerdeProps.toMap.asJava)
[error]         ^
{code}

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> 
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> {code}






[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272225#comment-16272225
 ] 

liyunzhang commented on SPARK-22660:


Besides the above error, there are other errors with scala-2.12:
{code}
/home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:151: reflective access of structural type member method getData should be enabled
  by making the implicit value scala.language.reflectiveCalls visible.
  This can be achieved by adding the import clause 'import scala.language.reflectiveCalls'
  or by setting the compiler option -language:reflectiveCalls.
  See the Scaladoc for value scala.language.reflectiveCalls for a discussion
  why the feature should be explicitly enabled.
    val rdd = sc.parallelize(1 to 1).map(concreteObject.getData)
        ^
/home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:175: reflective access of structural type member value innerObject2 should be enabled
  by making the implicit value scala.language.reflectiveCalls visible.
    val rdd = sc.parallelize(1 to 1).map(concreteObject.innerObject2.getData)
        ^
/home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/core/src/test/scala/org/apache/spark/util/ClosureCleanerSuite.scala:175: reflective access of structural type member method getData should be enabled
  by making the implicit value scala.language.reflectiveCalls visible.
    val rdd = sc.parallelize(1 to 1).map(concreteObject.innerObject2.getData)
        ^
{code}
[~viirya]: As you are familiar with SPARK-22328, do you know how to fix it?
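A small self-contained Scala sketch of the diagnostic (not the actual ClosureCleanerSuite code): an anonymous subclass adds a member, so its static type is a structural refinement, and calling that member needs reflective access, which is what {{import scala.language.reflectiveCalls}} (or {{-language:reflectiveCalls}}) enables:
{code}
import scala.language.reflectiveCalls  // without this, scalac reports the diagnostic quoted above

object ReflectiveCallsSketch {
  abstract class Base
  val concreteObject = new Base { def getData: Seq[Int] = 1 to 10 }

  // concreteObject's static type is Base { def getData: Seq[Int] }, so this call is reflective
  val data: Seq[Int] = concreteObject.getData
}
{code}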

> Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9
> 
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> Based on SPARK-22659
> 1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> {code}






[jira] [Created] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12 and JDK9

2017-11-29 Thread liyunzhang (JIRA)
liyunzhang created SPARK-22660:
--

 Summary: Use position() and limit() to fix ambiguity issue in 
scala-2.12 and JDK9
 Key: SPARK-22660
 URL: https://issues.apache.org/jira/browse/SPARK-22660
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.0
Reporter: liyunzhang


Based on SPARK-22659

1. compile with -Pscala-2.12 and get the error below. Use position() and limit() to fix the ambiguity issue.
{code}
spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
method limit in class Buffer of type ()Int
match expected type ?
 val resultSize = serializedDirectResult.limit
{code}
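A minimal Scala sketch of the ambiguity and of the parenthesized-call fix the summary refers to (the buffer value here is illustrative):
{code}
// Under scala-2.12 on JDK9, ByteBuffer declares limit(Int): ByteBuffer in addition to
// the inherited Buffer.limit(): Int, so the paren-less .limit above is ambiguous.
import java.nio.ByteBuffer

val serializedDirectResult: ByteBuffer = ByteBuffer.allocate(16)

// val resultSize = serializedDirectResult.limit     // ambiguous reference (see error above)
val resultSize: Int = serializedDirectResult.limit() // explicit no-arg call disambiguates
{code}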






[jira] [Assigned] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22659:


Assignee: (was: Apache Spark)

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.
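For context, a hedged Scala sketch of the supported JDK9+ replacement, java.lang.ref.Cleaner (an illustration of the API that supersedes sun.misc.Cleaner, not the change Spark eventually made; the names here are made up):
{code}
import java.lang.ref.Cleaner

object CleanerSketch {
  private val cleaner = Cleaner.create()

  // Register a cleanup action that runs once `buffer` becomes phantom reachable.
  // The action must not capture `buffer` itself, or the buffer would never be collected.
  def register(buffer: AnyRef, freeMemory: Runnable): Unit = {
    cleaner.register(buffer, freeMemory)
  }
}
{code}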






[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272187#comment-16272187
 ] 

Apache Spark commented on SPARK-22659:
--

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/19853

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.






[jira] [Assigned] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22659:


Assignee: Apache Spark

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>Assignee: Apache Spark
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.






[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272177#comment-16272177
 ] 

liyunzhang commented on SPARK-22659:


I am very confused about why this issue exists when java.version is 1.8 in pom.xml:
{code}
# grep -C2 java.version pom.xml
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<java.version>1.8</java.version>
<maven.compiler.source>${java.version}</maven.compiler.source>
<maven.compiler.target>${java.version}</maven.compiler.target>
{code}

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.






[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272123#comment-16272123
 ] 

liyunzhang commented on SPARK-22659:


So when you compile with -Pscala-2.12 and JDK8, this issue does not occur?

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.






[jira] [Commented] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272117#comment-16272117
 ] 

Sean Owen commented on SPARK-22659:
---

This isn't related to Scala 2.12 now. JDK 9 isn't supported, and this isn't the 
only reason. I don't think this is a valid issue therefore.

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.






[jira] [Updated] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated SPARK-22659:
---
Description: 
build with scala-2.12 with the following steps:
1. change the pom.xml with scala-2.12
{code}
 ./dev/change-scala-version.sh 2.12
{code}
2. build with -Pscala-2.12
{code}
./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
{code}

get the error
{code}
/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
{code}
This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.

  was:
the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes the build to fail with -Pscala-2.12.

Suggest using {{scala.binary.version}} to solve this.





> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> build with scala-2.12 with the following steps:
> 1. change the pom.xml with scala-2.12
> {code}
>  ./dev/change-scala-version.sh 2.12
> {code}
> 2. build with -Pscala-2.12
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> get the error
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.






[jira] [Updated] (SPARK-22659) remove sun.misc.Cleaner references

2017-11-29 Thread liyunzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang updated SPARK-22659:
---
Summary: remove sun.misc.Cleaner references  (was: Use 
{{scala.binary.version}} in the artifactId in the pom.xml of common/tags and 
streaming)

> remove sun.misc.Cleaner references
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes the build to fail with -Pscala-2.12.
> Suggest using {{scala.binary.version}} to solve this.






[jira] [Comment Edited] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272048#comment-16272048
 ] 

liyunzhang edited comment on SPARK-22659 at 11/30/17 2:31 AM:
--

I saw the script on d...@spark.apache.org and will try it, thanks!


was (Author: kellyzly):
where is the script?

> Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags 
> and streaming
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes the build to fail with -Pscala-2.12.
> Suggest using {{scala.binary.version}} to solve this.






[jira] [Commented] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming

2017-11-29 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272048#comment-16272048
 ] 

liyunzhang commented on SPARK-22659:


where is the script?

> Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags 
> and streaming
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes the build to fail with -Pscala-2.12.
> Suggest using {{scala.binary.version}} to solve this.






[jira] [Commented] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272045#comment-16272045
 ] 

Sean Owen commented on SPARK-22659:
---

You can't put vars in the artifact names in Maven.
Believe me, if it were that easy we would have done it that way.
The script I mentioned is the hack workaround.

> Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags 
> and streaming
> --
>
> Key: SPARK-22659
> URL: https://issues.apache.org/jira/browse/SPARK-22659
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>
> the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes the build to fail with -Pscala-2.12.
> Suggest using {{scala.binary.version}} to solve this.






[jira] [Created] (SPARK-22659) Use {{scala.binary.version}} in the artifactId in the pom.xml of common/tags and streaming

2017-11-29 Thread liyunzhang (JIRA)
liyunzhang created SPARK-22659:
--

 Summary: Use {{scala.binary.version}} in the artifactId in the 
pom.xml of common/tags and streaming
 Key: SPARK-22659
 URL: https://issues.apache.org/jira/browse/SPARK-22659
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.2.0
Reporter: liyunzhang


the artifactId of common/tags/pom.xml and streaming/pom.xml is spark-tags_2.11 and spark-streaming_2.11, which causes the build to fail with -Pscala-2.12.

Suggest using {{scala.binary.version}} to solve this.









[jira] [Commented] (SPARK-22630) Consolidate all configuration properties into one page

2017-11-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272033#comment-16272033
 ] 

Hyukjin Kwon commented on SPARK-22630:
--

+1 for ^.

> Consolidate all configuration properties into one page
> --
>
> Key: SPARK-22630
> URL: https://issues.apache.org/jira/browse/SPARK-22630
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.2.0
>Reporter: Andreas Maier
>
> The page https://spark.apache.org/docs/2.2.0/configuration.html gives the 
> impression as if all configuration properties of Spark are described on this 
> page. Unfortunately this is not true. The description of important properties 
> is spread through the documentation. The following pages list properties, 
> which are not described on the configuration page: 
> https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#performance-tuning
> https://spark.apache.org/docs/2.2.0/monitoring.html#spark-configuration-options
> https://spark.apache.org/docs/2.2.0/security.html#ssl-configuration
> https://spark.apache.org/docs/2.2.0/sparkr.html#starting-up-from-rstudio
> https://spark.apache.org/docs/2.2.0/running-on-yarn.html#spark-properties
> https://spark.apache.org/docs/2.2.0/running-on-mesos.html#configuration
> https://spark.apache.org/docs/2.2.0/spark-standalone.html#cluster-launch-scripts
> As a reader of the documentation I would like to have single central webpage 
> describing all Spark configuration properties. Alternatively it would be nice 
> to at least add links from the configuration page to the other pages of the 
> documentation, where configuration properties are described. 






[jira] [Commented] (SPARK-22656) Upgrade Arrow to 0.8.0

2017-11-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16272026#comment-16272026
 ] 

Hyukjin Kwon commented on SPARK-22656:
--

Hi [~zsxwing], seems a duplicate of SPARK-22324.

> Upgrade Arrow to 0.8.0
> --
>
> Key: SPARK-22656
> URL: https://issues.apache.org/jira/browse/SPARK-22656
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> Arrow 0.8.0 will upgrade Netty to 4.1.x and unblock SPARK-19552






[jira] [Assigned] (SPARK-22585) Url encoding of jar path expected?

2017-11-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-22585:


Assignee: Jakub Dubovsky

> Url encoding of jar path expected?
> --
>
> Key: SPARK-22585
> URL: https://issues.apache.org/jira/browse/SPARK-22585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jakub Dubovsky
>Assignee: Jakub Dubovsky
> Fix For: 2.3.0
>
>
> I am calling {code}sparkContext.addJar{code} method with path to a local jar 
> I want to add. Example:
> {code}/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar{code}.
>  As a result I get an exception saying
> {code}
> Failed to add 
> /home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar to Spark 
> environment. Stacktrace:
> java.io.FileNotFoundException: Jar 
> /home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar not found
> {code}
> Important part to notice here is that colon character is url encoded in path 
> I want to use but exception is complaining about path in decoded form. This 
> is caused by this line of code from implementation ([see 
> here|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L1833]):
> {code}
> case null | "file" => addJarFile(new File(uri.getPath))
> {code}
> It uses 
> [getPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getPath()]
>  method of 
> [java.net.URI|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html] 
> which url decodes the path. I believe method 
> [getRawPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getRawPath()]
>  should be used here which keeps path string in original form.
> I tend to see this as a bug since I want to use my dependencies resolved from 
> artifactory with port directly. Is there some specific reason for this or can 
> we fix this?
> Thanks
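A quick Scala illustration of the getPath vs getRawPath difference described above, using the same example path:
{code}
import java.net.URI

val uri = new URI("/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar")
println(uri.getPath)    // .../artifactory.com:443/path/to.jar   (percent-decoded)
println(uri.getRawPath) // .../artifactory.com%3A443/path/to.jar (original, encoded form)
{code}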






[jira] [Resolved] (SPARK-22585) Url encoding of jar path expected?

2017-11-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-22585.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19834
[https://github.com/apache/spark/pull/19834]

> Url encoding of jar path expected?
> --
>
> Key: SPARK-22585
> URL: https://issues.apache.org/jira/browse/SPARK-22585
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Jakub Dubovsky
> Fix For: 2.3.0
>
>
> I am calling {code}sparkContext.addJar{code} method with path to a local jar 
> I want to add. Example:
> {code}/home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar{code}.
>  As a result I get an exception saying
> {code}
> Failed to add 
> /home/me/.coursier/cache/v1/https/artifactory.com%3A443/path/to.jar to Spark 
> environment. Stacktrace:
> java.io.FileNotFoundException: Jar 
> /home/me/.coursier/cache/v1/https/artifactory.com:443/path/to.jar not found
> {code}
> Important part to notice here is that colon character is url encoded in path 
> I want to use but exception is complaining about path in decoded form. This 
> is caused by this line of code from implementation ([see 
> here|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L1833]):
> {code}
> case null | "file" => addJarFile(new File(uri.getPath))
> {code}
> It uses 
> [getPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getPath()]
>  method of 
> [java.net.URI|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html] 
> which url decodes the path. I believe method 
> [getRawPath|https://docs.oracle.com/javase/7/docs/api/java/net/URI.html#getRawPath()]
>  should be used here which keeps path string in original form.
> I tend to see this as a bug since I want to use my dependencies resolved from 
> artifactory with port directly. Is there some specific reason for this or can 
> we fix this?
> Thanks






[jira] [Commented] (SPARK-22373) Intermittent NullPointerException in org.codehaus.janino.IClass.isAssignableFrom

2017-11-29 Thread Leigh Klotz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271975#comment-16271975
 ] 

Leigh  Klotz commented on SPARK-22373:
--

[~mshen] Thank you. I've hand-upgraded janino and commons-compiler to 3.0.7, and changed no other dependencies. The NPE has not occurred, and I'm running further tests to make sure there are no other ill effects.
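For anyone wanting to reproduce that hand-upgrade in an sbt build, a hedged sketch of pinning the two artifacts (sbt 1.x syntax; adjust to your own build tool):
{code}
// build.sbt fragment: force Janino 3.0.7 without touching other dependencies
dependencyOverrides ++= Seq(
  "org.codehaus.janino" % "janino"           % "3.0.7",
  "org.codehaus.janino" % "commons-compiler" % "3.0.7"
)
{code}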


> Intermittent NullPointerException in 
> org.codehaus.janino.IClass.isAssignableFrom
> 
>
> Key: SPARK-22373
> URL: https://issues.apache.org/jira/browse/SPARK-22373
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
> Environment: Hortonworks distribution: HDP 2.6.2.0-205 , 
> /usr/hdp/current/spark2-client/jars/spark-core_2.11-2.1.1.2.6.2.0-205.jar
>Reporter: Dan Meany
>Priority: Minor
> Attachments: CodeGeneratorTester.scala, generated.java
>
>
> Very occasional and retry works.
> Full stack:
> 17/10/27 21:06:15 ERROR Executor: Exception in task 29.0 in stage 12.0 (TID 
> 758)
> java.lang.NullPointerException
>   at org.codehaus.janino.IClass.isAssignableFrom(IClass.java:569)
>   at 
> org.codehaus.janino.UnitCompiler.isWideningReferenceConvertible(UnitCompiler.java:10347)
>   at 
> org.codehaus.janino.UnitCompiler.isMethodInvocationConvertible(UnitCompiler.java:8636)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8427)
>   at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:8285)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8169)
>   at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8071)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4421)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:550)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> 

[jira] [Commented] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271944#comment-16271944
 ] 

Sean Owen commented on SPARK-22657:
---

I wouldn't have expected that to work. The user app classloader wouldn't be usable to Spark code. Are you saying there's an easy workaround though? Sure, if so, but I suspect there are other reasons this wouldn't work.

> Hadoop fs implementation classes are not loaded if they are part of the app 
> jar or other jar when --packages flag is used 
> --
>
> Key: SPARK-22657
> URL: https://issues.apache.org/jira/browse/SPARK-22657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Stavros Kontopoulos
>
> To reproduce this issue run:
> ./bin/spark-submit --master mesos://leader.mesos:5050 \
> --packages com.github.scopt:scopt_2.11:3.5.0 \
> --conf spark.cores.max=8 \
> --conf 
> spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
>  \
> --conf spark.mesos.executor.docker.forcePullImage=true \
> --class S3Job 
> http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
>  \
> --readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
> s3n://arand-sandbox-mesosphere/linecount.out
> within a container created with 
> mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 image
> You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
> for scheme: s3n"
> This can be reproduced with local[*] as well; no need to use Mesos, this is not a Mesos bug.
> The specific spark job used above can be found here: 
> https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
>   
> Can be built with sbt assembly in that dir.
> Using this code: https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb at the beginning of the main method...
> you get the following output : 
> https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
> (Use 
> http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
> to get the modified job)
> The job works fine if --packages is not used.
> The commit that introduced this issue is (before that things work as 
> expected):
> 5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800
> The exception comes from here: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311
> https://github.com/apache/spark/pull/18235/files, check line 950, this is 
> where a filesystem is first created.
> The Filesystem class is initialized there, before the main of the spark job 
> is launched... the reason is --packages logic uses hadoop libraries to 
> download files
> Maven resolution happens before the app jar and the resolved jars are added 
> to the classpath. So at that moment there is no s3n to add to the static map 
> when the Filesystem static members are first initialized and also filled due 
> to the first FileSystem instance created (SERVICE_FILE_SYSTEMS).
> Later in the spark job main where we try to access the s3n filesystem (create 
> a second filesystem) we get the exception (at this point the app jar has the 
> s3n implementation in it and it's on the class path but that scheme is not 
> loaded in the static map of the Filesystem class)... 
> hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the 
> problem is with the static map which is filled once and only once.
> That's why we see two prints of the map contents in the output(gist)  above 
> when --packages is used. The first print is before creating the s3n 
> filesystem. We use reflection there to get the static map's entries. When 
> --packages is not used that map is empty before creating the s3n filesystem 
> since up to that point the Filesystem class is not yet loaded by the 
> classloader.
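One workaround that may be worth trying (a hedged sketch, not something confirmed in this thread): point the scheme directly at its implementation class via configuration, since fs.<scheme>.impl is consulted before the service-loaded map:
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical workaround: declare the s3n implementation explicitly so that
// FileSystem.getFileSystemClass does not depend on the ServiceLoader scan that ran
// before the app jar was on the classpath.
val spark = SparkSession.builder()
  .appName("s3n-impl-workaround")
  .config("spark.hadoop.fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .getOrCreate()
{code}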



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22608.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19821
[https://github.com/apache/spark/pull/19821]

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.3.0
>
>
> Since several {{CodeGenenerator.splitExpression}} are used with 
> {{ctx.INPUT_ROW}}, it would be good to prepare APIs for this to avoid code 
> duplication.






[jira] [Assigned] (SPARK-22608) Avoid code duplication regarding CodeGeneration.splitExpressions()

2017-11-29 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22608:
---

Assignee: Kazuaki Ishizaki

> Avoid code duplication regarding CodeGeneration.splitExpressions()
> --
>
> Key: SPARK-22608
> URL: https://issues.apache.org/jira/browse/SPARK-22608
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.3.0
>
>
> Since several {{CodeGenenerator.splitExpression}} are used with 
> {{ctx.INPUT_ROW}}, it would be good to prepare APIs for this to avoid code 
> duplication.






[jira] [Commented] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271930#comment-16271930
 ] 

Sean Owen commented on SPARK-22658:
---

I don't see a strong reason this needs to be part of Spark. It shifts 
maintenance to the core project for not much of any gain. It also tends to 
bless a single deep-learning-on-Spark project among several. I would say 'no' 
to this, but instead focus on whatever changes in the core help support 
libraries like this (like the image representation SPIP recently)

> SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
> 
>
> Key: SPARK-22658
> URL: https://issues.apache.org/jira/browse/SPARK-22658
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Andy Feng
> Attachments: SPIP_ TensorFlowOnSpark.pdf
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> TensorFlowOnSpark (TFoS) was released at github for distributed TensorFlow 
> training and inference on Apache Spark clusters. TFoS is designed to:
> * Easily migrate all existing TensorFlow programs with minimum code change;
> * Support all TensorFlow functionalities: synchronous/asynchronous training, 
> model/data parallelism, inference and TensorBoard;
> * Easily integrate with your existing data processing pipelines (ex. Spark 
> SQL) and machine learning algorithms (ex. MLlib);
> * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
> Infiniband.
> We propose to merge TFoS into Apache Spark as a scalable deep learning 
> library to:
> * Make deep learning easy for Apache Spark community: Familiar pipeline API 
> for training and inference; Enable TensorFlow training/inference on existing 
> Spark clusters.
> * Further simplify data scientist experience: Ensure compatibility b/w Apache 
> Spark and TFoS; Reduce steps for installation.
> * Help Apache Spark evolutions on deep learning: Establish a design pattern 
> for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
> training/inference.






[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2017-11-29 Thread Andy Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Feng updated SPARK-22658:
--
Attachment: SPIP_ TensorFlowOnSpark.pdf

> SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
> 
>
> Key: SPARK-22658
> URL: https://issues.apache.org/jira/browse/SPARK-22658
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Andy Feng
> Attachments: SPIP_ TensorFlowOnSpark.pdf
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> TensorFlowOnSpark (TFoS) was released at github for distributed TensorFlow 
> training and inference on Apache Spark clusters. TFoS is designed to:
> * Easily migrate all existing TensorFlow programs with minimum code change;
> * Support all TensorFlow functionalities: synchronous/asynchronous training, 
> model/data parallelism, inference and TensorBoard;
> * Easily integrate with your existing data processing pipelines (ex. Spark 
> SQL) and machine learning algorithms (ex. MLlib);
> * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
> Infiniband.
> We propose to merge TFoS into Apache Spark as a scalable deep learning 
> library to:
> * Make deep learning easy for Apache Spark community: Familiar pipeline API 
> for training and inference; Enable TensorFlow training/inference on existing 
> Spark clusters.
> * Further simplify data scientist experience: Ensure compatibility b/w Apache 
> Spark and TFoS; Reduce steps for installation.
> * Help Apache Spark evolutions on deep learning: Establish a design pattern 
> for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
> training/inference.






[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2017-11-29 Thread Andy Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Feng updated SPARK-22658:
--
Description: 
TensorFlowOnSpark (TFoS) was released at github for distributed TensorFlow 
training and inference on Apache Spark clusters. TFoS is designed to:
* Easily migrate all existing TensorFlow programs with minimum code change;
* Support all TensorFlow functionalities: synchronous/asynchronous training, 
model/data parallelism, inference and TensorBoard;
* Easily integrate with your existing data processing pipelines (ex. Spark SQL) 
and machine learning algorithms (ex. MLlib);
* Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.

We propose to merge TFoS into Apache Spark as a scalable deep learning library 
to:
* Make deep learning easy for Apache Spark community: Familiar pipeline API for 
training and inference; Enable TensorFlow training/inference on existing Spark 
clusters.
* Further simplify data scientist experience: Ensure compatibility b/w Apache 
Spark and TFoS; Reduce steps for installation.
* Help Apache Spark evolutions on deep learning: Establish a design pattern for 
additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
training/inference.


  was:
SPIP: TeansorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

Authors: Lee Yang (Yahoo/Oath), Andrew Feng (Yahoo/Oath)
Background and Motivation
Deep learning has evolved significantly in recent years, and is often 
considered a desired mechanism to gain insight from massive amounts of data. 
TensorFlow is currently the most popular deep learning library, and has been 
adopted by many organizations to solve a variety of use cases. After 
TensorFlow’s initial publication, Google released an enhanced TensorFlow with 
distributed deep learning capabilities in April 2016. 

In Feburary 2017, TensorFlowOnSpark (TFoS) was released for distributed 
TensorFlow training and inference on Apache Spark clusters. TFoS is designed to:
Easily migrate all existing TensorFlow programs with minimum code change;
Support all TensorFlow functionalities: synchronous/asynchronous training, 
model/data parallelism, inference and TensorBoard;
Easily integrate with your existing data processing pipelines (ex. Spark SQL) 
and machine learning algorithms (ex. MLlib);
Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.

At Yahoo/Oath, TFoS has become the most popular deep learning framework for 
many types of mission critical use cases, many which use 10’s servers of CPU or 
GPU. Outside Yahoo, TFoS has generated interest from LinkedIn, Paytm Labs, Hops 
Hadoop, Cloudera, MapR and Google. TFoS has become a popular choice for 
distributed TensorFlow applications on Spark clusters. 

We propose to merge TFoS into Apache Spark as a scalable deep learning library 
to:
Make deep learning easy for Apache Spark community
Familiar pipeline API for training and inference
Enable TensorFlow training/inference on existing Spark clusters
Further simplify data scientist experience
Ensure compatibility b/w Apache Spark and TFoS
Reduce steps for installation
Help Apache Spark evolutions on deep learning
Establish a design pattern for additional frameworks (ex. Caffe, BigDL, CNTK) 
Structured streaming for DL training/inference
Target Personas
Data scientists
Data engineers
Library developers
Goals
Spark ML style API for distributed TensorFlow training and inference
Support all types of TensorFlow applications (ex. asynchronous learning, model 
parallelism) and functionalities (ex. TensorBoard)
Support all TensorFlow trained models to be used for scalable inference and 
transfer learning with ZERO custom code
Support all Spark schedulers, including standalone, YARN, and Mesos
Support TensorFlow 1.0 and later
Initially Python API only
Scala and Java API could be added for inference later 
Non-Goals
Deep learning frameworks beyond TensorFlow
Non-distributed TensorFlow applications on Apache Spark (ex. single node, or 
parallel execution for hyper-parameter search)
Proposed API Changes
Pipeline API: TFEstimator
model = TFEstimator(train_fn, tf_args) \
    .setInputMapping({"image": "placeholder_X",
                      "label": "placeholder_Y"}) \
    .setModelDir("my_model_checkpoints") \
    .setSteps(1) \
    .setEpochs(10) \
    .fit(training_data_frame)
TFEstimator is a Spark ML estimator which launches a TensorFlowOnSpark cluster 
for distributed training. Its constructor TFEstimator(train_fn, tf_args, 
export_fn) accepts the following arguments:
train_fn ... TensorFlow "main" function for training.
tf_args ... Dictionary of arguments specific to TensorFlow "main" function.
export_fn ... TensorFlow function for exporting a saved_model.

TFEstimator has a collection of parameters including
InputMapping … Mapping of input DataFrame column to input tensor
ModelDir … Path to save/load model checkpoints
ExportDir … Directory 

[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2017-11-29 Thread Andy Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Feng updated SPARK-22658:
--
Description: 
SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

Authors: Lee Yang (Yahoo/Oath), Andrew Feng (Yahoo/Oath)
Background and Motivation
Deep learning has evolved significantly in recent years, and is often 
considered a desired mechanism to gain insight from massive amounts of data. 
TensorFlow is currently the most popular deep learning library, and has been 
adopted by many organizations to solve a variety of use cases. After 
TensorFlow’s initial publication, Google released an enhanced TensorFlow with 
distributed deep learning capabilities in April 2016. 

In February 2017, TensorFlowOnSpark (TFoS) was released for distributed 
TensorFlow training and inference on Apache Spark clusters. TFoS is designed to:
Easily migrate all existing TensorFlow programs with minimum code change;
Support all TensorFlow functionalities: synchronous/asynchronous training, 
model/data parallelism, inference and TensorBoard;
Easily integrate with your existing data processing pipelines (ex. Spark SQL) 
and machine learning algorithms (ex. MLlib);
Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and Infiniband.

At Yahoo/Oath, TFoS has become the most popular deep learning framework for 
many types of mission-critical use cases, many of which use tens of servers with 
CPUs or GPUs. Outside Yahoo, TFoS has generated interest from LinkedIn, Paytm Labs, Hops 
Hadoop, Cloudera, MapR and Google. TFoS has become a popular choice for 
distributed TensorFlow applications on Spark clusters. 

We propose to merge TFoS into Apache Spark as a scalable deep learning library 
to:
Make deep learning easy for Apache Spark community
Familiar pipeline API for training and inference
Enable TensorFlow training/inference on existing Spark clusters
Further simplify data scientist experience
Ensure compatibility b/w Apache Spark and TFoS
Reduce steps for installation
Help Apache Spark evolution on deep learning
Establish a design pattern for additional frameworks (ex. Caffe, BigDL, CNTK) 
Structured streaming for DL training/inference
Target Personas
Data scientists
Data engineers
Library developers
Goals
Spark ML style API for distributed TensorFlow training and inference
Support all types of TensorFlow applications (ex. asynchronous learning, model 
parallelism) and functionalities (ex. TensorBoard)
Support all TensorFlow trained models to be used for scalable inference and 
transfer learning with ZERO custom code
Support all Spark schedulers, including standalone, YARN, and Mesos
Support TensorFlow 1.0 and later
Initially Python API only
Scala and Java API could be added for inference later 
Non-Goals
Deep learning frameworks beyond TensorFlow
Non-distributed TensorFlow applications on Apache Spark (ex. single node, or 
parallel execution for hyper-parameter search)
Proposed API Changes
Pipeline API: TFEstimator
model = TFEstimator(train_fn, tf_args) \
    .setInputMapping({"image": "placeholder_X",
                      "label": "placeholder_Y"}) \
    .setModelDir("my_model_checkpoints") \
    .setSteps(1) \
    .setEpochs(10) \
    .fit(training_data_frame)
TFEstimator is a Spark ML estimator which launches a TensorFlowOnSpark cluster 
for distributed training. Its constructor TFEstimator(train_fn, tf_args, 
export_fn) accepts the following arguments:
train_fn ... TensorFlow "main" function for training.
tf_args ... Dictionary of arguments specific to TensorFlow "main" function.
export_fn ... TensorFlow function for exporting a saved_model.

TFEstimator has a collection of parameters including
InputMapping … Mapping of input DataFrame column to input tensor
ModelDir … Path to save/load model checkpoints
ExportDir … Directory to export saved_model
BatchSize … Number of records per batch (default: 100)
ClusterSize … Number of nodes in the cluster (default: 1)
NumPS … Number of PS nodes in cluster (default: 0)
Readers … Number of reader/enqueue threads (default: 1)
Tensorboard … Boolean flag indicating whether to launch TensorBoard (default: false)
Steps … Maximum number of steps to train (default: 1000)
Epochs … Number of epochs to train (default: 1)
Protocol … Network protocol for Tensorflow (grpc|rdma) (default: grpc)
InputMode … Input data feeding mode (TENSORFLOW, SPARK) (default: SPARK)

TFEstimator.fit(dataset) trains a TensorFlow model based on the given training 
dataset. The training dataset is a Spark DataFrame with columns that will be 
mapped to TensorFlow tensors as specified by InputMapping parameter. 
TFEstimator.fit() returns a TFModel instance representing the trained model, 
backed on disk by a TensorFlow checkpoint or saved_model.
TensorFlow Training Application: train_fn(tf_args, TFContext)
The first argument of TFEstimator, train_fn, allows custom TensorFlow 
applications to be easily plugged into the Spark 

[jira] [Updated] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2017-11-29 Thread Andy Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Feng updated SPARK-22658:
--
Description: 
In February 2017, TensorFlowOnSpark (TFoS) was released for distributed 
TensorFlow training and inference on Apache Spark clusters. TFoS is designed to:
   * Easily migrate all existing TensorFlow programs with minimum code change;
   * Support all TensorFlow functionalities: synchronous/asynchronous training, 
model/data parallelism, inference and TensorBoard;
   * Easily integrate with your existing data processing pipelines (ex. Spark 
SQL) and machine learning algorithms (ex. MLlib);
   * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
Infiniband.

We propose to merge TFoS into Apache Spark as a scalable deep learning library 
to:
* Make deep learning easy for Apache Spark community:  Familiar pipeline API 
for training and inference; Enable TensorFlow training/inference on existing 
Spark clusters.
* Further simplify data scientist experience: Ensure compatibility b/w Apache 
Spark and TFoS; Reduce steps for installation.
* Help Apache Spark evolution on deep learning: Establish a design pattern for 
additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
training/inference.


  was:
In February 2017, TensorFlowOnSpark (TFoS) was released for distributed 
TensorFlow training and inference on Apache Spark clusters. TFoS is designed to:
   * Easily migrate all existing TensorFlow programs with minimum code change;
   * Support all TensorFlow functionalities: synchronous/asynchronous training, 
model/data parallelism, inference and TensorBoard;
   * Easily integrate with your existing data processing pipelines (ex. Spark 
SQL) and machine learning algorithms (ex. MLlib);
   * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
Infiniband.

We propose to merge TFoS into Apache Spark as a scalable deep learning library 
to:
* Make deep learning easy for Apache Spark community:  Familiar pipeline API 
for training and inference; Enable TensorFlow training/inference on existing 
Spark clusters.
* Further simplify data scientist experience: Ensure compatibility b/w Apache 
Spark and TFoS; 
Reduce steps for installation.
* Help Apache Spark evolution on deep learning: Establish a design pattern for 
additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
training/inference.



> SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark
> 
>
> Key: SPARK-22658
> URL: https://issues.apache.org/jira/browse/SPARK-22658
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Andy Feng
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In February 2017, TensorFlowOnSpark (TFoS) was released for distributed 
> TensorFlow training and inference on Apache Spark clusters. TFoS is designed 
> to:
>* Easily migrate all existing TensorFlow programs with minimum code change;
>* Support all TensorFlow functionalities: synchronous/asynchronous 
> training, model/data parallelism, inference and TensorBoard;
>* Easily integrate with your existing data processing pipelines (ex. Spark 
> SQL) and machine learning algorithms (ex. MLlib);
>* Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
> Infiniband.
> We propose to merge TFoS into Apache Spark as a scalable deep learning 
> library to:
> * Make deep learning easy for Apache Spark community:  Familiar pipeline API 
> for training and inference; Enable TensorFlow training/inference on existing 
> Spark clusters.
> * Further simplify data scientist experience: Ensure compatibility b/w Apache 
> Spark and TFoS; Reduce steps for installation.
> * Help Apache Spark evolution on deep learning: Establish a design pattern 
> for additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
> training/inference.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22658) SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib of Apache Spark

2017-11-29 Thread Andy Feng (JIRA)
Andy Feng created SPARK-22658:
-

 Summary: SPIP: TensorFlowOnSpark as a Scalable Deep Learning Lib 
of Apache Spark
 Key: SPARK-22658
 URL: https://issues.apache.org/jira/browse/SPARK-22658
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Andy Feng


In February 2017, TensorFlowOnSpark (TFoS) was released for distributed 
TensorFlow training and inference on Apache Spark clusters. TFoS is designed to:
   * Easily migrate all existing TensorFlow programs with minimum code change;
   * Support all TensorFlow functionalities: synchronous/asynchronous training, 
model/data parallelism, inference and TensorBoard;
   * Easily integrate with your existing data processing pipelines (ex. Spark 
SQL) and machine learning algorithms (ex. MLlib);
   * Be easily deployed on cloud or on-premise: CPU & GPU, Ethernet and 
Infiniband.

We propose to merge TFoS into Apache Spark as a scalable deep learning library 
to:
* Make deep learning easy for Apache Spark community:  Familiar pipeline API 
for training and inference; Enable TensorFlow training/inference on existing 
Spark clusters.
* Further simplify data scientist experience: Ensure compatibility b/w Apache 
Spark and TFoS; 
Reduce steps for installation.
* Help Apache Spark evolution on deep learning: Establish a design pattern for 
additional frameworks (ex. Caffe, CNTK); Structured streaming for DL 
training/inference.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this 
is not a Mesos bug.
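
For instance, a local[*] variant of the same submission (the Mesos and Docker 
settings dropped, everything else unchanged) should hit the same exception, since 
--packages is the part that triggers it:
{code}
./bin/spark-submit --master "local[*]" \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--class S3Job \
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl s3n://arand-sandbox-mesosphere/linecount.out
{code}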

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/skonto/4f5ff1e5ede864f90b323cc20bf1e1cb at the beginning 
of the main method...
you get the following output: 
https://gist.github.com/skonto/d22b8431586b6663ddd720e179030da4
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty before creating the s3n filesystem, since up to that 
point the FileSystem class has not yet been loaded by the classloader.
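
The output in the gist above was produced by dumping that map with reflection. A 
minimal sketch of that kind of probe (not the exact gist code, which is not 
reproduced here) could look like the following; the only assumption is the private 
field name SERVICE_FILE_SYSTEMS, taken from the Hadoop source linked above:
{code}
import org.apache.hadoop.fs.FileSystem

object FsDebug {
  // Print the schemes currently registered in FileSystem's private static map
  // (SERVICE_FILE_SYSTEMS), without going through FileSystem.get() ourselves.
  def dumpRegisteredFileSystems(): Unit = {
    val field = classOf[FileSystem].getDeclaredField("SERVICE_FILE_SYSTEMS")
    field.setAccessible(true)
    val registered = field.get(null).asInstanceOf[java.util.Map[String, Class[_]]]
    println(s"Registered FileSystem schemes: ${registered.keySet()}")
  }
}
{code}
This mirrors the two prints described above: with --packages the map is already 
populated (without s3n) before the job's first FileSystem call; without --packages 
it is still empty at that point.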

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well, no need to use mesos, this is 
not mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this 
is not a Mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty before creating the s3n filesystem, since up to that 
point the FileSystem class has not yet been loaded by the classloader.
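
A possible mitigation sketch (untested against this exact setup, and not a fix for 
the underlying loading order): declare the implementation class explicitly in the 
Hadoop configuration so that resolving the s3n scheme does not depend on the 
ServiceLoader-populated static map. It assumes the s3n implementation shipped in 
the app jar is org.apache.hadoop.fs.s3native.NativeS3FileSystem and that Hadoop 
consults fs.<scheme>.impl before the static map:
{code}
import org.apache.spark.sql.SparkSession

// Hypothetical workaround: point fs.s3n.impl at the implementation class directly.
// spark.hadoop.* properties are copied into the Hadoop Configuration used by the job.
val spark = SparkSession.builder()
  .appName("S3Job")
  .config("spark.hadoop.fs.s3n.impl",
          "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  .getOrCreate()
{code}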

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well, no need to use mesos, this is 
not mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this 
is not a Mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty, since the FileSystem class has not yet been loaded by the 
classloader.

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well, no need to use mesos, this is 
not mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this 
is not a Mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output: 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty, since the FileSystem class has not yet been loaded by the 
classloader.

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well, no need to use mesos, this is 
not mesos bug.

The specific spark job used above is[[ 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
 |  here ]].  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this 
is not a Mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output: 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty, since the FileSystem class has not yet been loaded by the 
classloader.

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well, no need to use mesos, this is 
not mesos bug.

The specific spark job used above can be found [here] 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*]; there is no need to use Mesos, and this 
is not a Mesos bug.

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output: 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty, since the FileSystem class has not yet been loaded by the 
classloader.

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*].

The specific spark job used above is[[ 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
 |  here ]].  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*].

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output: 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is (before that things work as expected):
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty, since the FileSystem class has not yet been loaded by the 
classloader.

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well.

The specific spark job used above is[[ 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
 |  here ]].  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*].

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output: 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5
(Use 
http://s3-eu-west-1.amazonaws.com/fdp-stavros-test/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 to get the modified job)
The job works fine if --packages is not used.

The commit that introduced this issue is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later, in the Spark job's main method, where we try to access the s3n filesystem 
(i.e. create a second filesystem), we get the exception: at this point the app jar 
contains the s3n implementation and it is on the classpath, but that scheme is not 
loaded in the static map of the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect, since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output (gist) above 
when --packages is used. The first print is before creating the s3n filesystem; 
we use reflection there to get the static map's entries. When --packages is not 
used, that map is empty, since the FileSystem class has not yet been loaded by the 
classloader.

  was:
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can be run reproduced with local[*] as well.

The specific spark job used above is[[ 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
 |  here ]].  
Can be built with sbt assembly in that dir.

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df add the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The job works fine if --packages is not used.

The commit that introduced this issue is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 

[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:

./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6 
image

You will get: "Exception in thread "main" java.io.IOException: No FileSystem 
for scheme: s3n"
This can also be reproduced with local[*].

The specific spark job used above can be found here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala
Can be built with sbt assembly in that dir.

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df at the 
beginning of the main method...
you get the following output : 
https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The job works fine if --packages is not used.

The commit that introduced this issue is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 [SPARK-21012][SUBMIT] Add glob support 
for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800

The exception comes from here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L3311

https://github.com/apache/spark/pull/18235/files, check line 950, this is where 
a filesystem is first created.

The FileSystem class is initialized there, before the main method of the Spark job 
is launched; the reason is that the --packages logic uses Hadoop libraries to 
download files.

Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled due to the 
first FileSystem instance created (SERVICE_FILE_SYSTEMS).

Later in the spark job main where we try to access the s3n filesystem (create a 
second filesystem) we get the exception (at this point the app jar has the s3n 
implementation in it and its on the class path but that scheme is not loaded in 
the static map of the Filesystem class)... 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the 
problem is with the static map which is filled once and only once.
That's why we see two prints of the map contents in the output(gist)  above 
when --packages is used. The first print is before creating the s3n filesystem. 
We use reflection there to get the static map's entries. When --packages is not 
used that map is empty since the Filesystem class is not yet loaded by the 
classloader.
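
For reference, a minimal sketch (Scala; the SERVICE_FILE_SYSTEMS field name 
comes from the FileSystem source linked above, the rest is illustrative) of the 
kind of reflection the gist uses to print the registered schemes:

{code}
import org.apache.hadoop.fs.FileSystem

// Peek at FileSystem's private static scheme -> implementation map via reflection.
// If --packages caused FileSystem to be initialized early, s3n will be missing here.
val field = classOf[FileSystem].getDeclaredField("SERVICE_FILE_SYSTEMS")
field.setAccessible(true)
val registered = field.get(null).asInstanceOf[java.util.Map[String, Class[_]]]
println(s"registered filesystem schemes: ${registered.keySet()}")
{code}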

  was:
To reproduce this issue run:
```
./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out
```
within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
You get: "Exception in thread "main" java.io.IOException: No FileSystem for 
scheme: s3n"
This can be reproduced with local[*] as well.

The specific Spark job used is here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df
You get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The commit that introduced this is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 - [SPARK-21012][SUBMIT] Add glob 
support for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800
https://github.com/apache/spark/pull/18235/files check line 950

The Filesystem class is initialized already before the main of the spark job is 
launched... the reason is --packages logic uses hadoop libraries to download 
files
 Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are 

[jira] [Commented] (SPARK-22646) Spark on Kubernetes - basic submission client

2017-11-29 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271818#comment-16271818
 ] 

Yinan Li commented on SPARK-22646:
--

The "Component/s" field should be updated.

> Spark on Kubernetes - basic submission client
> -
>
> Key: SPARK-22646
> URL: https://issues.apache.org/jira/browse/SPARK-22646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> The submission client is responsible for creating the Kubernetes pod that 
> runs the Spark driver. It is a set of client-side changes to enable the 
> scheduler backend.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-22657:

Description: 
To reproduce this issue run:
```
./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out
```
within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
You get: "Exception in thread "main" java.io.IOException: No FileSystem for 
scheme: s3n"
This can be reproduced with local[*] as well.

The specific Spark job used is here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df
You get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The commit that introduced this is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 - [SPARK-21012][SUBMIT] Add glob 
support for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800
https://github.com/apache/spark/pull/18235/files check line 950

The FileSystem class is already initialized before the main method of the 
Spark job is launched; the reason is that the --packages logic uses Hadoop 
libraries to download files.
Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n implementation to add to the 
static map (SERVICE_FILE_SYSTEMS), which is first initialized and filled when 
the first FileSystem instance is created.

Later, in the Spark job's main, where we try to access the s3n filesystem, we 
get the exception. At this point the app jar has the s3n implementation in it 
and it's on the classpath, but that scheme is not loaded into the static map of 
the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output above when 
--packages is used. The first print is before creating the s3n filesystem; we 
use reflection there to get the static map's entries, by the way. When 
--packages is not used, that map is empty since the FileSystem class has not 
yet been loaded by the classloader.

  was:
Reproduce run:
./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
You get: "Exception in thread "main" java.io.IOException: No FileSystem for 
scheme: s3n"
This can be reproduced with local[*] as well.

The specific Spark job used is here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala

Using this code : 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df
You get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The commit that introduced this is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 - [SPARK-21012][SUBMIT] Add glob 
support for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800
https://github.com/apache/spark/pull/18235/files check line 950

The Filesystem class is initialized already before the main of the spark job is 
launched... the reason is --packages logic uses hadoop libraries to download 
files
 Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n to add to the static map when 
the Filesystem static members are first initialized and also filled 
(SERVICE_FILE_SYSTEMS).

Later in the spark job main where we try to access the s3n filesystem we get 
the exception (at this point the app jar has the s3n implementation in it and 
its on the class path but that scheme is not loaded in the static map of the 
Filesystem class)... 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the 
problem is with the static map which is filled once and only once.
That's why we see two prints 

[jira] [Created] (SPARK-22657) Hadoop fs implementation classes are not loaded if they are part of the app jar or other jar when --packages flag is used

2017-11-29 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-22657:
---

 Summary: Hadoop fs implementation classes are not loaded if they 
are part of the app jar or other jar when --packages flag is used 
 Key: SPARK-22657
 URL: https://issues.apache.org/jira/browse/SPARK-22657
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Stavros Kontopoulos


To reproduce this issue, run:
./bin/spark-submit --master mesos://leader.mesos:5050 \
--packages com.github.scopt:scopt_2.11:3.5.0 \
--conf spark.cores.max=8 \
--conf 
spark.mesos.executor.docker.image=mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
 \
--conf spark.mesos.executor.docker.forcePullImage=true \
--class S3Job 
http://s3-us-west-2.amazonaws.com/arand-sandbox-mesosphere/dcos-spark-scala-tests-assembly-0.1-SNAPSHOT.jar
 \
--readUrl s3n://arand-sandbox-mesosphere/big.txt --writeUrl 
s3n://arand-sandbox-mesosphere/linecount.out

within a container created with mesosphere/spark:beta-2.1.1-2.2.0-2-hadoop-2.6
You get: "Exception in thread "main" java.io.IOException: No FileSystem for 
scheme: s3n"
This can be reproduced with local[*] as well.

The specific Spark job used is here: 
https://github.com/mesosphere/spark-build/blob/d5c50e9ae3b1438e0c4ba96ff9f36d5dafb6a466/tests/jobs/scala/src/main/scala/S3Job.scala

Using this code: 
https://gist.github.com/fdp-ci/564befd7747bc037bd6c7415e8d2e0df
You get: https://gist.github.com/fdp-ci/21ae1c415306200a877ee0b4ef805fc5

The commit that introduced this is:
5800144a54f5c0180ccf67392f32c3e8a51119b1 - [SPARK-21012][SUBMIT] Add glob 
support for resources adding to Spark (5 months ago) Thu, 6 Jul 2017 15:32:49 +0800
https://github.com/apache/spark/pull/18235/files check line 950

The FileSystem class is already initialized before the main method of the 
Spark job is launched; the reason is that the --packages logic uses Hadoop 
libraries to download files.
Maven resolution happens before the app jar and the resolved jars are added to 
the classpath. So at that moment there is no s3n implementation to add to the 
static map (SERVICE_FILE_SYSTEMS), which is first initialized and filled when 
the first FileSystem instance is created.

Later, in the Spark job's main, where we try to access the s3n filesystem, we 
get the exception. At this point the app jar has the s3n implementation in it 
and it's on the classpath, but that scheme is not loaded into the static map of 
the FileSystem class. 
hadoopConf.set("fs.s3n.impl.disable.cache", "true") has no effect since the 
problem is with the static map, which is filled once and only once.
That's why we see two prints of the map contents in the output above when 
--packages is used. The first print is before creating the s3n filesystem; we 
use reflection there to get the static map's entries, by the way. When 
--packages is not used, that map is empty since the FileSystem class has not 
yet been loaded by the classloader.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-11-29 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271813#comment-16271813
 ] 

Yinan Li commented on SPARK-22647:
--

Note: Some reference Dockerfiles are included in 
https://github.com/apache/spark/pull/19717.

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22656) Upgrade Arrow to 0.8.0

2017-11-29 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-22656:


 Summary: Upgrade Arrow to 0.8.0
 Key: SPARK-22656
 URL: https://issues.apache.org/jira/browse/SPARK-22656
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Shixiong Zhu


Arrow 0.8.0 will upgrade Netty to 4.1.x and unblock SPARK-19552



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20650) Remove JobProgressListener (and other unneeded classes)

2017-11-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20650.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19750
[https://github.com/apache/spark/pull/19750]

> Remove JobProgressListener (and other unneeded classes)
> ---
>
> Key: SPARK-20650
> URL: https://issues.apache.org/jira/browse/SPARK-20650
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
> Fix For: 2.3.0
>
>
> See spec in parent issue (SPARK-18085) for more details.
> This task tracks removing JobProgressListener and other classes that will be 
> made obsolete by the other changes in this project, and making adjustments to 
> parts of the code that still rely on them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

2017-11-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-18935.

   Resolution: Fixed
 Assignee: Stavros Kontopoulos
Fix Version/s: 2.3.0

> Use Mesos "Dynamic Reservation" resource for Spark
> --
>
> Key: SPARK-18935
> URL: https://issues.apache.org/jira/browse/SPARK-18935
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: jackyoh
>Assignee: Stavros Kontopoulos
> Fix For: 2.3.0
>
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, run Mesos resource reserve:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 1
> And the console will keep logging the same warning message as shown below: 
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271626#comment-16271626
 ] 

Apache Spark commented on SPARK-22655:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/19852

> Fail task instead of complete task silently in PythonRunner during shutdown
> ---
>
> Key: SPARK-22655
> URL: https://issues.apache.org/jira/browse/SPARK-22655
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Li Jin
>
> We have observed in our production environment that during Spark shutdown, if 
> there are some active tasks, sometimes they will complete with incorrect 
> results. We've tracked down the issue to a PythonRunner where it is returning 
> partial result instead of throwing exception during Spark shutdown. 
> I think the better way to handle this is to have these tasks fail instead of 
> complete with partial results (complete with partial is always bad IMHO)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22655:


Assignee: Apache Spark

> Fail task instead of complete task silently in PythonRunner during shutdown
> ---
>
> Key: SPARK-22655
> URL: https://issues.apache.org/jira/browse/SPARK-22655
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Li Jin
>Assignee: Apache Spark
>
> We have observed in our production environment that during Spark shutdown, if 
> there are some active tasks, sometimes they will complete with incorrect 
> results. We've tracked down the issue to a PythonRunner where it is returning 
> partial result instead of throwing exception during Spark shutdown. 
> I think the better way to handle this is to have these tasks fail instead of 
> complete with partial results (complete with partial is always bad IMHO)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22655:


Assignee: (was: Apache Spark)

> Fail task instead of complete task silently in PythonRunner during shutdown
> ---
>
> Key: SPARK-22655
> URL: https://issues.apache.org/jira/browse/SPARK-22655
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Li Jin
>
> We have observed in our production environment that during Spark shutdown, if 
> there are some active tasks, sometimes they will complete with incorrect 
> results. We've tracked down the issue to a PythonRunner where it is returning 
> partial result instead of throwing exception during Spark shutdown. 
> I think the better way to handle this is to have these tasks fail instead of 
> complete with partial results (complete with partial is always bad IMHO)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown

2017-11-29 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271620#comment-16271620
 ] 

Li Jin commented on SPARK-22655:


PR: https://github.com/apache/spark/pull/19852

> Fail task instead of complete task silently in PythonRunner during shutdown
> ---
>
> Key: SPARK-22655
> URL: https://issues.apache.org/jira/browse/SPARK-22655
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Li Jin
>
> We have observed in our production environment that during Spark shutdown, if 
> there are some active tasks, sometimes they will complete with incorrect 
> results. We've tracked down the issue to a PythonRunner where it is returning 
> partial result instead of throwing exception during Spark shutdown. 
> I think the better way to handle this is to have these tasks fail instead of 
> complete with partial results (complete with partial is always bad IMHO)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22655) Fail task instead of complete task silently in PythonRunner during shutdown

2017-11-29 Thread Li Jin (JIRA)
Li Jin created SPARK-22655:
--

 Summary: Fail task instead of complete task silently in 
PythonRunner during shutdown
 Key: SPARK-22655
 URL: https://issues.apache.org/jira/browse/SPARK-22655
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0, 2.1.0, 2.0.2
Reporter: Li Jin


We have observed in our production environment that during Spark shutdown, if 
there are some active tasks, sometimes they will complete with incorrect 
results. We've tracked down the issue to PythonRunner, where it returns a 
partial result instead of throwing an exception during Spark shutdown. 

I think the better way to handle this is to have these tasks fail instead of 
completing with partial results (completing with a partial result is always bad IMHO).
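
A minimal sketch of the behaviour argued for here (Scala; the names are 
illustrative and are not PythonRunner's actual fields):

{code}
import java.util.concurrent.atomic.AtomicBoolean

object FailOnShutdown {
  private val shuttingDown = new AtomicBoolean(false)
  sys.addShutdownHook { shuttingDown.set(true) }

  // Instead of quietly ending the stream with whatever has been read so far,
  // surface an error so the task is marked failed and can be retried.
  def nextBatch(read: () => Array[Byte]): Array[Byte] = {
    if (shuttingDown.get()) {
      throw new IllegalStateException(
        "shutdown in progress: failing the task instead of returning a partial result")
    }
    read()
  }
}
{code}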



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271365#comment-16271365
 ] 

Apache Spark commented on SPARK-22654:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/19851

> Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
> ---
>
> Key: SPARK-22654
> URL: https://issues.apache.org/jira/browse/SPARK-22654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> HiveExternalCatalogVersionsSuite has failed a few times apparently after 
> failing to download Spark tarballs from a particular mirror. This could be 
> mitigated with some retry logic, at least.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22654:


Assignee: Apache Spark  (was: Sean Owen)

> Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
> ---
>
> Key: SPARK-22654
> URL: https://issues.apache.org/jira/browse/SPARK-22654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> HiveExternalCatalogVersionsSuite has failed a few times apparently after 
> failing to download Spark tarballs from a particular mirror. This could be 
> mitigated with some retry logic, at least.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22654:


Assignee: Sean Owen  (was: Apache Spark)

> Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite
> ---
>
> Key: SPARK-22654
> URL: https://issues.apache.org/jira/browse/SPARK-22654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>
> HiveExternalCatalogVersionsSuite has failed a few times apparently after 
> failing to download Spark tarballs from a particular mirror. This could be 
> mitigated with some retry logic, at least.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22654) Retry download of Spark from ASF mirror in HiveExternalCatalogVersionsSuite

2017-11-29 Thread Sean Owen (JIRA)
Sean Owen created SPARK-22654:
-

 Summary: Retry download of Spark from ASF mirror in 
HiveExternalCatalogVersionsSuite
 Key: SPARK-22654
 URL: https://issues.apache.org/jira/browse/SPARK-22654
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.3.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor


HiveExternalCatalogVersionsSuite has failed a few times apparently after 
failing to download Spark tarballs from a particular mirror. This could be 
mitigated with some retry logic, at least.
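
A minimal sketch of the kind of retry wrapper this suggests (Scala; illustrative 
only, not the actual change in the suite):

{code}
import scala.util.{Failure, Success, Try}

// Retry a download up to maxAttempts times before giving up and rethrowing.
def withRetries[T](maxAttempts: Int)(download: => T): T = {
  var attempt = 0
  var last: Throwable = null
  while (attempt < maxAttempts) {
    attempt += 1
    Try(download) match {
      case Success(result) => return result
      case Failure(e) =>
        last = e
        println(s"download attempt $attempt of $maxAttempts failed: ${e.getMessage}")
    }
  }
  throw last
}
{code}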



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271254#comment-16271254
 ] 

Apache Spark commented on SPARK-22653:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/19850

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap 
> is null
> ---
>
> Key: SPARK-22653
> URL: https://issues.apache.org/jira/browse/SPARK-22653
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>
> In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address 
> (executorRef.address) can be null.
>  val data = new ExecutorData(executorRef, executorRef.address, hostname,
> cores, cores, logUrls)
> At this point the executorRef.address can be null, there is actually code 
> above it that handles this case:
>  // If the executor's rpc env is not listening for incoming connections, 
> `hostPort`
>   // will be null, and the client connection should be used to 
> contact the executor.
>   val executorAddress = if (executorRef.address != null) {
>   executorRef.address
> } else {
>   context.senderAddress
> }
> But it doesn't use executorAddress when it creates the ExecutorData.
> This causes removeExecutor to never remove it properly from 
> addressToExecutorId.
> addressToExecutorId -= executorInfo.executorAddress
> This is also a memory leak and can also call onDisconnected to call 
> disableExecutor when it shouldn't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22653:


Assignee: Apache Spark

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap 
> is null
> ---
>
> Key: SPARK-22653
> URL: https://issues.apache.org/jira/browse/SPARK-22653
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address 
> (executorRef.address) can be null.
>  val data = new ExecutorData(executorRef, executorRef.address, hostname,
> cores, cores, logUrls)
> At this point the executorRef.address can be null, there is actually code 
> above it that handles this case:
>  // If the executor's rpc env is not listening for incoming connections, 
> `hostPort`
>   // will be null, and the client connection should be used to 
> contact the executor.
>   val executorAddress = if (executorRef.address != null) {
>   executorRef.address
> } else {
>   context.senderAddress
> }
> But it doesn't use executorAddress when it creates the ExecutorData.
> This causes removeExecutor to never remove it properly from 
> addressToExecutorId.
> addressToExecutorId -= executorInfo.executorAddress
> This is also a memory leak and can also call onDisconnected to call 
> disableExecutor when it shouldn't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22653:


Assignee: (was: Apache Spark)

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap 
> is null
> ---
>
> Key: SPARK-22653
> URL: https://issues.apache.org/jira/browse/SPARK-22653
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>
> In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address 
> (executorRef.address) can be null.
>  val data = new ExecutorData(executorRef, executorRef.address, hostname,
> cores, cores, logUrls)
> At this point the executorRef.address can be null, there is actually code 
> above it that handles this case:
>  // If the executor's rpc env is not listening for incoming connections, 
> `hostPort`
>   // will be null, and the client connection should be used to 
> contact the executor.
>   val executorAddress = if (executorRef.address != null) {
>   executorRef.address
> } else {
>   context.senderAddress
> }
> But it doesn't use executorAddress when it creates the ExecutorData.
> This causes removeExecutor to never remove it properly from 
> addressToExecutorId.
> addressToExecutorId -= executorInfo.executorAddress
> This is also a memory leak and can also call onDisconnected to call 
> disableExecutor when it shouldn't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271244#comment-16271244
 ] 

Apache Spark commented on SPARK-22653:
--

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/19849

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap 
> is null
> ---
>
> Key: SPARK-22653
> URL: https://issues.apache.org/jira/browse/SPARK-22653
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>
> In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address 
> (executorRef.address) can be null.
>  val data = new ExecutorData(executorRef, executorRef.address, hostname,
> cores, cores, logUrls)
> At this point the executorRef.address can be null, there is actually code 
> above it that handles this case:
>  // If the executor's rpc env is not listening for incoming connections, 
> `hostPort`
>   // will be null, and the client connection should be used to 
> contact the executor.
>   val executorAddress = if (executorRef.address != null) {
>   executorRef.address
> } else {
>   context.senderAddress
> }
> But it doesn't use executorAddress when it creates the ExecutorData.
> This causes removeExecutor to never remove it properly from 
> addressToExecutorId.
> addressToExecutorId -= executorInfo.executorAddress
> This is also a memory leak and can also call onDisconnected to call 
> disableExecutor when it shouldn't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22599) Avoid extra reading for cached table

2017-11-29 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-22599:

Description: 
In the current implementation of Spark, InMemoryTableExec reads all data in a 
cached table, filters CachedBatches according to stats and passes data to the 
downstream operators. This implementation makes it inefficient to keep the 
whole table in memory in order to serve various queries against different 
partitions of the table, which covers a certain portion of our users' scenarios.

The following is an example of such a use case:

store_sales is a 1TB-sized table in cloud storage, which is partitioned by 
'location'. The first query, Q1, wants to output several metrics A, B, C for 
all stores in all locations. After that, a small team of 3 data scientists 
wants to do some causal analysis for the sales in different locations. To avoid 
unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole 
table in memory in Q1.

With the current implementation, even if any one of the data scientists is 
only interested in one out of the three locations, the queries they submit to 
the Spark cluster still read the full 1TB of data.

The reason behind the extra reading operation is that we implement CachedBatch 
as

{code}
case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: 
InternalRow)
{code}

where "stats" is a part of every CachedBatch, so we can only filter batches for 
output of InMemoryTableExec operator by reading all data in in-memory table as 
input. The extra reading would be even more unacceptable when some of the 
table's data is evicted to disks.

We propose to introduce a new type of block, a metadata block, for the 
partitions of the RDD representing data in the cached table. Every metadata 
block contains stats info for all columns in a partition and is saved to the 
BlockManager when executing the compute() method for the partition. To minimize 
the number of bytes to read,

More details can be found in the design doc: 
https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing

performance test results:

Environment: 6 executors, each with 16 cores and 90G of memory

dataset: 1T TPCDS data

queries: tested 4 queries (Q19, Q46, Q34, Q27) in 
https://github.com/databricks/spark-sql-perf/blob/c2224f37e50628c5c8691be69414ec7f5a3d919a/src/main/scala/com/databricks/spark/sql/perf/tpcds/ImpalaKitQueries.scala

results: 
https://docs.google.com/spreadsheets/d/1A20LxqZzAxMjW7ptAJZF4hMBaHxKGk3TBEQoAJXfzCI/edit?usp=sharing
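
A minimal sketch of the shape such a metadata block could take (the names and 
fields below are illustrative, not taken from the design doc):

{code}
// Per-partition metadata kept as its own block, so stats can be consulted
// without fetching the much larger column buffers of the partition.
case class ColumnStats(min: Any, max: Any, nullCount: Long)

case class CachedPartitionMetadata(
    partitionId: Int,
    numRows: Long,
    columnStats: Seq[ColumnStats])   // one entry per cached column
{code}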

  was:
In the current implementation of Spark, InMemoryTableExec read all data in a 
cached table, filter CachedBatch according to stats and pass data to the 
downstream operators. This implementation makes it inefficient to reside the 
whole table in memory to serve various queries against different partitions of 
the table, which occupies a certain portion of our users' scenarios.

The following is an example of such a use case:

store_sales is a 1TB-sized table in cloud storage, which is partitioned by 
'location'. The first query, Q1, wants to output several metrics A, B, C for 
all stores in all locations. After that, a small team of 3 data scientists 
wants to do some causal analysis for the sales in different locations. To avoid 
unnecessary I/O and parquet/orc parsing overhead, they want to cache the whole 
table in memory in Q1.

With the current implementation, even any one of the data scientists is only 
interested in one out of three locations, the queries they submit to Spark 
cluster is still reading 1TB data completely.

The reason behind the extra reading operation is that we implement CachedBatch 
as

{code}
case class CachedBatch(numRows: Int, buffers: Array[Array[Byte]], stats: 
InternalRow)
{code}

where "stats" is a part of every CachedBatch, so we can only filter batches for 
output of InMemoryTableExec operator by reading all data in in-memory table as 
input. The extra reading would be even more unacceptable when some of the 
table's data is evicted to disks.

We propose to introduce a new type of block, metadata block, for the partitions 
of RDD representing data in the cached table. Every metadata block contains 
stats info for all columns in a partition and is saved to BlockManager when 
executing compute() method for the partition. To minimize the number of bytes 
to read,

More details can be found in design 
doc:https://docs.google.com/document/d/1DSiP3ej7Wd2cWUPVrgqAtvxbSlu5_1ZZB6m_2t8_95Q/edit?usp=sharing




> Avoid extra reading for cached table
> 
>
> Key: SPARK-22599
> URL: https://issues.apache.org/jira/browse/SPARK-22599
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Nan Zhu
>
> In the current implementation of Spark, InMemoryTableExec read 

[jira] [Commented] (SPARK-22162) Executors and the driver use inconsistent Job IDs during the new RDD commit protocol

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271154#comment-16271154
 ] 

Apache Spark commented on SPARK-22162:
--

User 'rezasafi' has created a pull request for this issue:
https://github.com/apache/spark/pull/19848

> Executors and the driver use inconsistent Job IDs during the new RDD commit 
> protocol
> 
>
> Key: SPARK-22162
> URL: https://issues.apache.org/jira/browse/SPARK-22162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Reza Safi
>
> After the SPARK-18191 commit in pull request 15769, using the new commit 
> protocol it is possible that the driver and executors use different jobIds 
> during an RDD commit.
> In the old code, the variable stageId is part of the closure used to define 
> the task as you can see here:
>  
> [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1098]
> As a result, a TaskAttemptId is constructed in executors using the same 
> "stageId" as the driver, since it is a value that is serialized in the 
> driver. Also the value of stageID is actually the rdd.id which is assigned 
> here: 
> [https://github.com/apache/spark/blob/9c8deef64efee20a0ddc9b612f90e77c80aede60/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1084]
> However, after the change in pull request 15769, the value is no longer part 
> of the task closure, which gets serialized by the driver. Instead, it is 
> pulled from the taskContext as you can see 
> here:[https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R103]
> and then that value is used to construct the TaskAttemptId on the executors: 
> [https://github.com/apache/spark/pull/15769/files#diff-dff185cb90c666bce445e3212a21d765R134]
> taskContext has a stageId value which is set in DAGScheduler. So after the 
> change, unlike the old code where the rdd.id was used, an actual stage.id is 
> used, which can differ between executors and the driver since it is no longer 
> serialized.
> In summary, the old code consistently used rddId, and just incorrectly named 
> it "stageId".
> The new code uses a mix of rddId and stageId. There should be a consistent ID 
> between executors and the driver.
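
A small self-contained illustration of the difference described above (this is 
not the commit-protocol code itself; it only shows that a value captured in the 
closure is fixed on the driver, while a value read from TaskContext is resolved 
on the executor):

{code}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("id-demo"))
val rdd = sc.parallelize(1 to 4, 2)

// Captured on the driver and shipped with the closure: every task sees the same value.
val capturedRddId = rdd.id

val ids = rdd.mapPartitions { _ =>
  // Resolved on the executor at run time: depends on the stage actually executing.
  Iterator((capturedRddId, TaskContext.get().stageId()))
}.collect()

println(ids.toSeq)  // the two numbers need not match, which is the inconsistency above
{code}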



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22615) Handle more cases in PropagateEmptyRelation

2017-11-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22615.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> Handle more cases in PropagateEmptyRelation 
> 
>
> Key: SPARK-22615
> URL: https://issues.apache.org/jira/browse/SPARK-22615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
> Fix For: 2.3.0
>
>
> Currently, in the optimizer rule `PropagateEmptyRelation`, the following 
> cases are not handled:
> 1. empty relation as right child in left outer join
> 2. empty relation as left child in right outer join
> 3. empty relation as right child in left semi join
> 4. empty relation as right child in left anti join
> case #1 and #2 can be treated as Cartesian product and cause exception.
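
For case #1 above, a small repro sketch (column names and data are made up):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("empty-relation").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "v")
val right = Seq.empty[(Int, String)].toDF("id", "w")  // empty relation as the right child

// Expected: the left rows with nulls for the right-side columns; the reported
// symptom is that shapes like this could be rewritten into a Cartesian product
// and throw instead.
left.join(right, Seq("id"), "left_outer").show()
{code}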



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271085#comment-16271085
 ] 

Apache Spark commented on SPARK-22641:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/19680

> Pyspark UDF relying on column added with withColumn after distinct
> --
>
> Key: SPARK-22641
> URL: https://issues.apache.org/jira/browse/SPARK-22641
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andrew Duffy
>
> We seem to have found an issue with PySpark UDFs interacting with 
> {{withColumn}} when the UDF depends on the column added in {{withColumn}}, 
> but _only_ if {{withColumn}} is performed after a {{distinct()}}.
> Simplest repro in a local PySpark shell:
> {code}
> import pyspark.sql.functions as F
> @F.udf
> def ident(x):
> return x
> spark.createDataFrame([{'a': '1'}]) \
> .distinct() \
> .withColumn('b', F.lit('qq')) \
> .withColumn('fails_here', ident('b')) \
> .collect()
> {code}
> This fails with the following exception:
> {code}
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF0#13
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422)
> at 
> 

[jira] [Assigned] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22641:


Assignee: Apache Spark

> Pyspark UDF relying on column added with withColumn after distinct
> --
>
> Key: SPARK-22641
> URL: https://issues.apache.org/jira/browse/SPARK-22641
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andrew Duffy
>Assignee: Apache Spark
>
> We seem to have found an issue with PySpark UDFs interacting with 
> {{withColumn}} when the UDF depends on the column added in {{withColumn}}, 
> but _only_ if {{withColumn}} is performed after a {{distinct()}}.
> Simplest repro in a local PySpark shell:
> {code}
> import pyspark.sql.functions as F
> @F.udf
> def ident(x):
> return x
> spark.createDataFrame([{'a': '1'}]) \
> .distinct() \
> .withColumn('b', F.lit('qq')) \
> .withColumn('fails_here', ident('b')) \
> .collect()
> {code}
> This fails with the following exception:
> {code}
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF0#13
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422)
> at 
> 

[jira] [Assigned] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22641:


Assignee: (was: Apache Spark)

> Pyspark UDF relying on column added with withColumn after distinct
> --
>
> Key: SPARK-22641
> URL: https://issues.apache.org/jira/browse/SPARK-22641
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Andrew Duffy
>
> We seem to have found an issue with PySpark UDFs interacting with 
> {{withColumn}} when the UDF depends on the column added in {{withColumn}}, 
> but _only_ if {{withColumn}} is performed after a {{distinct()}}.
> Simplest repro in a local PySpark shell:
> {code}
> import pyspark.sql.functions as F
> @F.udf
> def ident(x):
> return x
> spark.createDataFrame([{'a': '1'}]) \
> .distinct() \
> .withColumn('b', F.lit('qq')) \
> .withColumn('fails_here', ident('b')) \
> .collect()
> {code}
> This fails with the following exception:
> {code}
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: pythonUDF0#13
> at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
> at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:285)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
> at 
> org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
> at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
> at 
> 

[jira] [Commented] (SPARK-22625) Properly cleanup inheritable thread-locals

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271041#comment-16271041
 ] 

Sean Owen commented on SPARK-22625:
---

I don't have any additional info for you. You can propose a PR. The issue is in 
part caused by a third party library creating threads though. If it's a clean 
improvement to Spark, OK, but not really something to 'work around'.

> Properly cleanup inheritable thread-locals
> --
>
> Key: SPARK-22625
> URL: https://issues.apache.org/jira/browse/SPARK-22625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Tolstopyatov Vsevolod
>  Labels: leak
>
> A memory leak is present due to inherited thread locals; SPARK-20558 didn't 
> fix it properly.
> Our production application has the following logic: one thread reads 
> from HDFS and another one creates a SparkContext, processes HDFS files and 
> then closes it on a regular schedule.
> Depending on which thread started first, the SparkContext thread local may or may 
> not be inherited by the HDFS daemon (DataStreamer), causing a memory leak when 
> the streamer was created after the SparkContext. Memory consumption increases every 
> time a new SparkContext is created; related YourKit paths: 
> https://screencast.com/t/tgFBYMEpW
> The problem is more general and is not related to HDFS in particular.
> Proper fix: register all cloned properties (in `localProperties#childValue`) 
> in a ConcurrentHashMap and forcefully clear all of them in `SparkContext#close`.
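
A minimal, self-contained sketch of the bookkeeping proposed above (illustrative 
names, not the actual SparkContext internals): childValue registers every cloned 
Properties object in a concurrent set so that a stop/close hook can clear them 
all, including copies inherited by threads the context never observes again.

{code}
import java.util.Properties
import java.util.concurrent.ConcurrentHashMap
import scala.collection.JavaConverters._

object TrackedLocalProperties {
  // every Properties clone handed to a child thread is registered here
  private val registered = ConcurrentHashMap.newKeySet[Properties]()

  val localProperties: InheritableThreadLocal[Properties] =
    new InheritableThreadLocal[Properties] {
      override def initialValue(): Properties = new Properties()
      override def childValue(parent: Properties): Properties = {
        val child = new Properties()
        parent.stringPropertyNames().asScala
          .foreach(k => child.setProperty(k, parent.getProperty(k)))
        registered.add(child)
        child
      }
    }

  // would be invoked from SparkContext#stop in the proposal
  def clearAll(): Unit = {
    registered.asScala.foreach(_.clear())
    registered.clear()
  }
}
{code}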



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22625) Properly cleanup inheritable thread-locals

2017-11-29 Thread Tolstopyatov Vsevolod (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271038#comment-16271038
 ] 

Tolstopyatov Vsevolod commented on SPARK-22625:
---

Ping [~srowen]

> Properly cleanup inheritable thread-locals
> --
>
> Key: SPARK-22625
> URL: https://issues.apache.org/jira/browse/SPARK-22625
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Tolstopyatov Vsevolod
>  Labels: leak
>
> A memory leak is present due to inherited thread locals; SPARK-20558 didn't 
> fix it properly.
> Our production application has the following logic: one thread reads 
> from HDFS and another one creates a SparkContext, processes HDFS files and 
> then closes it on a regular schedule.
> Depending on which thread started first, the SparkContext thread local may or may 
> not be inherited by the HDFS daemon (DataStreamer), causing a memory leak when 
> the streamer was created after the SparkContext. Memory consumption increases every 
> time a new SparkContext is created; related YourKit paths: 
> https://screencast.com/t/tgFBYMEpW
> The problem is more general and is not related to HDFS in particular.
> Proper fix: register all cloned properties (in `localProperties#childValue`) 
> in a ConcurrentHashMap and forcefully clear all of them in `SparkContext#close`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22622) OutOfMemory thrown by Closure Serializer without proper failure propagation

2017-11-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22622.
---
Resolution: Duplicate

It looks like, nevertheless, something huge is in your closure. It's probably 
not the data you think you're broadcasting. This is then at best part of 
SPARK-6235.

This error would kill the driver process, unless you are trying to recover it. 
You'd have to post more detail about what you see.
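
For reference, a minimal sketch of the distinction above, with a hypothetical 
lookup map standing in for the "huge" object (runnable locally with Spark on the 
classpath): data referenced directly from a lambda travels through the closure 
serializer, while a broadcast variable is shipped once and tasks only carry a 
small handle.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVsClosure {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-vs-closure").setMaster("local[*]"))
    val lookup = (1 to 1000000).map(i => i -> i.toString).toMap

    // referenced directly: the whole map is serialized as part of the closure
    val viaClosure = sc.parallelize(1 to 10).map(i => lookup.getOrElse(i, ""))

    // broadcast once: the closure only holds a lightweight broadcast reference
    val bLookup = sc.broadcast(lookup)
    val viaBroadcast = sc.parallelize(1 to 10).map(i => bLookup.value.getOrElse(i, ""))

    println(viaClosure.count() + viaBroadcast.count())
    sc.stop()
  }
}
{code}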

> OutOfMemory thrown by Closure Serializer without proper failure propagation
> ---
>
> Key: SPARK-22622
> URL: https://issues.apache.org/jira/browse/SPARK-22622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Spark 2.2.0
> Hadoop 2.9.0
>Reporter: Raghavendra
>Priority: Critical
>
> While moving from one stage to another, the closure serializer tries to 
> serialize the closures and throws OOMs.
>  This happens when the RDD size crosses 70 GB. 
> I set the driver memory to 225 GB and yet the error persists.
>  There are two issues here:
> * An OOM is thrown even though the driver memory provided is almost 3 times 
> the size of the last stage's RDD (I even tried caching it to disk before 
> moving it into the current stage).
> * After the error is thrown, the Spark job does not exit; it just continues 
> in the same state without propagating the error into the Spark UI.
> *Scenario 1*
> {color:red}Exception in thread "dag-scheduler-event-loop" 
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>   at java.util.Arrays.copyOf(Arrays.java:3236)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1677)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {color}
> *Scenario 2*
> {color:red}
>Exception in thread "dag-scheduler-event-loop" 
> java.lang.OutOfMemoryError
>   at 
> java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123)
>   at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:117)
>   at 
> java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
>   at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
>   at 
> org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
>   at 
> java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1003)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:930)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:874)
>   at 
> 

[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()

2017-11-29 Thread Sergio Lob (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271013#comment-16271013
 ] 

Sergio Lob commented on SPARK-22636:


OK

> row count not being set correctly (always 0) after Statement.executeUpdate()
> 
>
> Key: SPARK-22636
> URL: https://issues.apache.org/jira/browse/SPARK-22636
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 
> 11:16:57 EDT 2014 x86_64 x
> 86_64 x86_64 GNU/Linux
>Reporter: Sergio Lob
>Priority: Minor
>
> This is a similar complaint to HIVE-8244.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22634) Update Bouncy castle dependency

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271012#comment-16271012
 ] 

Sean Owen commented on SPARK-22634:
---

Right, I see core depends on jets3t, so it's not just provided by Hadoop.
Looks reasonable, but maybe worth bumping jets3t to 0.9.4 as well; it fixes 
some bugs and also ups the version of bouncy castle it wants, which may avoid 
problems with bumping bouncy castle yet further.

> Update Bouncy castle dependency
> ---
>
> Key: SPARK-22634
> URL: https://issues.apache.org/jira/browse/SPARK-22634
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, SQL, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Lior Regev
>Priority: Minor
>
> Spark's usage of the jets3t library, as well as Spark's own Flume and Kafka 
> streaming, uses Bouncy Castle version 1.51.
> This is an outdated version, as the latest one is 1.58.
> This, in turn, renders packages such as 
> [spark-hadoopcryptoledger-ds|https://github.com/ZuInnoTe/spark-hadoopcryptoledger-ds]
>  unusable, since these require 1.58 and Spark's distributions come along with 
> 1.51.
> My own attempt was to run on EMR, and since I automatically get all of 
> Spark's dependencies (Bouncy Castle 1.51 being one of them) on the 
> classpath, using the library to parse blockchain data failed due to missing 
> functionality.
> I have also opened an 
> [issue|https://bitbucket.org/jmurty/jets3t/issues/242/bouncycastle-dependency]
>  with jets3t to update their dependency as well, but along with that Spark 
> would have to update its own or at least be packaged with a newer version.
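
For downstream builds hitting this, a hedged sketch of a build-side workaround 
(assumes sbt and the bcprov-jdk15on artifact; on EMR, where 1.51 is already on 
Spark's classpath, spark.driver.userClassPathFirst / 
spark.executor.userClassPathFirst or shading may still be needed):

{code}
// minimal build.sbt fragment (illustrative name and versions)
name := "bc-override-example"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.2.0" % "provided",
  "org.bouncycastle" % "bcprov-jdk15on" % "1.58"
)

// pin the transitively pulled 1.51 to the newer release
dependencyOverrides += "org.bouncycastle" % "bcprov-jdk15on" % "1.58"
{code}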



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22646) Spark on Kubernetes - basic submission client

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22646:


Assignee: (was: Apache Spark)

> Spark on Kubernetes - basic submission client
> -
>
> Key: SPARK-22646
> URL: https://issues.apache.org/jira/browse/SPARK-22646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> The submission client is responsible for creating the Kubernetes pod that 
> runs the Spark driver. It is a set of client-side changes to enable the 
> scheduler backend.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22646) Spark on Kubernetes - basic submission client

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16271011#comment-16271011
 ] 

Apache Spark commented on SPARK-22646:
--

User 'liyinan926' has created a pull request for this issue:
https://github.com/apache/spark/pull/19717

> Spark on Kubernetes - basic submission client
> -
>
> Key: SPARK-22646
> URL: https://issues.apache.org/jira/browse/SPARK-22646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> The submission client is responsible for creating the Kubernetes pod that 
> runs the Spark driver. It is a set of client-side changes to enable the 
> scheduler backend.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22646) Spark on Kubernetes - basic submission client

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22646:


Assignee: Apache Spark

> Spark on Kubernetes - basic submission client
> -
>
> Key: SPARK-22646
> URL: https://issues.apache.org/jira/browse/SPARK-22646
> Project: Spark
>  Issue Type: Sub-task
>  Components: Scheduler
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Apache Spark
>
> The submission client is responsible for creating the Kubernetes pod that 
> runs the Spark driver. It is a set of client-side changes to enable the 
> scheduler backend.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct

2017-11-29 Thread Andrew Duffy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Duffy updated SPARK-22641:
-
Description: 
We seem to have found an issue with PySpark UDFs interacting with 
{{withColumn}} when the UDF depends on the column added in {{withColumn}}, but 
_only_ if {{withColumn}} is performed after a {{distinct()}}.

Simplest repro in a local PySpark shell:

{code}
import pyspark.sql.functions as F

@F.udf
def ident(x):
    return x

spark.createDataFrame([{'a': '1'}]) \
    .distinct() \
    .withColumn('b', F.lit('qq')) \
    .withColumn('fails_here', ident('b')) \
    .collect()
{code}

This fails with the following exception:

{code}
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: pythonUDF0#13
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:233)
at 

[jira] [Commented] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null

2017-11-29 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270977#comment-16270977
 ] 

Thomas Graves commented on SPARK-22653:
---

Will have a patch up shortly.

> executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap 
> is null
> ---
>
> Key: SPARK-22653
> URL: https://issues.apache.org/jira/browse/SPARK-22653
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.2.0
>Reporter: Thomas Graves
>
> In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address 
> (executorRef.address) can be null.
>  val data = new ExecutorData(executorRef, executorRef.address, hostname,
> cores, cores, logUrls)
> At this point executorRef.address can be null; there is actually code 
> above it that handles this case:
>  // If the executor's rpc env is not listening for incoming connections, 
> `hostPort`
>   // will be null, and the client connection should be used to 
> contact the executor.
>   val executorAddress = if (executorRef.address != null) {
>   executorRef.address
> } else {
>   context.senderAddress
> }
> But it doesn't use executorAddress when it creates the ExecutorData.
> This causes removeExecutor to never remove it properly from 
> addressToExecutorId.
> addressToExecutorId -= executorInfo.executorAddress
> This is also a memory leak, and it can also cause onDisconnected to call 
> disableExecutor when it shouldn't.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22653) executorAddress registered in CoarseGrainedSchedulerBackend.executorDataMap is null

2017-11-29 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-22653:
-

 Summary: executorAddress registered in 
CoarseGrainedSchedulerBackend.executorDataMap is null
 Key: SPARK-22653
 URL: https://issues.apache.org/jira/browse/SPARK-22653
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.2.0
Reporter: Thomas Graves


In CoarseGrainedSchedulerBackend.RegisterExecutor the executor data address 
(executorRef.address) can be null.

 val data = new ExecutorData(executorRef, executorRef.address, hostname,
cores, cores, logUrls)

At this point executorRef.address can be null; there is actually code above 
it that handles this case:

 // If the executor's rpc env is not listening for incoming connections, 
`hostPort`
  // will be null, and the client connection should be used to contact 
the executor.
  val executorAddress = if (executorRef.address != null) {
  executorRef.address
} else {
  context.senderAddress
}

But it doesn't use executorAddress when it creates the ExecutorData.

This causes removeExecutor to never remove it properly from addressToExecutorId.

addressToExecutorId -= executorInfo.executorAddress

This is also a memory leak, and it can also cause onDisconnected to call 
disableExecutor when it shouldn't.
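
A sketch of the fix being described, simply combining the two snippets above 
inside the RegisterExecutor handler (not a verbatim patch):

{code}
val executorAddress = if (executorRef.address != null) {
  executorRef.address
} else {
  context.senderAddress
}
// use the resolved address rather than executorRef.address, so that
// addressToExecutorId -= executorInfo.executorAddress works on removal
val data = new ExecutorData(executorRef, executorAddress, hostname,
  cores, cores, logUrls)
{code}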






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22652) remove set methods in ColumnarRow

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22652:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove set methods in ColumnarRow
> -
>
> Key: SPARK-22652
> URL: https://issues.apache.org/jira/browse/SPARK-22652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22652) remove set methods in ColumnarRow

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270953#comment-16270953
 ] 

Apache Spark commented on SPARK-22652:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19847

> remove set methods in ColumnarRow
> -
>
> Key: SPARK-22652
> URL: https://issues.apache.org/jira/browse/SPARK-22652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270956#comment-16270956
 ] 

Sean Owen commented on SPARK-22636:
---

Yes that's my understanding. We can wait a beat here to see if someone with 
more in-depth knowledge of this has a different opinion, but I believe this is 
foremost a Hive issue.

> row count not being set correctly (always 0) after Statement.executeUpdate()
> 
>
> Key: SPARK-22636
> URL: https://issues.apache.org/jira/browse/SPARK-22636
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 
> 11:16:57 EDT 2014 x86_64 x
> 86_64 x86_64 GNU/Linux
>Reporter: Sergio Lob
>Priority: Minor
>
> This is a similar complaint to HIVE-8244.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22652) remove set methods in ColumnarRow

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22652:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove set methods in ColumnarRow
> -
>
> Key: SPARK-22652
> URL: https://issues.apache.org/jira/browse/SPARK-22652
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22652) remove set methods in ColumnarRow

2017-11-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22652:
---

 Summary: remove set methods in ColumnarRow
 Key: SPARK-22652
 URL: https://issues.apache.org/jira/browse/SPARK-22652
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22641) Pyspark UDF relying on column added with withColumn after distinct

2017-11-29 Thread Andrew Duffy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Duffy updated SPARK-22641:
-
Description: 
We seem to have found an issue with PySpark UDFs interacting with 
{{withColumn}} when the UDF depends on the column added in {{withColumn}}, but 
_only_ if {{withColumn}} is performed after a {{distinct()}}.

Simplest repro in a local PySpark shell:

{code}
import pyspark.sql.functions as F

@F.udf
def ident(x):
    return x

spark.createDataFrame([{'a': '1'}]) \
    .distinct() \
    .withColumn('b', F.lit('qq')) \
    .withColumn('fails_here', ident('b')) \
    .collect()
{code}

This fails with the following exception:

{code}
py4j.protocol.Py4JJavaError: An error occurred while calling 
o263.collectToPython.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
attribute, tree: pythonUDF0#97
at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:514)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$38.apply(HashAggregateExec.scala:513)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultFunction(HashAggregateExec.scala:513)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:659)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:164)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
at 
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:80)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
at 
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:80)
at 
org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:374)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:422)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at 

[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()

2017-11-29 Thread Sergio Lob (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270920#comment-16270920
 ] 

Sergio Lob commented on SPARK-22636:


Since we are using the Hive JDBC driver to access Spark, I guess that if there 
were a JDBC fix, it would be in the Hive JDBC driver. Also, since Spark 
mimics Hive's functionality, I suppose you're implying that the functionality 
would have to be implemented in Hive first before being considered for Spark. 
Does that sound correct?
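
For context, a minimal sketch of how the symptom shows up through the Hive JDBC 
driver against a Spark Thrift Server (the URL, table name and driver availability 
are assumptions):

{code}
import java.sql.DriverManager

object UpdateCountRepro {
  def main(args: Array[String]): Unit = {
    // assumes a running Spark Thrift Server and the Hive JDBC driver on the classpath
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
    val stmt = conn.createStatement()
    val updated = stmt.executeUpdate("INSERT INTO t VALUES (1)")
    // per this report (and HIVE-8244), this prints 0 regardless of the rows written
    println(s"update count: $updated")
    stmt.close()
    conn.close()
  }
}
{code}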

> row count not being set correctly (always 0) after Statement.executeUpdate()
> 
>
> Key: SPARK-22636
> URL: https://issues.apache.org/jira/browse/SPARK-22636
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 
> 11:16:57 EDT 2014 x86_64 x
> 86_64 x86_64 GNU/Linux
>Reporter: Sergio Lob
>Priority: Minor
>
> This is a similar complaint to HIVE-8244.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-11-29 Thread Mark Petruska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270912#comment-16270912
 ] 

Mark Petruska commented on SPARK-22393:
---

[~srowen], [~rdub], I can now confirm that the original bug fix that was pushed 
to Scala 2.12 fixes this issue. I succeeded in retrofitting the same changes into 
spark-shell; see https://github.com/apache/spark/pull/19846.
The original fix for Scala 2.12 can be found at 
https://github.com/scala/scala/pull/5640
The downside is that the code/fix is not the most approachable; I could not 
refactor it for better readability (while also making sure it compiles :) ).

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22393:


Assignee: Apache Spark

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270903#comment-16270903
 ] 

Apache Spark commented on SPARK-22393:
--

User 'mpetruska' has created a pull request for this issue:
https://github.com/apache/spark/pull/19846

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22393:


Assignee: (was: Apache Spark)

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22393) spark-shell can't find imported types in class constructors, extends clause

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270897#comment-16270897
 ] 

Sean Owen commented on SPARK-22393:
---

OK, so this is basically "fixed for Scala 2.12 only"?

> spark-shell can't find imported types in class constructors, extends clause
> ---
>
> Key: SPARK-22393
> URL: https://issues.apache.org/jira/browse/SPARK-22393
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.2, 2.2.0
>Reporter: Ryan Williams
>Priority: Minor
>
> {code}
> $ spark-shell
> …
> scala> import org.apache.spark.Partition
> import org.apache.spark.Partition
> scala> class P(p: Partition)
> :11: error: not found: type Partition
>class P(p: Partition)
>   ^
> scala> class P(val index: Int) extends Partition
> :11: error: not found: type Partition
>class P(val index: Int) extends Partition
>^
> {code}
> Any class that I {{import}} gives "not found: type ___" when used as a 
> parameter to a class, or in an extends clause; this applies to classes I 
> import from JARs I provide via {{--jars}} as well as core Spark classes as 
> above.
> This worked in 1.6.3 but has been broken since 2.0.0.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()

2017-11-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270895#comment-16270895
 ] 

Sean Owen commented on SPARK-22636:
---

No. I'm saying the ticket you linked to does not sound like a bug (though it's 
marked as such and you marked this one as such). It's a behavior change. It's 
also unresolved -- doesn't mean unresolvable, but also means it is not 
something even Hive does now, and Spark generally matches Hive's semantics and 
functionality. I also am not clear where you mean Spark needs to implement 
this. It does not implement a JDBC API like Statement.

> row count not being set correctly (always 0) after Statement.executeUpdate()
> 
>
> Key: SPARK-22636
> URL: https://issues.apache.org/jira/browse/SPARK-22636
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 
> 11:16:57 EDT 2014 x86_64 x
> 86_64 x86_64 GNU/Linux
>Reporter: Sergio Lob
>Priority: Minor
>
> This is a similar complaint to HIVE-8244.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22636) row count not being set correctly (always 0) after Statement.executeUpdate()

2017-11-29 Thread Sergio Lob (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270865#comment-16270865
 ] 

Sergio Lob commented on SPARK-22636:


Are you implying that it's not "fixable" in either Hive or Spark?

> row count not being set correctly (always 0) after Statement.executeUpdate()
> 
>
> Key: SPARK-22636
> URL: https://issues.apache.org/jira/browse/SPARK-22636
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 2.2.0
> Environment: Linux lnxx64r7 3.10.0-123.el7.x86_64 #1 SMP Mon May 5 
> 11:16:57 EDT 2014 x86_64 x
> 86_64 x86_64 GNU/Linux
>Reporter: Sergio Lob
>Priority: Minor
>
> This is a similar complaint to HIVE-8244.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22633) spark-submit.cmd cannot handle long arguments

2017-11-29 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270753#comment-16270753
 ] 

Hyukjin Kwon commented on SPARK-22633:
--

{{spark-submit2.cmd}} is only there for the purpose of isolating environment 
problems, BTW. Calling the {{2.cmd}} script directly is fine.

> spark-submit.cmd cannot handle long arguments
> -
>
> Key: SPARK-22633
> URL: https://issues.apache.org/jira/browse/SPARK-22633
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.1.1
> Environment: Windows 7 x64
>Reporter: Olivier Sannier
>  Labels: windows
>
> Hello,
> Under Windows, one would use spark-submit.cmd with the parameters required to 
> submit a program to Spark which has the following implementation:
> {{cmd /V /E /C "%~dp0spark-submit2.cmd" %*}}
> This spawns a second shell to ensure changes to the environment are local to 
> the script and do not leak to the caller.
> But this has a major drawback as it hits the 2048 characters limit for a 
> cmd.exe argument:
> https://support.microsoft.com/en-us/help/830473/command-prompt-cmd--exe-command-line-string-limitation
> One workaround is to call {{spark-submit2.cmd}} directly but it means a 
> specific command for Windows usage.
> The other solution is to remove the call to {{cmd}} and replace it with a 
> call to {{setlocal}} before calling {{spark-submit2.cmd}} leading to this 
> code:
> {{setlocal}}
> {{"%~dp0spark-submit2.cmd" %*}}
> Using this here solved the issue altogether but I'm not sure it can be 
> applied to older Windows versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22651:


Assignee: Apache Spark

> Calling ImageSchema.readImages initiate multiple Hive clients
> -
>
> Key: SPARK-22651
> URL: https://issues.apache.org/jira/browse/SPARK-22651
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> While playing with images, I realised that calling {{ImageSchema.readImages}} 
> multiple times seems to attempt to create multiple Hive clients.
> {code}
> from pyspark.ml.image import ImageSchema
> data_path = 'data/mllib/images/kittens'
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> {code}
> {code}
> ...
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
> ...
>   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> ...
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> ...
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
> ...
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
>   at 
> org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
> ...
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 121 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /.../spark/metastore_db.
> ...
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
> dropImageFailures, float(sampleRatio), seed)
>   File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
> 1160, in __call__
>   File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22650) spark2.2 on yarn streaming can't connect hbase

2017-11-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22650.
---
Resolution: Invalid

Questions to the mailing list please

> spark2.2  on yarn   streaming can't connect hbase 
> --
>
> Key: SPARK-22650
> URL: https://issues.apache.org/jira/browse/SPARK-22650
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, YARN
>Affects Versions: 2.2.0
> Environment: HDFS 2.6.0+cdh5.5.2+992   
> HttpFS2.6.0+cdh5.5.2+992   
> YARN  2.6.0+cdh5.5.2+992
> HBase 1.0.0+cdh5.5.2+297
> Sparkspark-2.2.0-bin-hadoop2.6
>Reporter: ZHOUBEIHUA
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Hi,
> We can't use Spark Streaming to connect to HBase under Kerberos with the Spark 
> token.
> Can you give some advice on using Spark's own method (rather than the HBase 
> UGI) to connect to HBase?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients

2017-11-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22651:


Assignee: (was: Apache Spark)

> Calling ImageSchema.readImages initiate multiple Hive clients
> -
>
> Key: SPARK-22651
> URL: https://issues.apache.org/jira/browse/SPARK-22651
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> While playing with images, I realised that calling {{ImageSchema.readImages}} 
> multiple times seems to attempt to create multiple Hive clients.
> {code}
> from pyspark.ml.image import ImageSchema
> data_path = 'data/mllib/images/kittens'
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> {code}
> {code}
> ...
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
> ...
>   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> ...
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> ...
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
> ...
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
>   at 
> org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
> ...
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 121 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /.../spark/metastore_db.
> ...
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
> dropImageFailures, float(sampleRatio), seed)
>   File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
> 1160, in __call__
>   File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients

2017-11-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16270721#comment-16270721
 ] 

Apache Spark commented on SPARK-22651:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19845

> Calling ImageSchema.readImages initiate multiple Hive clients
> -
>
> Key: SPARK-22651
> URL: https://issues.apache.org/jira/browse/SPARK-22651
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> While playing with images, I realised that calling {{ImageSchema.readImages}} 
> multiple times seems to attempt to create multiple Hive clients.
> {code}
> from pyspark.ml.image import ImageSchema
> data_path = 'data/mllib/images/kittens'
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> {code}
> {code}
> ...
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
> ...
>   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> ...
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> ...
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
> ...
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
>   at 
> org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
> ...
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 121 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /.../spark/metastore_db.
> ...
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
> dropImageFailures, float(sampleRatio), seed)
>   File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
> 1160, in __call__
>   File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22651) Calling ImageSchema.readImages initiate multiple Hive clients

2017-11-29 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22651:
-
Component/s: ML

> Calling ImageSchema.readImages initiate multiple Hive clients
> -
>
> Key: SPARK-22651
> URL: https://issues.apache.org/jira/browse/SPARK-22651
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>
> While playing with images, I realised that calling {{ImageSchema.readImages}} 
> multiple times seems to attempt to create multiple Hive clients.
> {code}
> from pyspark.ml.image import ImageSchema
> data_path = 'data/mllib/images/kittens'
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> _ = ImageSchema.readImages(data_path, recursive=True, 
> dropImageFailures=True).collect()
> {code}
> {code}
> ...
> org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test 
> connection to the given database. JDBC url = 
> jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
> Terminating connection pool (set lazyInit to true if you expect to start your 
> database after your app). Original Exception: --
> java.sql.SQLException: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
> ...
>   at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
> ...
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> ...
>   at 
> org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
> ...
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
>   at 
> org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
> ...
> Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class 
> loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see 
> the next exception for details.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   ... 121 more
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /.../spark/metastore_db.
> ...
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
> dropImageFailures, float(sampleRatio), seed)
>   File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 
> 1160, in __call__
>   File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: 
> java.lang.RuntimeException: Unable to instantiate 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22642) the createdTempDir will not be deleted if an exception occurs

2017-11-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22642:
--
Priority: Minor  (was: Critical)

This is hardly critical

> the createdTempDir will not be deleted if an exception occurs
> -
>
> Key: SPARK-22642
> URL: https://issues.apache.org/jira/browse/SPARK-22642
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Minor
>
> We found that staging directories are sometimes not dropped in our production 
> environment.
> The createdTempDir will not be deleted if an exception occurs; we should 
> delete createdTempDir in a finally block.
> Refer to SPARK-18703.
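
A self-contained sketch of the proposed pattern (illustrative names, not the 
actual Spark SQL Hive insert code): the staging directory is removed in a 
finally block, so it is dropped even when the write fails.

{code}
import java.nio.file.{Files, Path}
import scala.collection.JavaConverters._

object StagingDirCleanup {
  def main(args: Array[String]): Unit = {
    // stand-in for the .hive-staging directory created for an insert
    val createdTempDir: Path = Files.createTempDirectory("hive-staging")
    try {
      Files.createFile(createdTempDir.resolve("part-00000"))
      throw new RuntimeException("simulated failure during the insert")
    } catch {
      case e: RuntimeException => println(s"insert failed: ${e.getMessage}")
    } finally {
      // delete children first, then the directory itself
      Files.walk(createdTempDir).iterator().asScala.toSeq.reverse
        .foreach(p => Files.deleteIfExists(p))
    }
    println(s"staging dir still exists: ${Files.exists(createdTempDir)}")
  }
}
{code}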



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


