[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331847#comment-16331847 ] Gaurav edited comment on SPARK-18016 at 1/19/18 7:31 AM: - Saving a DataFrame with 4 columns in snappy.parquet format to HDFS still gives the constant pool exception with the "Spark-core 2.4.0-SNAPSHOT" jar, as shown below: *Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0x* *at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)* I have a requirement where I have a DataFrame more than 60K columns wide, for which correlation has to be calculated. was (Author: gaurav.garg): Saving a DataFrame with 4 columns in snappy.parquet format to HDFS still gives the constant pool exception with the "Spark-core 2.4.0-SNAPSHOT" jar. I have a requirement where I have a DataFrame more than 60K columns wide, for which correlation has to be calculated.
> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.comp
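The limit the exception refers to is the JVM class-file constant pool: its entry count is stored as an unsigned 16-bit value, so a single generated class can hold at most 0xFFFF (65535) entries. A back-of-the-envelope sketch (plain Python, not Spark code; the entries-per-column figure is an assumption for illustration, not measured from Spark codegen) of why a 60K-column schema overflows one SpecificUnsafeProjection class:

```python
# Back-of-the-envelope sketch, not Spark code: the class-file field
# constant_pool_count is an unsigned 16-bit value, so one generated class
# can hold at most 0xFFFF (65535) constant pool entries.
JVM_CONSTANT_POOL_LIMIT = 0xFFFF

def pool_entries_needed(num_columns, entries_per_column=4):
    """Rough estimate of pool entries for a generated projection.

    entries_per_column is an assumed figure (field refs, method refs,
    string literals per column), not a number measured from Spark.
    """
    return num_columns * entries_per_column

# The 60K-column DataFrame from the comment overflows easily...
assert pool_entries_needed(60_000) > JVM_CONSTANT_POOL_LIMIT
# ...while 4 columns alone should not, which is what makes the reported
# 4-column failure surprising.
assert pool_entries_needed(4) < JVM_CONSTANT_POOL_LIMIT
```

Under this (assumed) per-column cost, anything past roughly 16K columns would exceed the pool in a single class, which is why the fix had to split generated code across classes rather than raise a limit.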
[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331847#comment-16331847 ] Gaurav edited comment on SPARK-18016 at 1/19/18 7:29 AM: - saving DataFrame having 4 columns in snappy.parquet format in HDFS still giving Constant pool exception using "Spark-core 2.4.0-SNAPSHOT" jar I have requirement where i am having more than 60K columns wide DataFrame of which correlation has to be calculated. was (Author: gaurav.garg): saving DataFrame having 4 columns in snappy.parquet format in HDFS still giving Constant pool exception using "Spark-core 2.4.0-SNAPSHOT" jar > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemb
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331847#comment-16331847 ] Gaurav commented on SPARK-18016: saving DataFrame having 4 columns in snappy.parquet format in HDFS still giving Constant pool exception using "Spark-core 2.4.0-SNAPSHOT" jar > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at 
org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus
[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331817#comment-16331817 ] jifei_yang commented on SPARK-23148: If you must do this, I think it is best to escape the path. > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
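The stack trace shows the literal directory "/tmp/a b c" surfacing as "/tmp/a%20b%20c", i.e. the path has been percent-encoded by a URI layer and then used as a raw filesystem path. A minimal, Spark-free illustration of that mismatch (Python's urllib is used here only to demonstrate the encoding, it is not the library involved in the bug):

```python
from urllib.parse import quote, unquote

# The directory really on disk contains literal spaces.
raw_path = "/tmp/a b c/a.csv"

# Percent-encoding the path once produces exactly the form seen in the
# FileNotFoundException ("/tmp/a%20b%20c/...").
encoded = quote(raw_path)
print(encoded)  # /tmp/a%20b%20c/a.csv

# No directory named "a%20b%20c" exists, so treating the encoded string as
# a literal path fails; decoding it once recovers the real location.
assert unquote(encoded) == raw_path
```

That the non-multiLine reader works on the same path suggests only the multiLine code path feeds the already-encoded form back to the filesystem.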
[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331815#comment-16331815 ] Gera Shegalov commented on SPARK-12963: --- We hit the same issue on nodes where a process is not allowed to listen on all NICs. An easy fix is to make sure that the Driver in ApplicationMaster inherits an explicitly configured public hostname of the NodeManager. > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have a 3-node cluster: namenode, second and data1; > I use this shell command to submit a job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The Driver may be started on another node such as data1. > The problem is: > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, > the driver will be started with this param, such as > SPARK_LOCAL_IP=namenode, > but the driver will start at data1, > the driver will try to bind the IP 'namenode' on data1, > so the driver will throw an exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
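The failure mode above is a single cluster-wide SPARK_LOCAL_IP being inherited by a driver that lands on a different node. A sketch of the spark-env.sh misconfiguration and one common per-node workaround (the workaround is a general pattern, not a fix taken from this ticket):

```sh
# conf/spark-env.sh
# Problematic: every node that ends up launching the driver will try to
# bind the address 'namenode', which fails anywhere except namenode itself.
# SPARK_LOCAL_IP=namenode

# Per-node alternative: resolve each machine's own address instead of
# hard-coding one hostname for the whole cluster.
SPARK_LOCAL_IP=$(hostname -i)
```

Since spark-env.sh is sourced on the node where each process starts, a value computed per node stays correct no matter where the cluster-mode driver is placed.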
[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331811#comment-16331811 ] Joseph K. Bradley commented on SPARK-23154: --- CC [~mlnick], [~yanboliang] or others, what do you think? > Document backwards compatibility guarantees for ML persistence > -- > > Key: SPARK-23154 > URL: https://issues.apache.org/jira/browse/SPARK-23154 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > We have (as far as I know) maintained backwards compatibility for ML > persistence, but this is not documented anywhere. I'd like us to document it > (for spark.ml, not for spark.mllib). > I'd recommend something like: > {quote} > In general, MLlib maintains backwards compatibility for ML persistence. > I.e., if you save an ML model or Pipeline in one version of Spark, then you > should be able to load it back and use it in a future version of Spark. > However, there are rare exceptions, described below. > Model persistence: Is a model or Pipeline saved using Apache Spark ML > persistence in Spark version X loadable by Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Yes; these are backwards compatible. > * Note about the format: There are no guarantees for a stable persistence > format, but model loading itself is designed to be backwards compatible. > Model behavior: Does a model or Pipeline in Spark version X behave > identically in Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Identical behavior, except for bug fixes. > For both model persistence and model behavior, any breaking changes across a > minor version or patch version are reported in the Spark version release > notes. 
If a breakage is not reported in release notes, then it should be > treated as a bug to be fixed. > {quote} > How does this sound? > Note: We unfortunately don't have tests for backwards compatibility (which > has technical hurdles and can be discussed in [SPARK-15573]). However, we > have made efforts to maintain it during PR review and Spark release QA, and > most users expect it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331799#comment-16331799 ] Apache Spark commented on SPARK-12963: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/20327 > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have a 3-node cluster: namenode, second and data1; > I use this shell command to submit a job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The Driver may be started on another node such as data1. > The problem is: > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, > the driver will be started with this param, such as > SPARK_LOCAL_IP=namenode, > but the driver will start at data1, > the driver will try to bind the IP 'namenode' on data1, > so the driver will throw an exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23155: Assignee: (was: Apache Spark) > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Priority: Major > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23155: Assignee: Apache Spark > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Assignee: Apache Spark >Priority: Major > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331776#comment-16331776 ] Apache Spark commented on SPARK-23155: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/20326 > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Priority: Major > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22808) saveAsTable() should be marked as deprecated
[ https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22808: Assignee: Apache Spark > saveAsTable() should be marked as deprecated > > > Key: SPARK-22808 > URL: https://issues.apache.org/jira/browse/SPARK-22808 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Jason Vaccaro >Assignee: Apache Spark >Priority: Major > > As discussed in SPARK-16803, saveAsTable is not supported as a method for > writing to Hive and insertInto should be used instead. However, on the java > api documentation for version 2.1.1, the saveAsTable method is not marked as > deprecated and the programming guides indicate that saveAsTable is the proper > way to write to Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22808) saveAsTable() should be marked as deprecated
[ https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22808: Assignee: (was: Apache Spark) > saveAsTable() should be marked as deprecated > > > Key: SPARK-22808 > URL: https://issues.apache.org/jira/browse/SPARK-22808 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Jason Vaccaro >Priority: Major > > As discussed in SPARK-16803, saveAsTable is not supported as a method for > writing to Hive and insertInto should be used instead. However, on the java > api documentation for version 2.1.1, the saveAsTable method is not marked as > deprecated and the programming guides indicate that saveAsTable is the proper > way to write to Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22808) saveAsTable() should be marked as deprecated
[ https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331773#comment-16331773 ] Apache Spark commented on SPARK-22808: -- User 'brandonJY' has created a pull request for this issue: https://github.com/apache/spark/pull/20325 > saveAsTable() should be marked as deprecated > > > Key: SPARK-22808 > URL: https://issues.apache.org/jira/browse/SPARK-22808 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Jason Vaccaro >Priority: Major > > As discussed in SPARK-16803, saveAsTable is not supported as a method for > writing to Hive and insertInto should be used instead. However, on the java > api documentation for version 2.1.1, the saveAsTable method is not marked as > deprecated and the programming guides indicate that saveAsTable is the proper > way to write to Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
Gera Shegalov created SPARK-23155: - Summary: YARN-aggregated executor/driver logs appear unavailable when NM is down Key: SPARK-23155 URL: https://issues.apache.org/jira/browse/SPARK-23155 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov Unlike the MapReduce JobHistory Server, the Spark history server isn't rewriting container log URLs to point to the aggregated yarn.log.server.url location and relies on the NodeManager web UI to trigger a redirect. This fails when the NM is down. Note that the NM may be down permanently after decommissioning in traditional environments, or in a cloud environment such as AWS EMR where either worker nodes are taken away by autoscaling or the whole cluster is used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
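What "rewriting container log URLs" could look like: point a container log link at yarn.log.server.url instead of the possibly-dead NodeManager, following the path layout the NM redirect conventionally uses (<log-server>/<nodeId>/<containerId>/<containerId>/<user>). The hostnames and IDs below are made up for illustration, and the layout is an assumption about the YARN convention rather than code from a Spark patch:

```python
def rewrite_log_url(log_server_url, node_id, container_id, user):
    """Point a container log link at the aggregated log server instead of
    the (possibly dead) NodeManager web UI."""
    base = log_server_url.rstrip("/")
    # The container id appears twice in the JHS-style layout: once as the
    # aggregation key and once as the log entity being viewed.
    return f"{base}/{node_id}/{container_id}/{container_id}/{user}"

url = rewrite_log_url(
    "http://historyserver:19888/jobhistory/logs",   # yarn.log.server.url
    "worker1:8041",                                 # NM that may be gone
    "container_1516000000000_0001_01_000002",
    "hadoop",
)
print(url)
```

Building the aggregated URL directly in the history server removes the dependency on the NM being alive to serve the redirect, which is the gap this ticket describes.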
[jira] [Commented] (SPARK-22977) DataFrameWriter operations do not show details in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331754#comment-16331754 ] Yuming Wang commented on SPARK-22977: - cc [~cloud_fan] > DataFrameWriter operations do not show details in SQL tab > - > > Key: SPARK-22977 > URL: https://issues.apache.org/jira/browse/SPARK-22977 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > When running CreateHiveTableAsSelectCommand or InsertIntoHiveTable, the SQL tab doesn't > show details after > [SPARK-20213|https://issues.apache.org/jira/browse/SPARK-20213]. > *Before*: > !before.png! > *After*: > !after.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23097) Migrate text socket source
[ https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331735#comment-16331735 ] Saisai Shao commented on SPARK-23097: - Thanks, will do. > Migrate text socket source > -- > > Key: SPARK-23097 > URL: https://issues.apache.org/jira/browse/SPARK-23097 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23097) Migrate text socket source
[ https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331726#comment-16331726 ] Jose Torres commented on SPARK-23097: - Certainly! There's no specific plan right now. I'm working on a list of pointers for migrating sources; shoot me an email if you want a link to a rough draft. > Migrate text socket source > -- > > Key: SPARK-23097 > URL: https://issues.apache.org/jira/browse/SPARK-23097 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23097) Migrate text socket source
[ https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331715#comment-16331715 ] Saisai Shao commented on SPARK-23097: - Hi [~joseph.torres], would you mind if I take a shot on this issue (in case you don't have a plan on it) :). > Migrate text socket source > -- > > Key: SPARK-23097 > URL: https://issues.apache.org/jira/browse/SPARK-23097 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23054) Incorrect results of casting UserDefinedType to String
[ https://issues.apache.org/jira/browse/SPARK-23054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23054. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.3.0 > Incorrect results of casting UserDefinedType to String > -- > > Key: SPARK-23054 > URL: https://issues.apache.org/jira/browse/SPARK-23054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.3.0 > > > {code} > >>> from pyspark.ml.classification import MultilayerPerceptronClassifier > >>> from pyspark.ml.linalg import Vectors > >>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, > >>> Vectors.dense([0.0, 1.0]))], ["label", "features"]) > >>> df.selectExpr("CAST(features AS STRING)").show(truncate = False) > +---+ > |features | > +---+ > |[6,1,0,0,280020,2,0,0,0] | > |[6,1,0,0,280020,2,0,0,3ff0]| > +---+ > {code}
[jira] [Updated] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23154: -- Description: We have (as far as I know) maintained backwards compatibility for ML persistence, but this is not documented anywhere. I'd like us to document it (for spark.ml, not for spark.mllib). I'd recommend something like: {quote} In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below. Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Yes; these are backwards compatible. * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible. Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Identical behavior, except for bug fixes. For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed. {quote} How does this sound? Note: We unfortunately don't have tests for backwards compatibility (which has technical hurdles and can be discussed in [SPARK-15573]). However, we have made efforts to maintain it during PR review and Spark release QA, and most users expect it. was: We have (as far as I know) maintained backwards compatibility for ML persistence, but this is not documented anywhere. I'd like us to document it (for spark.ml, not for spark.mllib). 
I'd recommend something like: {quote} In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below. Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Yes; these are backwards compatible. * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible. Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Identical behavior, except for bug fixes. For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed. {quote} How does this sound? > Document backwards compatibility guarantees for ML persistence > -- > > Key: SPARK-23154 > URL: https://issues.apache.org/jira/browse/SPARK-23154 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > We have (as far as I know) maintained backwards compatibility for ML > persistence, but this is not documented anywhere. I'd like us to document it > (for spark.ml, not for spark.mllib). > I'd recommend something like: > {quote} > In general, MLlib maintains backwards compatibility for ML persistence. > I.e., if you save an ML model or Pipeline in one version of Spark, then you > should be able to load it back and use it in a future version of Spark. 
> However, there are rare exceptions, described below. > Model persistence: Is a model or Pipeline saved using Apache Spark ML > persistence in Spark version X loadable by Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Yes; these are backwards compatible. > * Note about the format: There are no guarantees for a stable persistence > format, but model loading itself is designed to be backwards compatible. > Model behavior: Does a model or Pipeline in Spark version X behave > identically in Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Identical behavior, except for bug fixes. > For both model persistence and model behavior, any brea
[jira] [Created] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
Joseph K. Bradley created SPARK-23154: - Summary: Document backwards compatibility guarantees for ML persistence Key: SPARK-23154 URL: https://issues.apache.org/jira/browse/SPARK-23154 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We have (as far as I know) maintained backwards compatibility for ML persistence, but this is not documented anywhere. I'd like us to document it (for spark.ml, not for spark.mllib). I'd recommend something like: {quote} In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below. Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Yes; these are backwards compatible. * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible. Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Identical behavior, except for bug fixes. For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed. {quote} How does this sound?
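The guarantee matrix proposed above can be captured as a tiny rule table. The sketch below is purely illustrative (Spark exposes no such function; the name and return labels are invented here), but it encodes the quoted guarantees: best-effort across major versions, backwards compatible within one.

```python
def persistence_guarantee(saved_in: str, loaded_in: str) -> str:
    """Classify the ML-persistence guarantee proposed above for a model
    saved under Spark version `saved_in` and loaded under `loaded_in`.
    Illustrative only; this is not a Spark API."""
    saved_major = int(saved_in.split(".")[0])
    loaded_major = int(loaded_in.split(".")[0])
    if loaded_major != saved_major:
        # Across major versions: no guarantees, but best-effort.
        return "best-effort"
    # Within a major version, minor/patch releases are backwards compatible;
    # any breakage should appear in release notes or be treated as a bug.
    return "compatible"
```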
[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23091: Assignee: Apache Spark > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Assignee: Apache Spark >Priority: Minor > > Currently, the test for `approxQuantile` (a quantile estimation algorithm) checks > whether the estimated quantile is within +- 2*`relativeError` of the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` of the true > quantile. Using double the tolerance is misleading and incorrect, and we > should fix it. >
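The guarantee at issue can be checked directly: for a dataset of n values, a Greenwald-Khanna estimate of quantile q must have rank within +- relativeError * n of the target rank q * n, with no factor of 2. A self-contained sketch (helper name invented here; this is not Spark's DataFrameStatSuite code):

```python
def within_gk_bound(data, estimate, quantile, relative_error):
    """Return True iff `estimate` satisfies the Greenwald-Khanna rank
    guarantee for `quantile`:
        |rank(estimate) - quantile * n| <= relative_error * n
    Illustrative helper, not Spark's test code."""
    values = sorted(data)
    n = len(values)
    target_rank = quantile * n
    # Rank of the estimate = number of elements <= it.
    rank = sum(1 for v in values if v <= estimate)
    return abs(rank - target_rank) <= relative_error * n
```

A check using 2 * relative_error would also accept estimates twice as far from the target rank, which is exactly the looseness the report objects to.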
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331542#comment-16331542 ] Saisai Shao commented on SPARK-23128: - I'm wondering if we need a SPIP for such changes? > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However, there are > some limitations. First, it may cause additional shuffles that can decrease > performance. We can see this from the EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don't have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e., for > a three-table join in a single stage, the same ExchangeCoordinator should be used > for all three Exchanges, but currently two separate ExchangeCoordinators are > added. Thirdly, with the current framework it is not easy to flexibly implement other > features in adaptive execution, like changing the execution plan and > handling skewed joins at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23091: Assignee: (was: Apache Spark) > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Priority: Minor > > Currently, the test for `approxQuantile` (a quantile estimation algorithm) checks > whether the estimated quantile is within +- 2*`relativeError` of the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` of the true > quantile. Using double the tolerance is misleading and incorrect, and we > should fix it. >
[jira] [Commented] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331541#comment-16331541 ] Apache Spark commented on SPARK-23091: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/20324 > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Priority: Minor > > Currently, the test for `approxQuantile` (a quantile estimation algorithm) checks > whether the estimated quantile is within +- 2*`relativeError` of the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` of the true > quantile. Using double the tolerance is misleading and incorrect, and we > should fix it. >
[jira] [Resolved] (SPARK-23142) Add documentation for Continuous Processing
[ https://issues.apache.org/jira/browse/SPARK-23142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23142. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20308 [https://github.com/apache/spark/pull/20308] > Add documentation for Continuous Processing > --- > > Key: SPARK-23142 > URL: https://issues.apache.org/jira/browse/SPARK-23142 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > >
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22962: -- Assignee: Yinan Li > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Yinan Li >Priority: Major > Fix For: 2.3.0 > > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22962. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20320 [https://github.com/apache/spark/pull/20320] > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Yinan Li >Priority: Major > Fix For: 2.3.0 > > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
[jira] [Resolved] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23094. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20302 [https://github.com/apache/spark/pull/20302] > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters.
[jira] [Assigned] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23094: Assignee: Burak Yavuz > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters.
[jira] [Updated] (SPARK-22216) Improving PySpark/Pandas interoperability
[ https://issues.apache.org/jira/browse/SPARK-22216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-22216: --- Description: This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead between PySpark and Pandas. (was: This is an umbrella ticket tracking the general effect of improving performance and interoperability between PySpark and Pandas. The core idea is to Apache Arrow as serialization format to reduce the overhead between PySpark and Pandas.) > Improving PySpark/Pandas interoperability > - > > Key: SPARK-22216 > URL: https://issues.apache.org/jira/browse/SPARK-22216 > Project: Spark > Issue Type: Epic > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > > This is an umbrella ticket tracking the general effort to improve performance > and interoperability between PySpark and Pandas. The core idea is to use Apache > Arrow as the serialization format to reduce the overhead between PySpark and > Pandas.
[jira] [Assigned] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23133: -- Assignee: Andrew Korzhuev > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Assignee: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23133. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20322 [https://github.com/apache/spark/pull/20322] > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23153) Support application dependencies in submission client's local file system
Yinan Li created SPARK-23153: Summary: Support application dependencies in submission client's local file system Key: SPARK-23153 URL: https://issues.apache.org/jira/browse/SPARK-23153 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li
[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331241#comment-16331241 ] Apache Spark commented on SPARK-23133: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/20322 > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
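The issue above quotes the pipeline that entrypoint.sh uses to turn `SPARK_JAVA_OPT_*` environment variables into driver JVM options; the fix adds an analogous path for the executor. A minimal sketch of that extraction pattern (the variable values here are examples, and `sort` is added only for deterministic output):

```shell
#!/bin/sh
# Simulate the entrypoint.sh pattern: collect every SPARK_JAVA_OPT_* env var
# and strip the "NAME=" prefix, leaving only the JVM option values.
SPARK_JAVA_OPT_0='-Djava.security.auth.login.config=./jaas.conf'
SPARK_JAVA_OPT_1='-Xmx2g'
export SPARK_JAVA_OPT_0 SPARK_JAVA_OPT_1

# Same env | grep | sed pipeline the issue quotes; sed removes everything up
# to the first '=', so values that themselves contain '=' survive intact.
opts=$(env | grep SPARK_JAVA_OPT_ | sort | sed 's/[^=]*=\(.*\)/\1/g')
echo "$opts"
```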
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-19185: -- Assignee: Marcelo Vanzin > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Marcelo Vanzin >Priority: Major > Labels: streaming, windowing > > We've been running into ConcurrentModificationExceptions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > Spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd
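The race described above can be reproduced outside Spark. The following Python sketch (illustrative only, not Kafka's actual implementation) mimics KafkaConsumer's single-threaded-access guard: the consumer records which thread is currently using it and rejects any other thread, which is exactly what happens when two executor task threads share one CachedKafkaConsumer for the same Topic+Partition.

```python
import threading

class SingleThreadConsumer:
    """Illustrative sketch: like KafkaConsumer.acquire(), refuse access
    from any thread other than the one currently using the consumer."""

    def __init__(self):
        self._owner = None            # ident of the thread currently inside
        self._lock = threading.Lock()

    def acquire(self):
        me = threading.get_ident()
        with self._lock:
            if self._owner is not None and self._owner != me:
                # The real KafkaConsumer raises ConcurrentModificationException here
                raise RuntimeError("KafkaConsumer is not safe for multi-threaded access")
            self._owner = me

    def release(self):
        with self._lock:
            self._owner = None

# Simulate the timeline from the report: two task threads on one executor
# handed the same cached consumer for TopicPartition("abc", 0).
consumer = SingleThreadConsumer()
barrier = threading.Barrier(2)
errors = []

def task0():
    consumer.acquire()    # Task0 seeks and starts to poll...
    barrier.wait()        # ...and is still mid-poll here
    barrier.wait()
    consumer.release()

def task1():
    barrier.wait()        # wait until Task0 holds the consumer
    try:
        consumer.acquire()  # Task1's seek fails, as in the logs above
    except RuntimeError as exc:
        errors.append(str(exc))
    barrier.wait()

t0 = threading.Thread(target=task0)
t1 = threading.Thread(target=task1)
t0.start(); t1.start(); t0.join(); t1.join()
print(errors)  # ['KafkaConsumer is not safe for multi-threaded access']
```

The barriers make the interleaving deterministic: Task1 only attempts its seek while Task0 is between acquire and release, so the second thread always hits the guard.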
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-19185: -- Assignee: (was: Marcelo Vanzin) > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Priority: Major > Labels: streaming, windowing > > We've been running into ConcurrentModificationExceptions ("KafkaConsumer is > not safe for multi-threaded access") with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > Spark source code I think this is a bug. > Our setup is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23152: Assignee: Apache Spark > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Assignee: Apache Spark >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input 
data is empty the result "maxLabelRow" array is not empty. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
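The behavior behind this bug is standard SQL aggregate semantics: `max` over an empty relation does not return zero rows, it returns exactly one row whose value is NULL, so `take(1)` is never empty. A small sketch using Python's built-in sqlite3 (not Spark, but the aggregate semantics match) demonstrates it, along with the shape of the proposed extra null check:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE labels (label REAL)")   # empty table, like the empty Dataset

# max() over an empty table yields one row containing NULL, not zero rows.
# This is why maxLabelRow.isEmpty alone never catches the empty-input case.
rows = con.execute("SELECT max(label) FROM labels").fetchall()
print(rows)  # [(None,)]

# Shape of the proposed fix: also inspect the value inside the single row.
max_label_row = rows
detected_empty = False
if not max_label_row or max_label_row[0][0] is None:
    detected_empty = True   # corresponds to throwing SparkException
print(detected_empty)  # True
```

With the value check added, the empty-dataset case raises the intended "ML algorithm was given empty dataset" error instead of a NullPointerException deep inside `Row.getDouble`.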
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23152: Assignee: (was: Apache Spark) > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the 
result "maxLabelRow" array is not empty. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331194#comment-16331194 ] Apache Spark commented on SPARK-23152: -- User 'tovbinm' has created a pull request for this issue: https://github.com/apache/spark/pull/20321 > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if 
(maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the result "maxLabelRow" array is not empty. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22962: Assignee: (was: Apache Spark) > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331185#comment-16331185 ] Apache Spark commented on SPARK-22962: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20320 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22962: Assignee: Apache Spark > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
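The workaround mentioned in the report ("using an http server to provide the app jar") can be sketched in a few lines. This is an illustration only: the jar name and contents are made up, and the real deployment would point `spark-submit` at the served URL instead of the local path the k8s backend cannot ship.

```python
import http.server
import pathlib
import threading
import urllib.request

# Stand-in for the real application jar that spark-submit cannot ship
# to the driver pod when given as a local path.
jar = pathlib.Path("file.jar")
jar.write_bytes(b"dummy jar contents")

# Serve the current directory over HTTP; port 0 lets the OS pick a free port.
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# What the driver pod would do: fetch the jar by URL instead of local path.
url = f"http://127.0.0.1:{server.server_port}/{jar.name}"
fetched = urllib.request.urlopen(url).read()
server.shutdown()
jar.unlink()

print(fetched == b"dummy jar contents")  # True
# spark-submit is then invoked with the URL in place of the local path:
#   ./bin/spark-submit [[bunch of arguments]] http://<host>:<port>/file.jar
```

This sidesteps the failure because the driver container downloads the resource itself rather than relying on the submission client to mount a file that only exists on the submitting machine.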
[jira] [Resolved] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23144. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20311 [https://github.com/apache/spark/pull/20311] > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in function getNumClasses at org.apache.spark.ml.classification.Classifier:106 {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.De
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClass
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22884: Assignee: (was: Apache Spark) > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTree
[jira] [Commented] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331157#comment-16331157 ] Apache Spark commented on SPARK-22884: -- User 'smurakozi' has created a pull request for this issue: https://github.com/apache/spark/pull/20319 > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22884: Assignee: Apache Spark > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331153#comment-16331153 ] Anirudh Ramanathan commented on SPARK-22962: I think this isn't super critical for this release, mostly a usability thing. If it's small enough, it makes sense, but if it introduces risk and we have to redo manual testing, I'd vote against getting this into 2.3. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. 
> The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
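The report above notes that serving the app jar over an http server sidesteps the problem. A hedged sketch of that workaround follows (it is not a documented Spark procedure; the `app.jar` name and the use of Python's stdlib HTTP server are illustrative assumptions): the jar is served from the submitting machine so the driver pod can fetch it by URL instead of receiving a client-local path that never gets shipped.

```python
# Hedged sketch of the http-server workaround: serve the application jar
# from the submitting machine so the driver pod can download it by URL.
# The jar name and server choice are illustrative, not from the issue.
import http.server
import os
import tempfile
import threading
import urllib.request

os.chdir(tempfile.mkdtemp())
open("app.jar", "wb").close()   # stand-in for the real application jar

# Bind port 0 so the OS picks a free port; serve the current directory.
server = http.server.ThreadingHTTPServer(
    ("", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# spark-submit would then be given http://<submitting-host>:<port>/app.jar
# as the app resource (the remaining arguments are cluster-specific and
# elided here, as in the report).
url = "http://localhost:%d/app.jar" % server.server_address[1]
print(urllib.request.urlopen(url).status)  # prints 200
server.shutdown()
```

In a real submission the server must of course be reachable from inside the cluster, which is exactly the usability gap the comments discuss.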
[jira] [Resolved] (SPARK-23143) Add Python support for continuous trigger
[ https://issues.apache.org/jira/browse/SPARK-23143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23143. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20309 [https://github.com/apache/spark/pull/20309] > Add Python support for continuous trigger > - > > Key: SPARK-23143 > URL: https://issues.apache.org/jira/browse/SPARK-23143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > >
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifie
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331141#comment-16331141 ] Michael Armbrust commented on SPARK-20928: -- There is more work to do so I might leave the umbrella open, but we are going to have an initial version that supports reading and writing from/to Kafka with very low latency in Spark 2.3! Stay tuned for updates to docs and blog posts. > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. 
Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism).
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331140#comment-16331140 ] Marcelo Vanzin commented on SPARK-22962: Yes send a PR. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331132#comment-16331132 ] Yinan Li commented on SPARK-22962: -- I agree that before we upstream the staging server, we should fail the submission if a user uses local resources. [~vanzin], if it's not too late to get into 2.3, I'm gonna file a PR for this. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty, the "maxLabelRow" array is not empty. Instead it contains a single null element, WrappedArray(null). Proposed solution: the condition can be modified to also check for that null element. 
{code:java} if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(De
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Target Version/s: 2.4.0 > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend.
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-22962: --- Affects Version/s: (was: 2.4.0) 2.3.0 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Affects Version/s: (was: 2.4.0) 2.3.0 > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend.
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-22962: --- Affects Version/s: (was: 2.3.0) 2.4.0 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331085#comment-16331085 ] Anirudh Ramanathan commented on SPARK-22962: This is the resource staging server use-case. We'll upstream this in the 2.4.0 timeframe. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
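The workaround noted in the issue above (serve the app jar over HTTP rather than referencing a local path) can be sketched as follows. This is only an illustration under assumptions: the host, port, jar name, and main class are all placeholders, and any web server reachable from the cluster would do.

```shell
# Serve the directory containing the app jar over HTTP (port 8000 is a placeholder).
# Python's built-in http.server is one simple option; any web server works.
cd /path/to/jar/dir && python3 -m http.server 8000 &

# Then point spark-submit at the HTTP URL instead of a local file path:
./bin/spark-submit \
  --master k8s://https://<api-server> \
  --deploy-mode cluster \
  --class com.example.Main \
  http://<host-reachable-from-cluster>:8000/file.jar
```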
[jira] [Resolved] (SPARK-23016) Spark UI access and documentation
[ https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan resolved SPARK-23016. Resolution: Fixed This is resolved and we've verified it for 2.3.0. > Spark UI access and documentation > - > > Key: SPARK-23016 > URL: https://issues.apache.org/jira/browse/SPARK-23016 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > We should have instructions to access the spark driver UI, or instruct users > to create a service to expose it. > Also might need an integration test to verify that the driver UI works as > expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
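The two access paths mentioned in SPARK-23016 can be sketched with standard kubectl commands. This is a hedged illustration, not the documentation the issue asks for; the pod and service names are placeholders, and 4040 is the driver UI's default port.

```shell
# Option 1: forward a local port to the driver pod (pod name is a placeholder).
kubectl port-forward <driver-pod-name> 4040:4040
# The UI is then reachable at http://localhost:4040

# Option 2: create a Service exposing the driver pod (names are placeholders).
kubectl expose pod <driver-pod-name> --name=spark-ui --port=4040 --target-port=4040
```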
[jira] [Comment Edited] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081 ] Anirudh Ramanathan edited comment on SPARK-23082 at 1/18/18 7:52 PM: - This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. [~mcheah] was (Author: foxish): This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. > Allow separate node selectors for driver and executors in Kubernetes > > > Key: SPARK-23082 > URL: https://issues.apache.org/jira/browse/SPARK-23082 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.2.0, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark > driver to a different set of nodes from its executors. In Kubernetes, we can > specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use > separate options for the driver and executors. > This would be useful for the particular use case where executors can go on > more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot > instances), but the driver should use a more persistent machine. > The required change would be minimal, essentially just using different config > keys for the > [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] > and > [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] > instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both. 
[jira] [Commented] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081 ] Anirudh Ramanathan commented on SPARK-23082: This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. > Allow separate node selectors for driver and executors in Kubernetes > > > Key: SPARK-23082 > URL: https://issues.apache.org/jira/browse/SPARK-23082 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.2.0, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark > driver to a different set of nodes from its executors. In Kubernetes, we can > specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use > separate options for the driver and executors. > This would be useful for the particular use case where executors can go on > more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot > instances), but the driver should use a more persistent machine. > The required change would be minimal, essentially just using different config > keys for the > [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] > and > [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] > instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both.
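To make the request concrete: today a single selector prefix applies to the driver and executors alike, and the split keys below are only this issue's proposal, not an existing API in 2.3. The API server address and label values are placeholders.

```shell
# Spark 2.3 behavior: one node selector applies to driver AND executors.
./bin/spark-submit \
  --master k8s://https://<api-server> \
  --deploy-mode cluster \
  --conf spark.kubernetes.node.selector.disktype=ssd \
  local:///opt/spark/examples/jars/<app.jar>

# Hypothetical per-role keys this issue proposes (NOT available in 2.3):
#   spark.kubernetes.driver.node.selector.[labelKey]
#   spark.kubernetes.executor.node.selector.[labelKey]
```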
[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331074#comment-16331074 ] Anirudh Ramanathan commented on SPARK-23133: Thanks for submitting the PR fixing this. > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45]
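A sketch of what the missing executor-side handling could look like, assuming it would mirror the driver-side pipeline visible in the `{noformat}` log quoted under SPARK-22962 above (the `SPARK_EXECUTOR_JAVA_OPTS` array name and the executor java command line are assumptions, not the merged fix):

```shell
#!/bin/bash
# Collect the values of all SPARK_JAVA_OPT_* environment variables into a file,
# one per line, exactly as entrypoint.sh already does for the driver...
env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt

# ...read them into a bash array for the executor...
readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt

# ...and pass them to the executor JVM as well, e.g. (sketch, not runnable here):
# ${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" \
#   org.apache.spark.executor.CoarseGrainedExecutorBackend ...
```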
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier() > .setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331045#comment-16331045 ] Thomas Graves commented on SPARK-20928: --- what is status of this, it looks like subtasks are finished? > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. 
*/ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism).
[jira] [Created] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
Matthew Tovbin created SPARK-23152: -- Summary: Invalid guard condition in org.apache.spark.ml.classification.Classifier Key: SPARK-23152 URL: https://issues.apache.org/jira/browse/SPARK-23152 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.3, 2.3.0, 2.3.1 Reporter: Matthew Tovbin When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. 
{code:java} maxLabelRow.size <= 1 {code}
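One caution on the proposed condition: `take(1)` returns a single-element array for non-empty input as well, so `maxLabelRow.size <= 1` would seem to hold for valid data too. A value-based null check avoids that ambiguity. The following Scala sketch is an assumption about what such a guard could look like, not the committed fix; the helper name is hypothetical:

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.Row

// An aggregate such as max(labelCol) over an empty dataset yields one row
// holding a null value, so take(1) is non-empty either way; checking the
// value for null covers the empty-input case, while isEmpty alone does not.
def maxLabelOrFail(maxLabelRow: Array[Row]): Double = {
  if (maxLabelRow.isEmpty || maxLabelRow.head.isNullAt(0)) {
    throw new SparkException("ML algorithm was given empty dataset.")
  }
  maxLabelRow.head.getDouble(0)
}
```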
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier() > .setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} >
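The guard-condition failure above can be reproduced outside Spark: an aggregate such as max() over zero input rows still yields exactly one result row whose value is null, so a check on the collected row count alone never fires. Below is a minimal, hedged Python sketch of a corrected guard; the function name and structure are illustrative, not Spark's actual code.

```python
# Sketch: why `maxLabelRow.isEmpty` never triggers for an empty dataset.
# `dataset.select(max(labelCol)).take(1)` on empty input returns ONE row
# whose single column is null (modeled here as None), so the guard must
# inspect the value, not just the number of rows returned.

def get_num_classes(max_label_rows):
    """max_label_rows mimics the result of select(max(labelCol)).take(1)."""
    if not max_label_rows or max_label_rows[0][0] is None:
        # corrected guard: no rows at all, OR one row holding a null max
        raise ValueError("ML algorithm was given empty dataset.")
    # labels are assumed 0-based doubles, as in ml.classification
    return int(max_label_rows[0][0]) + 1

assert get_num_classes([(2.0,)]) == 3   # max label 2.0 -> 3 classes
```

With this shape of guard, an empty input raises the intended SparkException-style error instead of a NullPointerException deep inside Row.getDouble.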
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23029: - Assignee: Fernando Pereira > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Minor > Fix For: 2.3.0 > > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
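The failure mode reported here follows from the documentation gap the ticket fixes: spark.shuffle.file.buffer is interpreted in KiB when the value carries no unit suffix, so passing $((32*1024)) = 32768 requests a 32 MiB buffer per open partition writer rather than the intended 32 KiB, and many concurrent writers then exhaust the heap. A rough model of that parsing rule (simplified; the real parsing lives in Spark's byte-string utility helpers):

```python
def buffer_bytes(setting: str) -> int:
    """Simplified model: a bare number is treated as KiB; suffixes scale it."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    s = setting.strip().lower()
    if s[-1] in units:
        return int(s[:-1]) * units[s[-1]]
    return int(s) * 1024  # no suffix: KiB, not bytes

# Intending 32 KiB but writing the value in "bytes":
assert buffer_bytes("32768") == 32 * 1024 * 1024  # actually 32 MiB per writer
assert buffer_bytes("32k") == 32 * 1024           # what was meant
```

So the OutOfMemoryError in the report is consistent with 10 shuffle writers each allocating a 32 MiB BufferedOutputStream rather than a 32 KiB one.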
[jira] [Resolved] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23029. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20269 [https://github.com/apache/spark/pull/20269] > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Minor > Fix For: 2.3.0 > > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Issue Comment Deleted] (SPARK-13572) HiveContext reads avro Hive tables incorrectly
[ https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-13572: --- Comment: was deleted (was: Guten Tag, Bonjour, Buongiorno Besten Dank für Ihr E-Mail. Ich bin zurzeit nicht am Arbeitsplatz und kehre am 27.06.2016 zurück. Ihr E-Mail wird nicht weitergeleitet. Merci pour votre courriel. Je suis actuellement absent(e) et serai de retour le 27.06.2016. Votre courriel ne sera pas dévié. La ringrazio per la sua e-mail. Al momento sono assente. Sarò di ritorno il 27.06.2016. Il suo messaggio non è inoltrato a terze persone. Freundliche Grüsse Avec mes meilleures salutations Cordiali saluti Russ Weedon Professional ERP Engineer SBB AG SBB Informatik Solution Center Finanzen, K-SCM, HR und Immobilien Lindenhofstrasse 1, 3000 Bern 65 Mobil +41 78 812 47 62 russ.wee...@sbb.ch / www.sbb.ch ) > HiveContext reads avro Hive tables incorrectly > --- > > Key: SPARK-13572 > URL: https://issues.apache.org/jira/browse/SPARK-13572 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2, 1.6.0, 1.6.1 > Environment: Hive 0.13.1, Spark 1.5.2 >Reporter: Zoltan Fedor >Priority: Major > Attachments: logs, table_definition > > > I am using PySpark to read avro-based tables from Hive and while the avro > tables can be read, some of the columns are incorrectly read - showing value > {{None}} instead of the actual value. 
> {noformat} > >>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest > >>> where year=2016 and month=2 and day=29 limit 3""") > >>> results_df.take(3) > [Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)] > {noformat} > Observe the {{None}} values at most of the fields. Surprisingly not all > fields, only some of them are showing {{None}} instead of the real values. > The table definition does not show anything specific about these columns. 
> Running the same query in Hive: > {noformat} > c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where > year=2016 and month=2 and day=29 limit 3; > +--+---++---+---+---+++-+--++-++-+--++--+ > | opsconsole_ingest.kafkaoffsetgeneration | opsconsole_ingest.kafkapartition > | opsconsole_ingest.kafkaoffset | opsconsole_ingest.uuid | > opsconsole_ingest.mid | opsconsole_ingest.iid | > opsconsole_ingest.product | opsconsole_ingest.utctime | > opsconsole_ingest.statcode | opsconsole_ingest.statvalue | > opsconsole_ingest.displayname | opsconsole_ingest.category | > opsconsole_ingest.source_filename | opsconsole_ingest.year | > opsconsole_ingest.month | opsconsole_ingest.day | > +--+---++---+---+---+++-+--++-++-+--++--+ > | 11.0 | 0.0 > | 3.83399394E8 | EF0D03C409681B98646F316CA1088973 | > 174f53fb-ca9b-d3f9-64e1-7631bf906817 | ---- > | est| 2016-01-13T06:58:19| 8 >
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943 ] Andrew Ash commented on SPARK-22982: [~joshrosen] do you have some example stacktraces of what this bug can cause? Several of our clusters hit what I think is this problem earlier this month, see below for details. For a few days in January (4th through 12th) on our AWS infra, we observed massively degraded disk read throughput (down to 33% of previous peaks). During this time, we also began observing intermittent exceptions coming from Spark at read time of parquet files that a previous Spark job had written. When the read throughput recovered on the 12th, we stopped observing the exceptions and haven't seen them since. At first we observed this stacktrace when reading .snappy.parquet files: {noformat} java.lang.RuntimeException: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486) at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490) ... 31 more Caused by: java.io
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23147: -- Assignee: Saisai Shao > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Resolved] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23147. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20315 [https://github.com/apache/spark/pull/20315] > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis Burke updated SPARK-23151: Summary: Provide a distribution of Spark with Hadoop 3.0 (was: Provide a distribution Spark with Hadoop 3.0) > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk.
[jira] [Created] (SPARK-23151) Provide a distribution Spark with Hadoop 3.0
Louis Burke created SPARK-23151: --- Summary: Provide a distribution Spark with Hadoop 3.0 Key: SPARK-23151 URL: https://issues.apache.org/jira/browse/SPARK-23151 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 2.2.1, 2.2.0 Reporter: Louis Burke Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is that using up to date Kinesis libraries alongside s3 causes a clash w.r.t aws-java-sdk.
[jira] [Resolved] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic
[ https://issues.apache.org/jira/browse/SPARK-23150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23150. --- Resolution: Invalid > kafka job failing assertion failed: Beginning offset 5451 is after the ending > offset 5435 for topic > --- > > Key: SPARK-23150 > URL: https://issues.apache.org/jira/browse/SPARK-23150 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 0.9.0 > Environment: Spark Job failing with below error. Please help > Kafka version 0.10.0.0 > Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most > recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): > java.lang.AssertionError: assertion failed: Beginning offset 5451 is after > the ending offset 5435 for topic project1 partition 2. You either provided an > invalid fromOffset, or the Kafka topic has been damaged at > scala.Predef$.assert(Predef.scala:179) at > org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at > org.apache.spark.scheduler.Task.run(Task.scala:70) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) Driver stacktrace: >Reporter: Balu >Priority: Critical >
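The assertion in this report corresponds to a simple per-partition invariant in the streaming Kafka RDD: the stored fromOffset must not exceed the untilOffset computed for the batch. Checkpointed offsets that outlive a recreated topic, or data trimmed by retention, violate it. A hedged sketch of the check (the function name is illustrative, not Spark's API):

```python
def offset_range_count(from_offset: int, until_offset: int) -> int:
    """Records to read for one topic-partition; mirrors KafkaRDD's assertion."""
    assert from_offset <= until_offset, (
        f"Beginning offset {from_offset} is after the ending offset "
        f"{until_offset}. You either provided an invalid fromOffset, "
        "or the Kafka topic has been damaged")
    return until_offset - from_offset

assert offset_range_count(5435, 5451) == 16  # healthy range: 16 records
```

Passing the offsets from the report in the reversed order (5451 as the beginning, 5435 as the end) trips the assertion with exactly the message seen in the stack trace.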
[jira] [Created] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic
Balu created SPARK-23150: Summary: kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic Key: SPARK-23150 URL: https://issues.apache.org/jira/browse/SPARK-23150 Project: Spark Issue Type: Question Components: Build Affects Versions: 0.9.0 Environment: Spark Job failing with below error. Please help Kafka version 0.10.0.0 Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): java.lang.AssertionError: assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic project1 partition 2. You either provided an invalid fromOffset, or the Kafka topic has been damaged at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: Reporter: Balu
[jira] [Assigned] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23149: Assignee: Wenchen Fan (was: Apache Spark) > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330610#comment-16330610 ] Apache Spark commented on SPARK-23149: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20316 > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23149: Assignee: Apache Spark (was: Wenchen Fan) > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-23149) polish ColumnarBatch
Wenchen Fan created SPARK-23149: --- Summary: polish ColumnarBatch Key: SPARK-23149 URL: https://issues.apache.org/jira/browse/SPARK-23149 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Comment Edited] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558 ] Riccardo Vincelli edited comment on SPARK-19280 at 1/18/18 2:20 PM: What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related but our program: -reads Kafka topics -uses mapWithState on an initialState Thanks, was (Author: rvincelli): What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related. 
Thanks, > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and JobCompleted message is sent to > JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs generated from 16652 - 16670 are generated > and completed. > 5. the message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processed all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. 
JS-EventLoop is scheduled, and processes all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs outside the rememberDuration window. In our case, > ReduceByKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowedDStream (4) + windowSize), so that only RDDs in > (16660, 16670] are kept. And the ClearMetadata processing logic will push > a DoCheckpoint to JobGenerator's thread > 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDDs no later than 16660 have been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/sp
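The window arithmetic in step 7 of the quoted description can be sketched as follows. This is an illustrative model, not Spark's actual `ClearMetadata` code, and it treats the truncated batch timestamps from the report (`***16650` etc.) as plain integers:

```python
def retained_rdd_times(batch_times, last_batch_time, remember_duration):
    """Model of ClearMetadata: keep only RDDs whose batch time falls in
    (last_batch_time - remember_duration, last_batch_time]."""
    low = last_batch_time - remember_duration
    return [t for t in batch_times if low < t <= last_batch_time]

# Batch duration 2, jobs generated for times 16650..16670
batch_times = list(range(16650, 16671, 2))
# rememberDuration = ReducedWindowedDStream's own 4 + windowSize 6 = 10
kept = retained_rdd_times(batch_times, last_batch_time=16670,
                          remember_duration=10)
print(kept)  # only times in (16660, 16670] survive
```

This reproduces the retention claimed in the report: after the batched ClearMetadata messages run, only the last five RDDs remain, so any checkpoint still referring to 16650 points at nothing.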
[jira] [Commented] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558 ] Riccardo Vincelli commented on SPARK-19280: --- What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related. Thanks, > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. 
RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and a JobCompleted message is sent to > JobScheduler's queue; meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs for 16652 - 16670 are generated > and completed. > 5. The message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processes all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. JS-EventLoop is scheduled, and processes all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs outside the rememberDuration window. In our case, > ReduceByKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowedDStream (4) + windowSize), so that only RDDs in > (16660, 16670] are kept. And the ClearMetadata processing logic will push > a DoCheckpoint to JobGenerator's thread > 8.
JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDD no later than 16660 has been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L53 > and > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L59. > 9. After 8, a checkpoint named /path/checkpoint-16670 is created but with > the timestamp 16650. and at this moment, Application crashed > 10. Application recovers from /path/checkpoint-16670 and try to get RDD > with validTime 16650. Of course it will not find it and has to recompute. > In the case of ReduceByKeyAndWindow, it needs to recursively slice RDDs until > to the start of the stream. When the stream depends on the external data, it > will not successfully recover. In the case of Kafka, the recovered RDDs would > not be the same as the original one, as the currentOffsets has been u
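Steps 8-10 of the quoted description boil down to a name/content mismatch: the checkpoint file is named after the time it was written, but recovery reads the timestamp stored *inside* it. A toy model of that mismatch (not Spark's code; truncated timestamps treated as integers):

```python
def recover(checkpoint, available_rdds):
    """Illustrative model of step 10: recovery asks for the RDD at the
    checkpoint's internal timestamp, not the time in the file name."""
    t = checkpoint["timestamp"]
    if t in available_rdds:
        return ("restored", t)
    # The RDD was cleared before the checkpoint was written (steps 7-8),
    # so the job must recompute back to the start of the stream.
    return ("recompute-from-start", t)

# Step 9: file named checkpoint-16670, but carrying timestamp 16650
ckpt = {"name": "checkpoint-16670", "timestamp": 16650}
# After ClearMetadata only RDDs in (16660, 16670] survive (step 7)
rdds = {16662, 16664, 16666, 16668, 16670}
print(recover(ckpt, rdds))
```

With external, periodically-cleaned inputs (Kafka offsets in the report), that recomputation cannot reproduce the original RDDs, which is why the recovery fails outright.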
[jira] [Resolved] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.
[ https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23141. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20307 [https://github.com/apache/spark/pull/20307] > Support data type string as a returnType for registerJavaFunction. > -- > > Key: SPARK-23141 > URL: https://issues.apache.org/jira/browse/SPARK-23141 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type > string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or > {{@pandas_udf}} does. > We can support it for {{UDFRegistration.registerJavaFunction}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
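The improvement lets callers of {{registerJavaFunction}} pass a DDL-formatted type string (e.g. "integer") instead of a DataType object, matching {{register}}, {{@udf}}, and {{@pandas_udf}}. A minimal sketch of the normalization such an API needs; the alias table below is a hypothetical subset, not PySpark's actual type-string parser:

```python
# Hypothetical subset of DDL type-string aliases; PySpark's real parser
# handles the full grammar, including nested types like array<string>.
_TYPE_ALIASES = {
    "int": "integer", "integer": "integer",
    "long": "long", "bigint": "long",
    "string": "string", "double": "double",
}

def resolve_return_type(return_type):
    """Accept either a type object (anything non-str) or a DDL string."""
    if not isinstance(return_type, str):
        return return_type  # already a DataType-like object, pass through
    name = return_type.strip().lower()
    if name not in _TYPE_ALIASES:
        raise ValueError(f"unsupported type string: {return_type!r}")
    return _TYPE_ALIASES[name]

print(resolve_return_type("BigInt"))   # normalized to 'long'
print(resolve_return_type("integer"))  # 'integer'
```

The design point is that string handling is a thin front on top of the existing DataType path, so both spellings register the same UDF.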
[jira] [Assigned] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.
[ https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23141: Assignee: Takuya Ueshin > Support data type string as a returnType for registerJavaFunction. > -- > > Key: SPARK-23141 > URL: https://issues.apache.org/jira/browse/SPARK-23141 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type > string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or > {{@pandas_udf}} does. > We can support it for {{UDFRegistration.registerJavaFunction}} as well.
[jira] [Resolved] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22036. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20023 [https://github.com/apache/spark/pull/20023] > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code}
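The null in the repro comes from decimal precision overflow: Scala BigDecimal columns default to DecimalType(38, 18), and under Spark's SQL-style rules a product's type has precision p1 + p2 + 1 and scale s1 + s2, which blows past the 38-digit cap; before this fix, an overflowing result became null rather than being rounded. A rough model of that rule (simplified; Spark's actual DecimalPrecision logic has more cases):

```python
MAX_PRECISION = 38  # Spark SQL's maximum decimal precision

def multiply_result_type(p1, s1, p2, s2):
    """SQL-style result type for multiplying decimal(p1,s1) * decimal(p2,s2)."""
    return p1 + p2 + 1, s1 + s2

def fits_without_loss(p1, s1, p2, s2):
    """Old behavior: if the exact result type overflows, Spark returned null."""
    precision, _scale = multiply_result_type(p1, s1, p2, s2)
    return precision <= MAX_PRECISION

# Two default Scala BigDecimal columns: DecimalType(38, 18) each
print(multiply_result_type(38, 18, 38, 18))  # (77, 36): far beyond 38 digits
print(fits_without_loss(38, 18, 38, 18))     # False, hence the observed [null]
```

The fix (PR 20023) instead adjusts precision/scale to fit, trading some precision for a non-null answer.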
[jira] [Assigned] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22036: --- Assignee: Marco Gaido > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code}
[jira] [Commented] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330496#comment-16330496 ] Apache Spark commented on SPARK-23147: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/20315 > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
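The failure mode here, summarising a task table with no completed entries, is the classic empty-aggregate pitfall: min/max/percentiles over zero rows throw. A generic guard, sketched in Python rather than the actual Scala UI code:

```python
def task_duration_summary(tasks):
    """Summarise durations of completed tasks; tolerate an empty table."""
    done = [t["duration"] for t in tasks
            if t.get("status") == "SUCCESS" and t.get("duration") is not None]
    if not done:
        # No completed tasks yet: render an empty summary, don't crash
        return {"count": 0, "min": None, "max": None}
    return {"count": len(done), "min": min(done), "max": max(done)}

# The sleep(1) repro: all 20 tasks still running, nothing completed yet
running = [{"status": "RUNNING", "duration": None} for _ in range(20)]
print(task_duration_summary(running))  # count 0, no min/max, no exception
```

The field names and statuses above are illustrative; the point is only that every aggregate on the stage page needs an explicit empty-input branch.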
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23147: Assignee: (was: Apache Spark) > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23147: Assignee: Apache Spark > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Created] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
Bogdan Raducanu created SPARK-23148: --- Summary: spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces Key: SPARK-23148 URL: https://issues.apache.org/jira/browse/SPARK-23148 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bogdan Raducanu Repro code: {code:java} spark.range(10).write.csv("/tmp/a b c/a.csv") spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count 10 spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count java.io.FileNotFoundException: File file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv does not exist {code}
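The error message points at the cause: in multiLine mode the path travels as a URI, and the percent-encoded form (/tmp/a%20b%20c/...) is later handed to the filesystem without being unescaped. A minimal sketch of the missing round-trip; using Python's urllib here is illustrative, not Spark's actual fix:

```python
from urllib.parse import quote, unquote

fs_path = "/tmp/a b c/a.csv"
uri_path = quote(fs_path)  # what the URI form of the path carries
print(uri_path)            # spaces become %20

# Passing uri_path straight to the filesystem is the bug: no file named
# "/tmp/a%20b%20c/a.csv" exists on disk. Decoding first recovers the
# real path that was written:
print(unquote(uri_path) == fs_path)
```

The non-multiLine reader works because it never converts the path through its encoded URI form before opening the file.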
[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23147: Description: Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Below is the screenshot. !Screen Shot 2018-01-18 at 8.50.08 PM.png! Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code. was: Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code. > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Created] (SPARK-23147) Stage page will throw exception when there's no complete tasks
Saisai Shao created SPARK-23147: --- Summary: Stage page will throw exception when there's no complete tasks Key: SPARK-23147 URL: https://issues.apache.org/jira/browse/SPARK-23147 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Reporter: Saisai Shao Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code.
[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23147: Attachment: Screen Shot 2018-01-18 at 8.50.08 PM.png > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.