[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331847#comment-16331847 ] Gaurav edited comment on SPARK-18016 at 1/19/18 7:31 AM: - Saving a DataFrame with 4 columns in snappy.parquet format to HDFS still gives the constant pool exception with the "Spark-core 2.4.0-SNAPSHOT" jar, as shown below: *Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0x* *at org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)* I have a requirement where I have a DataFrame more than 60K columns wide, for which correlation has to be calculated. was (Author: gaurav.garg): Saving a DataFrame with 4 columns in snappy.parquet format to HDFS still gives the constant pool exception with the "Spark-core 2.4.0-SNAPSHOT" jar. I have a requirement where I have a DataFrame more than 60K columns wide, for which correlation has to be calculated.
> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.comp
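The limit the exception refers to is the JVM class-file constant pool: its entry count is stored as an unsigned 16-bit value, so a single generated class can hold at most 0xFFFF (65535) entries. A back-of-the-envelope sketch (plain Python, not Spark code; the entries-per-column figure is an assumption for illustration, not measured from Spark codegen) of why a 60K-column schema overflows one SpecificUnsafeProjection class:

```python
# Back-of-the-envelope sketch, not Spark code: the class-file field
# constant_pool_count is an unsigned 16-bit value, so one generated class
# can hold at most 0xFFFF (65535) constant pool entries.
JVM_CONSTANT_POOL_LIMIT = 0xFFFF

def pool_entries_needed(num_columns, entries_per_column=4):
    """Rough estimate of pool entries for a generated projection.

    entries_per_column is an assumed figure (field refs, method refs,
    string literals per column), not a number measured from Spark.
    """
    return num_columns * entries_per_column

# The 60K-column DataFrame from the comment overflows easily...
assert pool_entries_needed(60_000) > JVM_CONSTANT_POOL_LIMIT
# ...while 4 columns alone should not, which is what makes the reported
# 4-column failure surprising.
assert pool_entries_needed(4) < JVM_CONSTANT_POOL_LIMIT
```

Under this (assumed) per-column cost, anything past roughly 16K columns would exceed the pool in a single class, which is why the fix had to split generated code across classes rather than raise a limit.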
[jira] [Comment Edited] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331847#comment-16331847 ] Gaurav edited comment on SPARK-18016 at 1/19/18 7:29 AM: - saving DataFrame having 4 columns in snappy.parquet format in HDFS still giving Constant pool exception using "Spark-core 2.4.0-SNAPSHOT" jar I have requirement where i am having more than 60K columns wide DataFrame of which correlation has to be calculated. was (Author: gaurav.garg): saving DataFrame having 4 columns in snappy.parquet format in HDFS still giving Constant pool exception using "Spark-core 2.4.0-SNAPSHOT" jar > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > 
org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemb
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331847#comment-16331847 ] Gaurav commented on SPARK-18016: saving DataFrame having 4 columns in snappy.parquet format in HDFS still giving Constant pool exception using "Spark-core 2.4.0-SNAPSHOT" jar > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at 
org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus
[jira] [Commented] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
[ https://issues.apache.org/jira/browse/SPARK-23148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331817#comment-16331817 ] jifei_yang commented on SPARK-23148: If you must do this, I think it is best to escape the path. > spark.read.csv with multiline=true gives FileNotFoundException if path > contains spaces > -- > > Key: SPARK-23148 > URL: https://issues.apache.org/jira/browse/SPARK-23148 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bogdan Raducanu >Priority: Major > > Repro code: > {code:java} > spark.range(10).write.csv("/tmp/a b c/a.csv") > spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count > 10 > spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count > java.io.FileNotFoundException: File > file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv > does not exist > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
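The stack trace shows the literal directory "/tmp/a b c" surfacing as "/tmp/a%20b%20c", i.e. the path has been percent-encoded by a URI layer and then used as a raw filesystem path. A minimal, Spark-free illustration of that mismatch (Python's urllib is used here only to demonstrate the encoding, it is not the library involved in the bug):

```python
from urllib.parse import quote, unquote

# The directory really on disk contains literal spaces.
raw_path = "/tmp/a b c/a.csv"

# Percent-encoding the path once produces exactly the form seen in the
# FileNotFoundException ("/tmp/a%20b%20c/...").
encoded = quote(raw_path)
print(encoded)  # /tmp/a%20b%20c/a.csv

# No directory named "a%20b%20c" exists, so treating the encoded string as
# a literal path fails; decoding it once recovers the real location.
assert unquote(encoded) == raw_path
```

That the non-multiLine reader works on the same path suggests only the multiLine code path feeds the already-encoded form back to the filesystem.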
[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331815#comment-16331815 ] Gera Shegalov commented on SPARK-12963: --- We hit the same issue on nodes where a process is not allowed to listen on all NICs. An easy fix is to make sure that the Driver in ApplicationMaster inherits an explicitly configured public hostname of the NodeManager. > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have a 3-node cluster: namenode, second and data1; > I use this shell command to submit a job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The Driver may be started on another node such as data1. > The problem is: > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, > the driver will be started with this param, such as > SPARK_LOCAL_IP=namenode, > but the driver will start at data1, > the driver will try to bind the IP 'namenode' on data1, > so the driver will throw an exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
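The failure mode above is a single cluster-wide SPARK_LOCAL_IP being inherited by a driver that lands on a different node. A sketch of the spark-env.sh misconfiguration and one common per-node workaround (the workaround is a general pattern, not a fix taken from this ticket):

```sh
# conf/spark-env.sh
# Problematic: every node that ends up launching the driver will try to
# bind the address 'namenode', which fails anywhere except namenode itself.
# SPARK_LOCAL_IP=namenode

# Per-node alternative: resolve each machine's own address instead of
# hard-coding one hostname for the whole cluster.
SPARK_LOCAL_IP=$(hostname -i)
```

Since spark-env.sh is sourced on the node where each process starts, a value computed per node stays correct no matter where the cluster-mode driver is placed.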
[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331811#comment-16331811 ] Joseph K. Bradley commented on SPARK-23154: --- CC [~mlnick], [~yanboliang] or others, what do you think? > Document backwards compatibility guarantees for ML persistence > -- > > Key: SPARK-23154 > URL: https://issues.apache.org/jira/browse/SPARK-23154 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > We have (as far as I know) maintained backwards compatibility for ML > persistence, but this is not documented anywhere. I'd like us to document it > (for spark.ml, not for spark.mllib). > I'd recommend something like: > {quote} > In general, MLlib maintains backwards compatibility for ML persistence. > I.e., if you save an ML model or Pipeline in one version of Spark, then you > should be able to load it back and use it in a future version of Spark. > However, there are rare exceptions, described below. > Model persistence: Is a model or Pipeline saved using Apache Spark ML > persistence in Spark version X loadable by Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Yes; these are backwards compatible. > * Note about the format: There are no guarantees for a stable persistence > format, but model loading itself is designed to be backwards compatible. > Model behavior: Does a model or Pipeline in Spark version X behave > identically in Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Identical behavior, except for bug fixes. > For both model persistence and model behavior, any breaking changes across a > minor version or patch version are reported in the Spark version release > notes. 
If a breakage is not reported in release notes, then it should be > treated as a bug to be fixed. > {quote} > How does this sound? > Note: We unfortunately don't have tests for backwards compatibility (which > has technical hurdles and can be discussed in [SPARK-15573]). However, we > have made efforts to maintain it during PR review and Spark release QA, and > most users expect it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331799#comment-16331799 ] Apache Spark commented on SPARK-12963: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/20327 > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have a 3-node cluster: namenode, second and data1; > I use this shell command to submit a job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The Driver may be started on another node such as data1. > The problem is: > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode, > the driver will be started with this param, such as > SPARK_LOCAL_IP=namenode, > but the driver will start at data1, > the driver will try to bind the IP 'namenode' on data1, > so the driver will throw an exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23155: Assignee: (was: Apache Spark) > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Priority: Major > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23155: Assignee: Apache Spark > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Assignee: Apache Spark >Priority: Major > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331776#comment-16331776 ] Apache Spark commented on SPARK-23155: -- User 'gerashegalov' has created a pull request for this issue: https://github.com/apache/spark/pull/20326 > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Priority: Major > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22808) saveAsTable() should be marked as deprecated
[ https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22808: Assignee: Apache Spark > saveAsTable() should be marked as deprecated > > > Key: SPARK-22808 > URL: https://issues.apache.org/jira/browse/SPARK-22808 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Jason Vaccaro >Assignee: Apache Spark >Priority: Major > > As discussed in SPARK-16803, saveAsTable is not supported as a method for > writing to Hive and insertInto should be used instead. However, on the java > api documentation for version 2.1.1, the saveAsTable method is not marked as > deprecated and the programming guides indicate that saveAsTable is the proper > way to write to Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22808) saveAsTable() should be marked as deprecated
[ https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22808: Assignee: (was: Apache Spark) > saveAsTable() should be marked as deprecated > > > Key: SPARK-22808 > URL: https://issues.apache.org/jira/browse/SPARK-22808 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Jason Vaccaro >Priority: Major > > As discussed in SPARK-16803, saveAsTable is not supported as a method for > writing to Hive and insertInto should be used instead. However, on the java > api documentation for version 2.1.1, the saveAsTable method is not marked as > deprecated and the programming guides indicate that saveAsTable is the proper > way to write to Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22808) saveAsTable() should be marked as deprecated
[ https://issues.apache.org/jira/browse/SPARK-22808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331773#comment-16331773 ] Apache Spark commented on SPARK-22808: -- User 'brandonJY' has created a pull request for this issue: https://github.com/apache/spark/pull/20325 > saveAsTable() should be marked as deprecated > > > Key: SPARK-22808 > URL: https://issues.apache.org/jira/browse/SPARK-22808 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.1.1 >Reporter: Jason Vaccaro >Priority: Major > > As discussed in SPARK-16803, saveAsTable is not supported as a method for > writing to Hive and insertInto should be used instead. However, on the java > api documentation for version 2.1.1, the saveAsTable method is not marked as > deprecated and the programming guides indicate that saveAsTable is the proper > way to write to Hive. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
Gera Shegalov created SPARK-23155: - Summary: YARN-aggregated executor/driver logs appear unavailable when NM is down Key: SPARK-23155 URL: https://issues.apache.org/jira/browse/SPARK-23155 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov Unlike the MapReduce JobHistory Server, the Spark history server isn't rewriting container log URLs to point to the aggregated yarn.log.server.url location and relies on the NodeManager web UI to trigger a redirect. This fails when the NM is down. Note that the NM may be down permanently after decommissioning in traditional environments, or in a cloud environment such as AWS EMR where either worker nodes are taken away by autoscaling or the whole cluster is used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
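What "rewriting container log URLs" could look like: point a container log link at yarn.log.server.url instead of the possibly-dead NodeManager, following the path layout the NM redirect conventionally uses (<log-server>/<nodeId>/<containerId>/<containerId>/<user>). The hostnames and IDs below are made up for illustration, and the layout is an assumption about the YARN convention rather than code from a Spark patch:

```python
def rewrite_log_url(log_server_url, node_id, container_id, user):
    """Point a container log link at the aggregated log server instead of
    the (possibly dead) NodeManager web UI."""
    base = log_server_url.rstrip("/")
    # The container id appears twice in the JHS-style layout: once as the
    # aggregation key and once as the log entity being viewed.
    return f"{base}/{node_id}/{container_id}/{container_id}/{user}"

url = rewrite_log_url(
    "http://historyserver:19888/jobhistory/logs",   # yarn.log.server.url
    "worker1:8041",                                 # NM that may be gone
    "container_1516000000000_0001_01_000002",
    "hadoop",
)
print(url)
```

Building the aggregated URL directly in the history server removes the dependency on the NM being alive to serve the redirect, which is the gap this ticket describes.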
[jira] [Commented] (SPARK-22977) DataFrameWriter operations do not show details in SQL tab
[ https://issues.apache.org/jira/browse/SPARK-22977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331754#comment-16331754 ] Yuming Wang commented on SPARK-22977: - cc [~cloud_fan] > DataFrameWriter operations do not show details in SQL tab > - > > Key: SPARK-22977 > URL: https://issues.apache.org/jira/browse/SPARK-22977 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.3.0 >Reporter: Yuming Wang >Priority: Major > Attachments: after.png, before.png > > > When running CreateHiveTableAsSelectCommand or InsertIntoHiveTable, the SQL tab doesn't > show details after > [SPARK-20213|https://issues.apache.org/jira/browse/SPARK-20213]. > *Before*: > !before.png! > *After*: > !after.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23097) Migrate text socket source
[ https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331735#comment-16331735 ] Saisai Shao commented on SPARK-23097: - Thanks, will do. > Migrate text socket source > -- > > Key: SPARK-23097 > URL: https://issues.apache.org/jira/browse/SPARK-23097 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23097) Migrate text socket source
[ https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331726#comment-16331726 ] Jose Torres commented on SPARK-23097: - Certainly! There's no specific plan right now. I'm working on a list of pointers for migrating sources; shoot me an email if you want a link to a rough draft. > Migrate text socket source > -- > > Key: SPARK-23097 > URL: https://issues.apache.org/jira/browse/SPARK-23097 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23097) Migrate text socket source
[ https://issues.apache.org/jira/browse/SPARK-23097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331715#comment-16331715 ] Saisai Shao commented on SPARK-23097: - Hi [~joseph.torres], would you mind if I take a shot on this issue (in case you don't have a plan on it) :). > Migrate text socket source > -- > > Key: SPARK-23097 > URL: https://issues.apache.org/jira/browse/SPARK-23097 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Jose Torres >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23054) Incorrect results of casting UserDefinedType to String
[ https://issues.apache.org/jira/browse/SPARK-23054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-23054. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.3.0 > Incorrect results of casting UserDefinedType to String > -- > > Key: SPARK-23054 > URL: https://issues.apache.org/jira/browse/SPARK-23054 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 2.3.0 > > > {code} > >>> from pyspark.ml.classification import MultilayerPerceptronClassifier > >>> from pyspark.ml.linalg import Vectors > >>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, > >>> Vectors.dense([0.0, 1.0]))], ["label", "features"]) > >>> df.selectExpr("CAST(features AS STRING)").show(truncate = False) > +---+ > |features | > +---+ > |[6,1,0,0,280020,2,0,0,0] | > |[6,1,0,0,280020,2,0,0,3ff0]| > +---+ > {code}
[jira] [Updated] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
[ https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23154: -- Description: We have (as far as I know) maintained backwards compatibility for ML persistence, but this is not documented anywhere. I'd like us to document it (for spark.ml, not for spark.mllib). I'd recommend something like: {quote} In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below. Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Yes; these are backwards compatible. * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible. Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Identical behavior, except for bug fixes. For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed. {quote} How does this sound? Note: We unfortunately don't have tests for backwards compatibility (which has technical hurdles and can be discussed in [SPARK-15573]). However, we have made efforts to maintain it during PR review and Spark release QA, and most users expect it. was: We have (as far as I know) maintained backwards compatibility for ML persistence, but this is not documented anywhere. I'd like us to document it (for spark.ml, not for spark.mllib). 
I'd recommend something like: {quote} In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below. Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Yes; these are backwards compatible. * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible. Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Identical behavior, except for bug fixes. For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed. {quote} How does this sound? > Document backwards compatibility guarantees for ML persistence > -- > > Key: SPARK-23154 > URL: https://issues.apache.org/jira/browse/SPARK-23154 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Major > > We have (as far as I know) maintained backwards compatibility for ML > persistence, but this is not documented anywhere. I'd like us to document it > (for spark.ml, not for spark.mllib). > I'd recommend something like: > {quote} > In general, MLlib maintains backwards compatibility for ML persistence. > I.e., if you save an ML model or Pipeline in one version of Spark, then you > should be able to load it back and use it in a future version of Spark. 
> However, there are rare exceptions, described below. > Model persistence: Is a model or Pipeline saved using Apache Spark ML > persistence in Spark version X loadable by Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Yes; these are backwards compatible. > * Note about the format: There are no guarantees for a stable persistence > format, but model loading itself is designed to be backwards compatible. > Model behavior: Does a model or Pipeline in Spark version X behave > identically in Spark version Y? > * Major versions: No guarantees, but best-effort. > * Minor and patch versions: Identical behavior, except for bug fixes. > For both model persistence and model behavior, any brea
[jira] [Created] (SPARK-23154) Document backwards compatibility guarantees for ML persistence
Joseph K. Bradley created SPARK-23154: - Summary: Document backwards compatibility guarantees for ML persistence Key: SPARK-23154 URL: https://issues.apache.org/jira/browse/SPARK-23154 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley We have (as far as I know) maintained backwards compatibility for ML persistence, but this is not documented anywhere. I'd like us to document it (for spark.ml, not for spark.mllib). I'd recommend something like: {quote} In general, MLlib maintains backwards compatibility for ML persistence. I.e., if you save an ML model or Pipeline in one version of Spark, then you should be able to load it back and use it in a future version of Spark. However, there are rare exceptions, described below. Model persistence: Is a model or Pipeline saved using Apache Spark ML persistence in Spark version X loadable by Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Yes; these are backwards compatible. * Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible. Model behavior: Does a model or Pipeline in Spark version X behave identically in Spark version Y? * Major versions: No guarantees, but best-effort. * Minor and patch versions: Identical behavior, except for bug fixes. For both model persistence and model behavior, any breaking changes across a minor version or patch version are reported in the Spark version release notes. If a breakage is not reported in release notes, then it should be treated as a bug to be fixed. {quote} How does this sound?
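The guarantee matrix proposed above can be captured as a tiny rule table. The sketch below is purely illustrative (Spark exposes no such function; the name and return labels are invented here), but it encodes the quoted guarantees: best-effort across major versions, backwards compatible within one.

```python
def persistence_guarantee(saved_in: str, loaded_in: str) -> str:
    """Classify the ML-persistence guarantee proposed above for a model
    saved under Spark version `saved_in` and loaded under `loaded_in`.
    Illustrative only; this is not a Spark API."""
    saved_major = int(saved_in.split(".")[0])
    loaded_major = int(loaded_in.split(".")[0])
    if loaded_major != saved_major:
        # Across major versions: no guarantees, but best-effort.
        return "best-effort"
    # Within a major version, minor/patch releases are backwards compatible;
    # any breakage should appear in release notes or be treated as a bug.
    return "compatible"
```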
[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23091: Assignee: Apache Spark > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Assignee: Apache Spark >Priority: Minor > > Currently, the test for `approxQuantile` (a quantile estimation algorithm) checks > whether the estimated quantile is within +- 2*`relativeError` of the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` of the true > quantile. Using double the tolerance is misleading and incorrect, and we > should fix it. >
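The guarantee at issue can be checked directly: for a dataset of n values, a Greenwald-Khanna estimate of quantile q must have rank within +- relativeError * n of the target rank q * n, with no factor of 2. A self-contained sketch (helper name invented here; this is not Spark's DataFrameStatSuite code):

```python
def within_gk_bound(data, estimate, quantile, relative_error):
    """Return True iff `estimate` satisfies the Greenwald-Khanna rank
    guarantee for `quantile`:
        |rank(estimate) - quantile * n| <= relative_error * n
    Illustrative helper, not Spark's test code."""
    values = sorted(data)
    n = len(values)
    target_rank = quantile * n
    # Rank of the estimate = number of elements <= it.
    rank = sum(1 for v in values if v <= estimate)
    return abs(rank - target_rank) <= relative_error * n
```

A check using 2 * relative_error would also accept estimates twice as far from the target rank, which is exactly the looseness the report objects to.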
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331542#comment-16331542 ] Saisai Shao commented on SPARK-23128: - I'm wondering if we need a SPIP for such changes? > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However, there are > some limitations. First, it may cause additional shuffles that can decrease > performance. We can see this from the EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don't have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e., for > a three-table join in a single stage, the same ExchangeCoordinator should be used > for all three Exchanges, but currently two separate ExchangeCoordinators are > added. Thirdly, with the current framework it is not easy to flexibly implement other > features in adaptive execution, like changing the execution plan and > handling skewed joins at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]
[jira] [Assigned] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23091: Assignee: (was: Apache Spark) > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Priority: Minor > > Currently, the test for `approxQuantile` (a quantile estimation algorithm) checks > whether the estimated quantile is within +- 2*`relativeError` of the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` of the true > quantile. Using double the tolerance is misleading and incorrect, and we > should fix it. >
[jira] [Commented] (SPARK-23091) Incorrect unit test for approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-23091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331541#comment-16331541 ] Apache Spark commented on SPARK-23091: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/20324 > Incorrect unit test for approxQuantile > -- > > Key: SPARK-23091 > URL: https://issues.apache.org/jira/browse/SPARK-23091 > Project: Spark > Issue Type: Improvement > Components: ML, Tests >Affects Versions: 2.2.1 >Reporter: Kuang Chen >Priority: Minor > > Currently, the test for `approxQuantile` (a quantile estimation algorithm) checks > whether the estimated quantile is within +- 2*`relativeError` of the true > quantile. See the code below: > [https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala#L157] > However, based on the original paper by Greenwald and Khanna, the estimated > quantile is guaranteed to be within +- `relativeError` of the true > quantile. Using double the tolerance is misleading and incorrect, and we > should fix it. >
[jira] [Resolved] (SPARK-23142) Add documentation for Continuous Processing
[ https://issues.apache.org/jira/browse/SPARK-23142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23142. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20308 [https://github.com/apache/spark/pull/20308] > Add documentation for Continuous Processing > --- > > Key: SPARK-23142 > URL: https://issues.apache.org/jira/browse/SPARK-23142 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > >
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22962: -- Assignee: Yinan Li > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Yinan Li >Priority: Major > Fix For: 2.3.0 > > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22962. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20320 [https://github.com/apache/spark/pull/20320] > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Yinan Li >Priority: Major > Fix For: 2.3.0 > > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
[jira] [Resolved] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23094. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20302 [https://github.com/apache/spark/pull/20302] > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters.
[jira] [Assigned] (SPARK-23094) Json Readers choose wrong encoding when bad records are present and fail
[ https://issues.apache.org/jira/browse/SPARK-23094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23094: Assignee: Burak Yavuz > Json Readers choose wrong encoding when bad records are present and fail > > > Key: SPARK-23094 > URL: https://issues.apache.org/jira/browse/SPARK-23094 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Major > Fix For: 2.3.0 > > > The cases described in SPARK-16548 and SPARK-20549 handled the JsonParser > code paths for expressions but not the readers. We should also cover reader > code paths reading files with bad characters.
[jira] [Updated] (SPARK-22216) Improving PySpark/Pandas interoperability
[ https://issues.apache.org/jira/browse/SPARK-22216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Jin updated SPARK-22216: --- Description: This is an umbrella ticket tracking the general effort to improve performance and interoperability between PySpark and Pandas. The core idea is to use Apache Arrow as the serialization format to reduce the overhead between PySpark and Pandas. (was: This is an umbrella ticket tracking the general effect of improving performance and interoperability between PySpark and Pandas. The core idea is to Apache Arrow as serialization format to reduce the overhead between PySpark and Pandas.) > Improving PySpark/Pandas interoperability > - > > Key: SPARK-22216 > URL: https://issues.apache.org/jira/browse/SPARK-22216 > Project: Spark > Issue Type: Epic > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Li Jin >Assignee: Li Jin >Priority: Major > > This is an umbrella ticket tracking the general effort to improve performance > and interoperability between PySpark and Pandas. The core idea is to use Apache > Arrow as the serialization format to reduce the overhead between PySpark and > Pandas.
[jira] [Assigned] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23133: -- Assignee: Andrew Korzhuev > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Assignee: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23133. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20322 [https://github.com/apache/spark/pull/20322] > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Fix For: 2.3.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23153) Support application dependencies in submission client's local file system
Yinan Li created SPARK-23153: Summary: Support application dependencies in submission client's local file system Key: SPARK-23153 URL: https://issues.apache.org/jira/browse/SPARK-23153 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 2.4.0 Reporter: Yinan Li
[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331241#comment-16331241 ] Apache Spark commented on SPARK-23133: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/20322 > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
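The issue above quotes the pipeline that entrypoint.sh uses to turn `SPARK_JAVA_OPT_*` environment variables into driver JVM options; the fix adds an analogous path for the executor. A minimal sketch of that extraction pattern (the variable values here are examples, and `sort` is added only for deterministic output):

```shell
#!/bin/sh
# Simulate the entrypoint.sh pattern: collect every SPARK_JAVA_OPT_* env var
# and strip the "NAME=" prefix, leaving only the JVM option values.
SPARK_JAVA_OPT_0='-Djava.security.auth.login.config=./jaas.conf'
SPARK_JAVA_OPT_1='-Xmx2g'
export SPARK_JAVA_OPT_0 SPARK_JAVA_OPT_1

# Same env | grep | sed pipeline the issue quotes; sed removes everything up
# to the first '=', so values that themselves contain '=' survive intact.
opts=$(env | grep SPARK_JAVA_OPT_ | sort | sed 's/[^=]*=\(.*\)/\1/g')
echo "$opts"
```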
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-19185: -- Assignee: Marcelo Vanzin > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Assignee: Marcelo Vanzin >Priority: Major > Labels: streaming, windowing > > We've been running into ConcurrentModificationExceptions "KafkaConsumer is > not safe for multi-threaded access" with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > Spark source code I think this is a bug. > Our set up is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd
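The race described above can be reproduced outside Spark. The following Python sketch (illustrative only, not Kafka's actual implementation) mimics KafkaConsumer's single-threaded-access guard: the consumer records which thread is currently using it and rejects any other thread, which is exactly what happens when two executor task threads share one CachedKafkaConsumer for the same Topic+Partition.

```python
import threading

class SingleThreadConsumer:
    """Illustrative sketch: like KafkaConsumer.acquire(), refuse access
    from any thread other than the one currently using the consumer."""

    def __init__(self):
        self._owner = None            # ident of the thread currently inside
        self._lock = threading.Lock()

    def acquire(self):
        me = threading.get_ident()
        with self._lock:
            if self._owner is not None and self._owner != me:
                # The real KafkaConsumer raises ConcurrentModificationException here
                raise RuntimeError("KafkaConsumer is not safe for multi-threaded access")
            self._owner = me

    def release(self):
        with self._lock:
            self._owner = None

# Simulate the timeline from the report: two task threads on one executor
# handed the same cached consumer for TopicPartition("abc", 0).
consumer = SingleThreadConsumer()
barrier = threading.Barrier(2)
errors = []

def task0():
    consumer.acquire()    # Task0 seeks and starts to poll...
    barrier.wait()        # ...and is still mid-poll here
    barrier.wait()
    consumer.release()

def task1():
    barrier.wait()        # wait until Task0 holds the consumer
    try:
        consumer.acquire()  # Task1's seek fails, as in the logs above
    except RuntimeError as exc:
        errors.append(str(exc))
    barrier.wait()

t0 = threading.Thread(target=task0)
t1 = threading.Thread(target=task1)
t0.start(); t1.start(); t0.join(); t1.join()
print(errors)  # ['KafkaConsumer is not safe for multi-threaded access']
```

The barriers make the interleaving deterministic: Task1 only attempts its seek while Task0 is between acquire and release, so the second thread always hits the guard.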
[jira] [Assigned] (SPARK-19185) ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing
[ https://issues.apache.org/jira/browse/SPARK-19185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-19185: -- Assignee: (was: Marcelo Vanzin) > ConcurrentModificationExceptions with CachedKafkaConsumers when Windowing > - > > Key: SPARK-19185 > URL: https://issues.apache.org/jira/browse/SPARK-19185 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: Spark 2.0.2 > Spark Streaming Kafka 010 > Mesos 0.28.0 - client mode > spark.executor.cores 1 > spark.mesos.extra.cores 1 >Reporter: Kalvin Chau >Priority: Major > Labels: streaming, windowing > > We've been running into ConcurrentModificationExceptions ("KafkaConsumer is > not safe for multi-threaded access") with the CachedKafkaConsumer. I've been > working through debugging this issue and after looking through some of the > Spark source code I think this is a bug. > Our setup is: > Spark 2.0.2, running in Mesos 0.28.0-2 in client mode, using > Spark-Streaming-Kafka-010 > spark.executor.cores 1 > spark.mesos.extra.cores 1 > Batch interval: 10s, window interval: 180s, and slide interval: 30s > We would see the exception when in one executor there are two task worker > threads assigned the same Topic+Partition, but a different set of offsets. > They would both get the same CachedKafkaConsumer, and whichever task thread > went first would seek and poll for all the records, and at the same time the > second thread would try to seek to its offset but fail because it is unable > to acquire the lock. 
> Time0 E0 Task0 - TopicPartition("abc", 0) X to Y > Time0 E0 Task1 - TopicPartition("abc", 0) Y to Z > Time1 E0 Task0 - Seeks and starts to poll > Time1 E0 Task1 - Attempts to seek, but fails > Here are some relevant logs: > {code} > 17/01/06 03:10:01 Executor task launch worker-1 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394204414 -> 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO KafkaRDD: Computing > topic test-topic, partition 2 offsets 4394238058 -> 4394257712 > 17/01/06 03:10:01 Executor task launch worker-1 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394204414 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Get spark-executor-consumer test-topic 2 nextOffset 4394204414 requested > 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 INFO CachedKafkaConsumer: > Initial fetch for spark-executor-consumer test-topic 2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 DEBUG CachedKafkaConsumer: > Seeking to test-topic-2 4394238058 > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Putting > block rdd_199_2 failed due to an exception > 17/01/06 03:10:01 Executor task launch worker-0 WARN BlockManager: Block > rdd_199_2 could not be removed as it was not found on disk or in memory > 17/01/06 03:10:01 Executor task launch worker-0 ERROR Executor: Exception in > task 49.0 in stage 45.0 (TID 3201) > java.util.ConcurrentModificationException: KafkaConsumer is not safe for > multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1132) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:360) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:951) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23152: Assignee: Apache Spark > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Assignee: Apache Spark >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input 
data is empty the result "maxLabelRow" array is not empty. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
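The behavior behind this bug is standard SQL aggregate semantics: `max` over an empty relation does not return zero rows, it returns exactly one row whose value is NULL, so `take(1)` is never empty. A small sketch using Python's built-in sqlite3 (not Spark, but the aggregate semantics match) demonstrates it, along with the shape of the proposed extra null check:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE labels (label REAL)")   # empty table, like the empty Dataset

# max() over an empty table yields one row containing NULL, not zero rows.
# This is why maxLabelRow.isEmpty alone never catches the empty-input case.
rows = con.execute("SELECT max(label) FROM labels").fetchall()
print(rows)  # [(None,)]

# Shape of the proposed fix: also inspect the value inside the single row.
max_label_row = rows
detected_empty = False
if not max_label_row or max_label_row[0][0] is None:
    detected_empty = True   # corresponds to throwing SparkException
print(detected_empty)  # True
```

With the value check added, the empty-dataset case raises the intended "ML algorithm was given empty dataset" error instead of a NullPointerException deep inside `Row.getDouble`.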
[jira] [Assigned] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23152: Assignee: (was: Apache Spark) > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if (maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the 
result "maxLabelRow" array is not empty. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331194#comment-16331194 ] Apache Spark commented on SPARK-23152: -- User 'tovbinm' has created a pull request for this issue: https://github.com/apache/spark/pull/20321 > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} > > The problem happens due to an incorrect guard condition in function > getNumClasses at org.apache.spark.ml.classification.Classifier:106 > {code:java} > val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) > if 
(maxLabelRow.isEmpty) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > When the input data is empty the result "maxLabelRow" array is not empty. Instead > it contains a single Row(null) element. > > Proposed solution: the condition can be modified to verify that as well. > {code:java} > if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { > throw new SparkException("ML algorithm was given empty dataset.") > } > {code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22962: Assignee: (was: Apache Spark) > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331185#comment-16331185 ] Apache Spark commented on SPARK-22962: -- User 'liyinan926' has created a pull request for this issue: https://github.com/apache/spark/pull/20320 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22962: Assignee: Apache Spark > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
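The workaround mentioned in the report ("using an http server to provide the app jar") can be sketched in a few lines. This is an illustration only: the jar name and contents are made up, and the real deployment would point `spark-submit` at the served URL instead of the local path the k8s backend cannot ship.

```python
import http.server
import pathlib
import threading
import urllib.request

# Stand-in for the real application jar that spark-submit cannot ship
# to the driver pod when given as a local path.
jar = pathlib.Path("file.jar")
jar.write_bytes(b"dummy jar contents")

# Serve the current directory over HTTP; port 0 lets the OS pick a free port.
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# What the driver pod would do: fetch the jar by URL instead of local path.
url = f"http://127.0.0.1:{server.server_port}/{jar.name}"
fetched = urllib.request.urlopen(url).read()
server.shutdown()
jar.unlink()

print(fetched == b"dummy jar contents")  # True
# spark-submit is then invoked with the URL in place of the local path:
#   ./bin/spark-submit [[bunch of arguments]] http://<host>:<port>/file.jar
```

This sidesteps the failure because the driver container downloads the resource itself rather than relying on the submission client to mount a file that only exists on the submitting machine.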
[jira] [Resolved] (SPARK-23144) Add console sink for continuous queries
[ https://issues.apache.org/jira/browse/SPARK-23144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23144. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20311 [https://github.com/apache/spark/pull/20311] > Add console sink for continuous queries > --- > > Key: SPARK-23144 > URL: https://issues.apache.org/jira/browse/SPARK-23144 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in function getNumClasses at org.apache.spark.ml.classification.Classifier:106 {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a misleading NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a misleading > NullPointerException is thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.De
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the result "maxLabelRow" array is not. Instead it contains a single Row(null) element. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClass
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22884: Assignee: (was: Apache Spark) > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || maxLabelRow(0).get(0) == null) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTree
[jira] [Commented] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331157#comment-16331157 ] Apache Spark commented on SPARK-22884: -- User 'smurakozi' has created a pull request for this issue: https://github.com/apache/spark/pull/20319 > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22884: Assignee: Apache Spark > ML test for StructuredStreaming: spark.ml.clustering > > > Key: SPARK-22884 > URL: https://issues.apache.org/jira/browse/SPARK-22884 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331153#comment-16331153 ] Anirudh Ramanathan commented on SPARK-22962: I think this isn't super critical for this release, mostly a usability thing. If it's small enough, it makes sense, but if it introduces risk and we have to redo manual testing, I'd vote against getting this into 2.3. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. 
> The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
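The report above notes that serving the app jar over an http server sidesteps the problem. A hedged sketch of that workaround follows (it is not a documented Spark procedure; the `app.jar` name and the use of Python's stdlib HTTP server are illustrative assumptions): the jar is served from the submitting machine so the driver pod can fetch it by URL instead of receiving a client-local path that never gets shipped.

```python
# Hedged sketch of the http-server workaround: serve the application jar
# from the submitting machine so the driver pod can download it by URL.
# The jar name and server choice are illustrative, not from the issue.
import http.server
import os
import tempfile
import threading
import urllib.request

os.chdir(tempfile.mkdtemp())
open("app.jar", "wb").close()   # stand-in for the real application jar

# Bind port 0 so the OS picks a free port; serve the current directory.
server = http.server.ThreadingHTTPServer(
    ("", 0), http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# spark-submit would then be given http://<submitting-host>:<port>/app.jar
# as the app resource (the remaining arguments are cluster-specific and
# elided here, as in the report).
url = "http://localhost:%d/app.jar" % server.server_address[1]
print(urllib.request.urlopen(url).status)  # prints 200
server.shutdown()
```

In a real submission the server must of course be reachable from inside the cluster, which is exactly the usability gap the comments discuss.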
[jira] [Resolved] (SPARK-23143) Add Python support for continuous trigger
[ https://issues.apache.org/jira/browse/SPARK-23143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23143. --- Resolution: Fixed Fix Version/s: 2.3.0 3.0.0 Issue resolved by pull request 20309 [https://github.com/apache/spark/pull/20309] > Add Python support for continuous trigger > - > > Key: SPARK-23143 > URL: https://issues.apache.org/jira/browse/SPARK-23143 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0, 2.3.0 > >
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. Instead it contains a single null element WrappedArray(null). Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifie
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331141#comment-16331141 ] Michael Armbrust commented on SPARK-20928: -- There is more work to do so I might leave the umbrella open, but we are going to have an initial version that supports reading and writing from/to Kafka with very low latency in Spark 2.3! Stay tuned for updates to docs and blog posts. > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. 
Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. */ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism).
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331140#comment-16331140 ] Marcelo Vanzin commented on SPARK-22962: Yes send a PR. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331132#comment-16331132 ] Yinan Li commented on SPARK-22962: -- I agree that before we upstream the staging server, we should fail the submission if a user uses local resources. [~vanzin], if it's not too late to get into 2.3, I'm gonna file a PR for this. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty, the "maxLabelRow" array is not empty. Instead it contains a single null element, WrappedArray(null). Proposed solution: the condition can be modified to also check for that null element. 
{code:java} if (maxLabelRow.isEmpty || (maxLabelRow.size == 1 && maxLabelRow(0) == null)) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. 
{code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. > Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(De
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Target Version/s: 2.4.0 > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend.
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-22962: --- Affects Version/s: (was: 2.4.0) 2.3.0 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
[jira] [Updated] (SPARK-23146) Support client mode for Kubernetes cluster backend
[ https://issues.apache.org/jira/browse/SPARK-23146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-23146: --- Affects Version/s: (was: 2.4.0) 2.3.0 > Support client mode for Kubernetes cluster backend > -- > > Key: SPARK-23146 > URL: https://issues.apache.org/jira/browse/SPARK-23146 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Major > > This issue tracks client mode support within Spark when running in the > Kubernetes cluster backend.
[jira] [Updated] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan updated SPARK-22962: --- Affects Version/s: (was: 2.3.0) 2.4.0 > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available.
[jira] [Commented] (SPARK-22962) Kubernetes app fails if local files are used
[ https://issues.apache.org/jira/browse/SPARK-22962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331085#comment-16331085 ] Anirudh Ramanathan commented on SPARK-22962: This is the resource staging server use-case. We'll upstream this in the 2.4.0 timeframe. > Kubernetes app fails if local files are used > > > Key: SPARK-22962 > URL: https://issues.apache.org/jira/browse/SPARK-22962 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Marcelo Vanzin >Priority: Major > > If you try to start a Spark app on kubernetes using a local file as the app > resource, for example, it will fail: > {code} > ./bin/spark-submit [[bunch of arguments]] /path/to/local/file.jar > {code} > {noformat} > + /sbin/tini -s -- /bin/sh -c 'SPARK_CLASSPATH="${SPARK_HOME}/jars/*" && > env | grep SPARK_JAVA_OPT_ | sed '\''s/[^=]*=\(.*\)/\1/g' > \'' > /tmp/java_opts.txt && readarray -t SPARK_DRIVER_JAVA_OPTS < > /tmp/java_opts.txt && if ! [ -z ${SPARK_MOUNTED_CLASSPATH+x} > ]; then SPARK_CLASSPATH="$SPARK_MOUNTED_CLASSPATH:$SPARK_CLASSPATH"; fi && > if ! [ -z ${SPARK_SUBMIT_EXTRA_CLASSPATH+x} ]; then SP > ARK_CLASSPATH="$SPARK_SUBMIT_EXTRA_CLASSPATH:$SPARK_CLASSPATH"; fi && if > ! [ -z ${SPARK_MOUNTED_FILES_DIR+x} ]; then cp -R "$SPARK > _MOUNTED_FILES_DIR/." .; fi && ${JAVA_HOME}/bin/java > "${SPARK_DRIVER_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" -Xms$SPARK_DRIVER_MEMOR > Y -Xmx$SPARK_DRIVER_MEMORY > -Dspark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS $SPARK_DRIVER_CLASS > $SPARK_DRIVER_ARGS' > Error: Could not find or load main class com.cloudera.spark.tests.Sleeper > {noformat} > Using an http server to provide the app jar solves the problem. > The k8s backend should either somehow make these files available to the > cluster or error out with a more user-friendly message if that feature is not > yet available. 
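The workaround noted in the issue above (serve the app jar over HTTP rather than referencing a local path) can be sketched as follows. This is only an illustration under assumptions: the host, port, jar name, and main class are all placeholders, and any web server reachable from the cluster would do.

```shell
# Serve the directory containing the app jar over HTTP (port 8000 is a placeholder).
# Python's built-in http.server is one simple option; any web server works.
cd /path/to/jar/dir && python3 -m http.server 8000 &

# Then point spark-submit at the HTTP URL instead of a local file path:
./bin/spark-submit \
  --master k8s://https://<api-server> \
  --deploy-mode cluster \
  --class com.example.Main \
  http://<host-reachable-from-cluster>:8000/file.jar
```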
[jira] [Resolved] (SPARK-23016) Spark UI access and documentation
[ https://issues.apache.org/jira/browse/SPARK-23016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anirudh Ramanathan resolved SPARK-23016. Resolution: Fixed This is resolved and we've verified it for 2.3.0. > Spark UI access and documentation > - > > Key: SPARK-23016 > URL: https://issues.apache.org/jira/browse/SPARK-23016 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Minor > > We should have instructions to access the spark driver UI, or instruct users > to create a service to expose it. > Also might need an integration test to verify that the driver UI works as > expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
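The two access paths mentioned in SPARK-23016 can be sketched with standard kubectl commands. This is a hedged illustration, not the documentation the issue asks for; the pod and service names are placeholders, and 4040 is the driver UI's default port.

```shell
# Option 1: forward a local port to the driver pod (pod name is a placeholder).
kubectl port-forward <driver-pod-name> 4040:4040
# The UI is then reachable at http://localhost:4040

# Option 2: create a Service exposing the driver pod (names are placeholders).
kubectl expose pod <driver-pod-name> --name=spark-ui --port=4040 --target-port=4040
```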
[jira] [Comment Edited] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081 ] Anirudh Ramanathan edited comment on SPARK-23082 at 1/18/18 7:52 PM: - This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. [~mcheah] was (Author: foxish): This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. > Allow separate node selectors for driver and executors in Kubernetes > > > Key: SPARK-23082 > URL: https://issues.apache.org/jira/browse/SPARK-23082 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.2.0, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark > driver to a different set of nodes from its executors. In Kubernetes, we can > specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use > separate options for the driver and executors. > This would be useful for the particular use case where executors can go on > more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot > instances), but the driver should use a more persistent machine. > The required change would be minimal, essentially just using different config > keys for the > [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] > and > [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] > instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both. 
[jira] [Commented] (SPARK-23082) Allow separate node selectors for driver and executors in Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-23082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331081#comment-16331081 ] Anirudh Ramanathan commented on SPARK-23082: This is an interesting feature request. We had some discussion about this. I think it's a bit too late for a feature request in 2.3, so, we can revisit this in the 2.4 timeframe. > Allow separate node selectors for driver and executors in Kubernetes > > > Key: SPARK-23082 > URL: https://issues.apache.org/jira/browse/SPARK-23082 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Submit >Affects Versions: 2.2.0, 2.3.0 >Reporter: Oz Ben-Ami >Priority: Minor > > In YARN, we can use {{spark.yarn.am.nodeLabelExpression}} to submit the Spark > driver to a different set of nodes from its executors. In Kubernetes, we can > specify {{spark.kubernetes.node.selector.[labelKey]}}, but we can't use > separate options for the driver and executors. > This would be useful for the particular use case where executors can go on > more ephemeral nodes (eg, with cluster autoscaling, or preemptible/spot > instances), but the driver should use a more persistent machine. > The required change would be minimal, essentially just using different config > keys for the > [driver|https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/steps/BasicDriverConfigurationStep.scala#L90] > and > [executor|https://github.com/apache/spark/blob/0b2eefb674151a0af64806728b38d9410da552ec/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodFactory.scala#L73] > instead of {{KUBERNETES_NODE_SELECTOR_PREFIX}} for both.
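To make the request concrete: today a single selector prefix applies to the driver and executors alike, and the split keys below are only this issue's proposal, not an existing API in 2.3. The API server address and label values are placeholders.

```shell
# Spark 2.3 behavior: one node selector applies to driver AND executors.
./bin/spark-submit \
  --master k8s://https://<api-server> \
  --deploy-mode cluster \
  --conf spark.kubernetes.node.selector.disktype=ssd \
  local:///opt/spark/examples/jars/<app.jar>

# Hypothetical per-role keys this issue proposes (NOT available in 2.3):
#   spark.kubernetes.driver.node.selector.[labelKey]
#   spark.kubernetes.executor.node.selector.[labelKey]
```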
[jira] [Commented] (SPARK-23133) Spark options are not passed to the Executor in Docker context
[ https://issues.apache.org/jira/browse/SPARK-23133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331074#comment-16331074 ] Anirudh Ramanathan commented on SPARK-23133: Thanks for submitting the PR fixing this. > Spark options are not passed to the Executor in Docker context > -- > > Key: SPARK-23133 > URL: https://issues.apache.org/jira/browse/SPARK-23133 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 > Environment: Running Spark on K8s using supplied Docker image. >Reporter: Andrew Korzhuev >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > Reproduce: > # Build image with `bin/docker-image-tool.sh`. > # Submit application to k8s. Set executor options, e.g. ` --conf > "spark.executor. > extraJavaOptions=-Djava.security.auth.login.config=./jaas.conf"` > # Visit Spark UI on executor and notice that option is not set. > Expected behavior: options from spark-submit should be correctly passed to > executor. > Cause: > `SPARK_EXECUTOR_JAVA_OPTS` is not defined in `entrypoint.sh` > https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L70 > [https://github.com/apache/spark/blob/branch-2.3/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh#L44-L45]
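A sketch of what the missing executor-side handling could look like, assuming it would mirror the driver-side pipeline visible in the `{noformat}` log quoted under SPARK-22962 above (the `SPARK_EXECUTOR_JAVA_OPTS` array name and the executor java command line are assumptions, not the merged fix):

```shell
#!/bin/bash
# Collect the values of all SPARK_JAVA_OPT_* environment variables into a file,
# one per line, exactly as entrypoint.sh already does for the driver...
env | grep SPARK_JAVA_OPT_ | sed 's/[^=]*=\(.*\)/\1/g' > /tmp/java_opts.txt

# ...read them into a bash array for the executor...
readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt

# ...and pass them to the executor JVM as well, e.g. (sketch, not runnable here):
# ${JAVA_HOME}/bin/java "${SPARK_EXECUTOR_JAVA_OPTS[@]}" -cp "$SPARK_CLASSPATH" \
#   org.apache.spark.executor.CoarseGrainedExecutorBackend ...
```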
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier().setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Proposed solution: the condition can be modified to verify that. {code:java} if (maxLabelRow.size <= 1) { throw new SparkException("ML algorithm was given empty dataset.") } {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier() > .setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45)
[jira] [Commented] (SPARK-20928) SPIP: Continuous Processing Mode for Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-20928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331045#comment-16331045 ] Thomas Graves commented on SPARK-20928: --- what is status of this, it looks like subtasks are finished? > SPIP: Continuous Processing Mode for Structured Streaming > - > > Key: SPARK-20928 > URL: https://issues.apache.org/jira/browse/SPARK-20928 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Michael Armbrust >Priority: Major > Labels: SPIP > Attachments: Continuous Processing in Structured Streaming Design > Sketch.pdf > > > Given the current Source API, the minimum possible latency for any record is > bounded by the amount of time that it takes to launch a task. This > limitation is a result of the fact that {{getBatch}} requires us to know both > the starting and the ending offset, before any tasks are launched. In the > worst case, the end-to-end latency is actually closer to the average batch > time + task launching time. > For applications where latency is more important than exactly-once output > however, it would be useful if processing could happen continuously. This > would allow us to achieve fully pipelined reading and writing from sources > such as Kafka. This kind of architecture would make it possible to process > records with end-to-end latencies on the order of 1 ms, rather than the > 10-100ms that is possible today. > One possible architecture here would be to change the Source API to look like > the following rough sketch: > {code} > trait Epoch { > def data: DataFrame > /** The exclusive starting position for `data`. */ > def startOffset: Offset > /** The inclusive ending position for `data`. Incrementally updated > during processing, but not complete until execution of the query plan in > `data` is finished. 
*/ > def endOffset: Offset > } > def getBatch(startOffset: Option[Offset], endOffset: Option[Offset], > limits: Limits): Epoch > {code} > The above would allow us to build an alternative implementation of > {{StreamExecution}} that processes continuously with much lower latency and > only stops processing when needing to reconfigure the stream (either due to a > failure or a user-requested change in parallelism).
[jira] [Created] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
Matthew Tovbin created SPARK-23152: -- Summary: Invalid guard condition in org.apache.spark.ml.classification.Classifier Key: SPARK-23152 URL: https://issues.apache.org/jira/browse/SPARK-23152 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0, 2.1.3, 2.3.0, 2.3.1 Reporter: Matthew Tovbin When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. 
{code:java} maxLabelRow.size <= 1 {code}
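One caution on the proposed condition: `take(1)` returns a single-element array for non-empty input as well, so `maxLabelRow.size <= 1` would seem to hold for valid data too. A value-based null check avoids that ambiguity. The following Scala sketch is an assumption about what such a guard could look like, not the committed fix; the helper name is hypothetical:

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.Row

// An aggregate such as max(labelCol) over an empty dataset yields one row
// holding a null value, so take(1) is non-empty either way; checking the
// value for null covers the empty-input case, while isEmpty alone does not.
def maxLabelOrFail(maxLabelRow: Array[Row]): Double = {
  if (maxLabelRow.isEmpty || maxLabelRow.head.isNullAt(0)) {
    throw new SparkException("ML algorithm was given empty dataset.")
  }
  maxLabelRow.head.getDouble(0)
}
```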
[jira] [Updated] (SPARK-23152) Invalid guard condition in org.apache.spark.ml.classification.Classifier
[ https://issues.apache.org/jira/browse/SPARK-23152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Tovbin updated SPARK-23152: --- Description: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} was: When fitting a classifier that extends "org.apache.spark.ml.classification.Classifier" (NaiveBayes, DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is thrown. 
Steps to reproduce: {code:java} val data = spark.createDataset(Seq.empty[(Double, org.apache.spark.ml.linalg.Vector)]) new DecisionTreeClassifier() .setLabelCol("_1").setFeaturesCol("_2").fit(data) {code} The error: {code:java} java.lang.NullPointerException: Value at index 0 is null at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) at org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) at org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} The problem happens due to an incorrect guard condition in org.apache.spark.ml.classification.Classifier:getNumClasses {code:java} val maxLabelRow: Array[Row] = dataset.select(max($(labelCol))).take(1) if (maxLabelRow.isEmpty) { throw new SparkException("ML algorithm was given empty dataset.") } {code} When the input data is empty the "maxLabelRow" array is not. It contains a single element with no columns in it. Therefore the condition has to be modified to verify that. {code:java} maxLabelRow.size <= 1 {code} > Invalid guard condition in org.apache.spark.ml.classification.Classifier > > > Key: SPARK-23152 > URL: https://issues.apache.org/jira/browse/SPARK-23152 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.3.0, > 2.3.1 >Reporter: Matthew Tovbin >Priority: Minor > Labels: easyfix > > When fitting a classifier that extends > "org.apache.spark.ml.classification.Classifier" (NaiveBayes, > DecisionTreeClassifier, RandomForestClassifier) a NullPointerException is > thrown. 
> Steps to reproduce: > {code:java} > val data = spark.createDataset(Seq.empty[(Double, > org.apache.spark.ml.linalg.Vector)]) > new DecisionTreeClassifier() > .setLabelCol("_1").setFeaturesCol("_2").fit(data) > {code} > The error: > {code:java} > java.lang.NullPointerException: Value at index 0 is null > at org.apache.spark.sql.Row$class.getAnyValAs(Row.scala:472) > at org.apache.spark.sql.Row$class.getDouble(Row.scala:248) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.getDouble(rows.scala:165) > at > org.apache.spark.ml.classification.Classifier.getNumClasses(Classifier.scala:115) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:102) > at > org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:45) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:118){code} >
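The guard-condition failure above can be reproduced outside Spark: an aggregate such as max() over zero input rows still yields exactly one result row whose value is null, so a check on the collected row count alone never fires. Below is a minimal, hedged Python sketch of a corrected guard; the function name and structure are illustrative, not Spark's actual code.

```python
# Sketch: why `maxLabelRow.isEmpty` never triggers for an empty dataset.
# `dataset.select(max(labelCol)).take(1)` on empty input returns ONE row
# whose single column is null (modeled here as None), so the guard must
# inspect the value, not just the number of rows returned.

def get_num_classes(max_label_rows):
    """max_label_rows mimics the result of select(max(labelCol)).take(1)."""
    if not max_label_rows or max_label_rows[0][0] is None:
        # corrected guard: no rows at all, OR one row holding a null max
        raise ValueError("ML algorithm was given empty dataset.")
    # labels are assumed 0-based doubles, as in ml.classification
    return int(max_label_rows[0][0]) + 1

assert get_num_classes([(2.0,)]) == 3   # max label 2.0 -> 3 classes
```

With this shape of guard, an empty input raises the intended SparkException-style error instead of a NullPointerException deep inside Row.getDouble.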
[jira] [Assigned] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-23029: - Assignee: Fernando Pereira > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Minor > Fix For: 2.3.0 > > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
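The failure mode reported here follows from the documentation gap the ticket fixes: spark.shuffle.file.buffer is interpreted in KiB when the value carries no unit suffix, so passing $((32*1024)) = 32768 requests a 32 MiB buffer per open partition writer rather than the intended 32 KiB, and many concurrent writers then exhaust the heap. A rough model of that parsing rule (simplified; the real parsing lives in Spark's byte-string utility helpers):

```python
def buffer_bytes(setting: str) -> int:
    """Simplified model: a bare number is treated as KiB; suffixes scale it."""
    units = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    s = setting.strip().lower()
    if s[-1] in units:
        return int(s[:-1]) * units[s[-1]]
    return int(s) * 1024  # no suffix: KiB, not bytes

# Intending 32 KiB but writing the value in "bytes":
assert buffer_bytes("32768") == 32 * 1024 * 1024  # actually 32 MiB per writer
assert buffer_bytes("32k") == 32 * 1024           # what was meant
```

So the OutOfMemoryError in the report is consistent with 10 shuffle writers each allocating a 32 MiB BufferedOutputStream rather than a 32 KiB one.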
[jira] [Resolved] (SPARK-23029) Doc spark.shuffle.file.buffer units are kb when no units specified
[ https://issues.apache.org/jira/browse/SPARK-23029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23029. --- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20269 [https://github.com/apache/spark/pull/20269] > Doc spark.shuffle.file.buffer units are kb when no units specified > -- > > Key: SPARK-23029 > URL: https://issues.apache.org/jira/browse/SPARK-23029 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Fernando Pereira >Priority: Minor > Fix For: 2.3.0 > > > When setting the spark.shuffle.file.buffer setting, even to its default > value, shuffles fail. > This appears to affect small to medium size partitions. Strangely the error > message is OutOfMemoryError, but it works with large partitions (at least > >32MB). > {code} > pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" > /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit > pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 > version 2.2.1 > >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", > >>> mode="overwrite") > [Stage 1:>(0 + 10) / > 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID > 11) > java.lang.OutOfMemoryError: Java heap space > at java.io.BufferedOutputStream.(BufferedOutputStream.java:75) > at > org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.(DiskBlockObjectWriter.scala:107) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) > at > 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Issue Comment Deleted] (SPARK-13572) HiveContext reads avro Hive tables incorrectly
[ https://issues.apache.org/jira/browse/SPARK-13572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-13572: --- Comment: was deleted (was: Guten Tag, Bonjour, Buongiorno Besten Dank für Ihr E-Mail. Ich bin zurzeit nicht am Arbeitsplatz und kehre am 27.06.2016 zurück. Ihr E-Mail wird nicht weitergeleitet. Merci pour votre courriel. Je suis actuellement absent(e) et serai de retour le 27.06.2016. Votre courriel ne sera pas dévié. La ringrazio per la sua e-mail. Al momento sono assente. Sarò di ritorno il 27.06.2016. Il suo messaggio non è inoltrato a terze persone. Freundliche Grüsse Avec mes meilleures salutations Cordiali saluti Russ Weedon Professional ERP Engineer SBB AG SBB Informatik Solution Center Finanzen, K-SCM, HR und Immobilien Lindenhofstrasse 1, 3000 Bern 65 Mobil +41 78 812 47 62 russ.wee...@sbb.ch / www.sbb.ch ) > HiveContext reads avro Hive tables incorrectly > --- > > Key: SPARK-13572 > URL: https://issues.apache.org/jira/browse/SPARK-13572 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.2, 1.6.0, 1.6.1 > Environment: Hive 0.13.1, Spark 1.5.2 >Reporter: Zoltan Fedor >Priority: Major > Attachments: logs, table_definition > > > I am using PySpark to read avro-based tables from Hive and while the avro > tables can be read, some of the columns are incorrectly read - showing value > {{None}} instead of the actual value. 
> {noformat} > >>> results_df = sqlContext.sql("""SELECT * FROM trmdw_prod.opsconsole_ingest > >>> where year=2016 and month=2 and day=29 limit 3""") > >>> results_df.take(3) > [Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29), > Row(kafkaoffsetgeneration=None, kafkapartition=None, kafkaoffset=None, > uuid=None, mid=None, iid=None, product=None, utctime=None, statcode=None, > statvalue=None, displayname=None, category=None, > source_filename=u'ops-20160228_23_35_01.gz', year=2016, month=2, day=29)] > {noformat} > Observe the {{None}} values at most of the fields. Surprisingly not all > fields, only some of them are showing {{None}} instead of the real values. > The table definition does not show anything specific about these columns. 
> Running the same query in Hive: > {noformat} > c:hive2://xyz.com:100> SELECT * FROM trmdw_prod.opsconsole_ingest where > year=2016 and month=2 and day=29 limit 3; > +--+---++---+---+---+++-+--++-++-+--++--+ > | opsconsole_ingest.kafkaoffsetgeneration | opsconsole_ingest.kafkapartition > | opsconsole_ingest.kafkaoffset | opsconsole_ingest.uuid | > opsconsole_ingest.mid | opsconsole_ingest.iid | > opsconsole_ingest.product | opsconsole_ingest.utctime | > opsconsole_ingest.statcode | opsconsole_ingest.statvalue | > opsconsole_ingest.displayname | opsconsole_ingest.category | > opsconsole_ingest.source_filename | opsconsole_ingest.year | > opsconsole_ingest.month | opsconsole_ingest.day | > +--+---++---+---+---+++-+--++-++-+--++--+ > | 11.0 | 0.0 > | 3.83399394E8 | EF0D03C409681B98646F316CA1088973 | > 174f53fb-ca9b-d3f9-64e1-7631bf906817 | ---- > | est| 2016-01-13T06:58:19| 8 >
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330943#comment-16330943 ] Andrew Ash commented on SPARK-22982: [~joshrosen] do you have some example stacktraces of what this bug can cause? Several of our clusters hit what I think is this problem earlier this month, see below for details. For a few days in January (4th through 12th) on our AWS infra, we observed massively degraded disk read throughput (down to 33% of previous peaks). During this time, we also began observing intermittent exceptions coming from Spark at read time of parquet files that a previous Spark job had written. When the read throughput recovered on the 12th, we stopped observing the exceptions and haven't seen them since. At first we observed this stacktrace when reading .snappy.parquet files: {noformat} java.lang.RuntimeException: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:493) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:486) at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:96) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:486) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:157) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:229) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:398) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191) at org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) at org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: could not read page Page [bytes.size=1048641, valueCount=29945, uncompressedSize=1048641] in col [my_column] BINARY at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV1(VectorizedColumnReader.java:562) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$000(VectorizedColumnReader.java:47) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:490) ... 31 more Caused by: java.io
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23147: -- Assignee: Saisai Shao > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Resolved] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23147. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20315 [https://github.com/apache/spark/pull/20315] > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Major > Fix For: 2.3.0 > > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Updated] (SPARK-23151) Provide a distribution of Spark with Hadoop 3.0
[ https://issues.apache.org/jira/browse/SPARK-23151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Louis Burke updated SPARK-23151: Summary: Provide a distribution of Spark with Hadoop 3.0 (was: Provide a distribution Spark with Hadoop 3.0) > Provide a distribution of Spark with Hadoop 3.0 > --- > > Key: SPARK-23151 > URL: https://issues.apache.org/jira/browse/SPARK-23151 > Project: Spark > Issue Type: Dependency upgrade > Components: Spark Core >Affects Versions: 2.2.0, 2.2.1 >Reporter: Louis Burke >Priority: Major > > Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark > package > only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication > is > that using up to date Kinesis libraries alongside s3 causes a clash w.r.t > aws-java-sdk.
[jira] [Created] (SPARK-23151) Provide a distribution Spark with Hadoop 3.0
Louis Burke created SPARK-23151: --- Summary: Provide a distribution Spark with Hadoop 3.0 Key: SPARK-23151 URL: https://issues.apache.org/jira/browse/SPARK-23151 Project: Spark Issue Type: Dependency upgrade Components: Spark Core Affects Versions: 2.2.1, 2.2.0 Reporter: Louis Burke Provide a Spark package that supports Hadoop 3.0.0. Currently the Spark package only supports Hadoop 2.7 i.e. spark-2.2.1-bin-hadoop2.7.tgz. The implication is that using up to date Kinesis libraries alongside s3 causes a clash w.r.t aws-java-sdk.
[jira] [Resolved] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic
[ https://issues.apache.org/jira/browse/SPARK-23150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23150. --- Resolution: Invalid > kafka job failing assertion failed: Beginning offset 5451 is after the ending > offset 5435 for topic > --- > > Key: SPARK-23150 > URL: https://issues.apache.org/jira/browse/SPARK-23150 > Project: Spark > Issue Type: Question > Components: Build >Affects Versions: 0.9.0 > Environment: Spark Job failing with below error. Please help > Kafka version 0.10.0.0 > Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most > recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): > java.lang.AssertionError: assertion failed: Beginning offset 5451 is after > the ending offset 5435 for topic project1 partition 2. You either provided an > invalid fromOffset, or the Kafka topic has been damaged at > scala.Predef$.assert(Predef.scala:179) at > org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at > org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at > 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at > org.apache.spark.scheduler.Task.run(Task.scala:70) at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) Driver stacktrace: >Reporter: Balu >Priority: Critical >
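The assertion in this report corresponds to a simple per-partition invariant in the streaming Kafka RDD: the stored fromOffset must not exceed the untilOffset computed for the batch. Checkpointed offsets that outlive a recreated topic, or data trimmed by retention, violate it. A hedged sketch of the check (the function name is illustrative, not Spark's API):

```python
def offset_range_count(from_offset: int, until_offset: int) -> int:
    """Records to read for one topic-partition; mirrors KafkaRDD's assertion."""
    assert from_offset <= until_offset, (
        f"Beginning offset {from_offset} is after the ending offset "
        f"{until_offset}. You either provided an invalid fromOffset, "
        "or the Kafka topic has been damaged")
    return until_offset - from_offset

assert offset_range_count(5435, 5451) == 16  # healthy range: 16 records
```

Passing the offsets from the report in the reversed order (5451 as the beginning, 5435 as the end) trips the assertion with exactly the message seen in the stack trace.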
[jira] [Created] (SPARK-23150) kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic
Balu created SPARK-23150: Summary: kafka job failing assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic Key: SPARK-23150 URL: https://issues.apache.org/jira/browse/SPARK-23150 Project: Spark Issue Type: Question Components: Build Affects Versions: 0.9.0 Environment: Spark Job failing with below error. Please help Kafka version 0.10.0.0 Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 16, 10.3.12.16): java.lang.AssertionError: assertion failed: Beginning offset 5451 is after the ending offset 5435 for topic project1 partition 2. You either provided an invalid fromOffset, or the Kafka topic has been damaged at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.streaming.kafka.KafkaRDD.compute(KafkaRDD.scala:86) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277) at org.apache.spark.rdd.RDD.iterator(RDD.scala:244) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Driver stacktrace: Reporter: Balu
[jira] [Assigned] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23149: Assignee: Wenchen Fan (was: Apache Spark) > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Commented] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330610#comment-16330610 ] Apache Spark commented on SPARK-23149: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20316 > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-23149) polish ColumnarBatch
[ https://issues.apache.org/jira/browse/SPARK-23149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23149: Assignee: Apache Spark (was: Wenchen Fan) > polish ColumnarBatch > > > Key: SPARK-23149 > URL: https://issues.apache.org/jira/browse/SPARK-23149 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-23149) polish ColumnarBatch
Wenchen Fan created SPARK-23149: --- Summary: polish ColumnarBatch Key: SPARK-23149 URL: https://issues.apache.org/jira/browse/SPARK-23149 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Comment Edited] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558 ] Riccardo Vincelli edited comment on SPARK-19280 at 1/18/18 2:20 PM: What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related but our program: -reads Kafka topics -uses mapWithState on an initialState Thanks, was (Author: rvincelli): What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related. 
Thanks, > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and JobCompleted message is sent to > JobScheduler's queue, and meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs generated from 16652 - 16670 are generated > and completed. > 5. the message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processed all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. 
JS-EventLoop is scheduled, and processes all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs outside the rememberDuration window. In our case, > ReduceByKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowedDStream (4) + windowSize), so that only RDDs in > (16660, 16670] are kept. And the ClearMetadata processing logic will push > a DoCheckpoint to JobGenerator's thread > 8. JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDDs no later than 16660 have been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/sp
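The window arithmetic in step 7 of the quoted description can be sketched as follows. This is an illustrative model, not Spark's actual `ClearMetadata` code, and it treats the truncated batch timestamps from the report (`***16650` etc.) as plain integers:

```python
def retained_rdd_times(batch_times, last_batch_time, remember_duration):
    """Model of ClearMetadata: keep only RDDs whose batch time falls in
    (last_batch_time - remember_duration, last_batch_time]."""
    low = last_batch_time - remember_duration
    return [t for t in batch_times if low < t <= last_batch_time]

# Batch duration 2, jobs generated for times 16650..16670
batch_times = list(range(16650, 16671, 2))
# rememberDuration = ReducedWindowedDStream's own 4 + windowSize 6 = 10
kept = retained_rdd_times(batch_times, last_batch_time=16670,
                          remember_duration=10)
print(kept)  # only times in (16660, 16670] survive
```

This reproduces the retention claimed in the report: after the batched ClearMetadata messages run, only the last five RDDs remain, so any checkpoint still referring to 16650 points at nothing.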
[jira] [Commented] (SPARK-19280) Failed Recovery from checkpoint caused by the multi-threads issue in Spark Streaming scheduler
[ https://issues.apache.org/jira/browse/SPARK-19280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330558#comment-16330558 ] Riccardo Vincelli commented on SPARK-19280: --- What is the exception you are getting? For me it is: This RDD lacks a SparkContext. It could happen in the following cases: (1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063. (2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758. Not sure yet if 100% related. Thanks, > Failed Recovery from checkpoint caused by the multi-threads issue in Spark > Streaming scheduler > -- > > Key: SPARK-19280 > URL: https://issues.apache.org/jira/browse/SPARK-19280 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0 >Reporter: Nan Zhu >Priority: Critical > > In one of our applications, we found the following issue, the application > recovering from a checkpoint file named "checkpoint-***16670" but with > the timestamp ***16650 will recover from the very beginning of the stream > and because our application relies on the external & periodically-cleaned > data (syncing with checkpoint cleanup), the recovery just failed > We identified a potential issue in Spark Streaming checkpoint and will > describe it with the following example. We will propose a fix in the end of > this JIRA. > 1. The application properties: Batch Duration: 2, Functionality: Single > Stream calling ReduceByKeyAndWindow and print, Window Size: 6, > SlideDuration, 2 > 2. 
RDD at 16650 is generated and the corresponding job is submitted to > the execution ThreadPool. Meanwhile, a DoCheckpoint message is sent to the > queue of JobGenerator > 3. Job at 16650 is finished and a JobCompleted message is sent to > JobScheduler's queue; meanwhile, Job at 16652 is submitted to the > execution ThreadPool and similarly, a DoCheckpoint is sent to the queue of > JobGenerator > 4. JobScheduler's message processing thread (I will use JS-EventLoop to > identify it) is not scheduled by the operating system for a long time, and > during this period, Jobs for 16652 - 16670 are generated > and completed. > 5. The message processing thread of JobGenerator (JG-EventLoop) is scheduled > and processes all DoCheckpoint messages for jobs ranging from 16652 - > 16670 and checkpoint files are successfully written. CRITICAL: at this > moment, the lastCheckpointTime would be 16670. > 6. JS-EventLoop is scheduled, and processes all JobCompleted messages for jobs > ranging from 16652 - 16670. CRITICAL: a ClearMetadata message is > pushed to JobGenerator's message queue for EACH JobCompleted. > 7. The current message queue contains 20 ClearMetadata messages and > JG-EventLoop is scheduled to process them. CRITICAL: ClearMetadata will > remove all RDDs outside the rememberDuration window. In our case, > ReduceByKeyAndWindow will set rememberDuration to 10 (rememberDuration of > ReducedWindowedDStream (4) + windowSize), so that only RDDs in > (16660, 16670] are kept. And the ClearMetadata processing logic will push > a DoCheckpoint to JobGenerator's thread > 8.
JG-EventLoop is scheduled again to process DoCheckpoint for 1665, VERY > CRITICAL: at this step, RDD no later than 16660 has been removed, and > checkpoint data is updated as > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L53 > and > https://github.com/apache/spark/blob/a81e336f1eddc2c6245d807aae2c81ddc60eabf9/streaming/src/main/scala/org/apache/spark/streaming/dstream/DStreamCheckpointData.scala#L59. > 9. After 8, a checkpoint named /path/checkpoint-16670 is created but with > the timestamp 16650. and at this moment, Application crashed > 10. Application recovers from /path/checkpoint-16670 and try to get RDD > with validTime 16650. Of course it will not find it and has to recompute. > In the case of ReduceByKeyAndWindow, it needs to recursively slice RDDs until > to the start of the stream. When the stream depends on the external data, it > will not successfully recover. In the case of Kafka, the recovered RDDs would > not be the same as the original one, as the currentOffsets has been u
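Steps 8-10 of the quoted description boil down to a name/content mismatch: the checkpoint file is named after the time it was written, but recovery reads the timestamp stored *inside* it. A toy model of that mismatch (not Spark's code; truncated timestamps treated as integers):

```python
def recover(checkpoint, available_rdds):
    """Illustrative model of step 10: recovery asks for the RDD at the
    checkpoint's internal timestamp, not the time in the file name."""
    t = checkpoint["timestamp"]
    if t in available_rdds:
        return ("restored", t)
    # The RDD was cleared before the checkpoint was written (steps 7-8),
    # so the job must recompute back to the start of the stream.
    return ("recompute-from-start", t)

# Step 9: file named checkpoint-16670, but carrying timestamp 16650
ckpt = {"name": "checkpoint-16670", "timestamp": 16650}
# After ClearMetadata only RDDs in (16660, 16670] survive (step 7)
rdds = {16662, 16664, 16666, 16668, 16670}
print(recover(ckpt, rdds))
```

With external, periodically-cleaned inputs (Kafka offsets in the report), that recomputation cannot reproduce the original RDDs, which is why the recovery fails outright.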
[jira] [Resolved] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.
[ https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23141. -- Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20307 [https://github.com/apache/spark/pull/20307] > Support data type string as a returnType for registerJavaFunction. > -- > > Key: SPARK-23141 > URL: https://issues.apache.org/jira/browse/SPARK-23141 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type > string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or > {{@pandas_udf}} does. > We can support it for {{UDFRegistration.registerJavaFunction}} as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
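The improvement lets callers of {{registerJavaFunction}} pass a DDL-formatted type string (e.g. "integer") instead of a DataType object, matching {{register}}, {{@udf}}, and {{@pandas_udf}}. A minimal sketch of the normalization such an API needs; the alias table below is a hypothetical subset, not PySpark's actual type-string parser:

```python
# Hypothetical subset of DDL type-string aliases; PySpark's real parser
# handles the full grammar, including nested types like array<string>.
_TYPE_ALIASES = {
    "int": "integer", "integer": "integer",
    "long": "long", "bigint": "long",
    "string": "string", "double": "double",
}

def resolve_return_type(return_type):
    """Accept either a type object (anything non-str) or a DDL string."""
    if not isinstance(return_type, str):
        return return_type  # already a DataType-like object, pass through
    name = return_type.strip().lower()
    if name not in _TYPE_ALIASES:
        raise ValueError(f"unsupported type string: {return_type!r}")
    return _TYPE_ALIASES[name]

print(resolve_return_type("BigInt"))   # normalized to 'long'
print(resolve_return_type("integer"))  # 'integer'
```

The design point is that string handling is a thin front on top of the existing DataType path, so both spellings register the same UDF.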
[jira] [Assigned] (SPARK-23141) Support data type string as a returnType for registerJavaFunction.
[ https://issues.apache.org/jira/browse/SPARK-23141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-23141: Assignee: Takuya Ueshin > Support data type string as a returnType for registerJavaFunction. > -- > > Key: SPARK-23141 > URL: https://issues.apache.org/jira/browse/SPARK-23141 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 2.3.0 > > > Currently {{UDFRegistration.registerJavaFunction}} doesn't support data type > string as a returnType whereas {{UDFRegistration.register}}, {{@udf}}, or > {{@pandas_udf}} does. > We can support it for {{UDFRegistration.registerJavaFunction}} as well.
[jira] [Resolved] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22036. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20023 [https://github.com/apache/spark/pull/20023] > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code}
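The null in the repro comes from decimal precision overflow: Scala BigDecimal columns default to DecimalType(38, 18), and under Spark's SQL-style rules a product's type has precision p1 + p2 + 1 and scale s1 + s2, which blows past the 38-digit cap; before this fix, an overflowing result became null rather than being rounded. A rough model of that rule (simplified; Spark's actual DecimalPrecision logic has more cases):

```python
MAX_PRECISION = 38  # Spark SQL's maximum decimal precision

def multiply_result_type(p1, s1, p2, s2):
    """SQL-style result type for multiplying decimal(p1,s1) * decimal(p2,s2)."""
    return p1 + p2 + 1, s1 + s2

def fits_without_loss(p1, s1, p2, s2):
    """Old behavior: if the exact result type overflows, Spark returned null."""
    precision, _scale = multiply_result_type(p1, s1, p2, s2)
    return precision <= MAX_PRECISION

# Two default Scala BigDecimal columns: DecimalType(38, 18) each
print(multiply_result_type(38, 18, 38, 18))  # (77, 36): far beyond 38 digits
print(fits_without_loss(38, 18, 38, 18))     # False, hence the observed [null]
```

The fix (PR 20023) instead adjusts precision/scale to fit, trading some precision for a non-null answer.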
[jira] [Assigned] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22036: --- Assignee: Marco Gaido > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain >Assignee: Marco Gaido >Priority: Major > Fix For: 2.3.0 > > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1 > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code}
[jira] [Commented] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330496#comment-16330496 ] Apache Spark commented on SPARK-23147: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/20315 > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
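The failure mode here, summarising a task table with no completed entries, is the classic empty-aggregate pitfall: min/max/percentiles over zero rows throw. A generic guard, sketched in Python rather than the actual Scala UI code:

```python
def task_duration_summary(tasks):
    """Summarise durations of completed tasks; tolerate an empty table."""
    done = [t["duration"] for t in tasks
            if t.get("status") == "SUCCESS" and t.get("duration") is not None]
    if not done:
        # No completed tasks yet: render an empty summary, don't crash
        return {"count": 0, "min": None, "max": None}
    return {"count": len(done), "min": min(done), "max": max(done)}

# The sleep(1) repro: all 20 tasks still running, nothing completed yet
running = [{"status": "RUNNING", "duration": None} for _ in range(20)]
print(task_duration_summary(running))  # count 0, no min/max, no exception
```

The field names and statuses above are illustrative; the point is only that every aggregate on the stage page needs an explicit empty-input branch.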
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23147: Assignee: (was: Apache Spark) > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Assigned] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23147: Assignee: Apache Spark > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Apache Spark >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Created] (SPARK-23148) spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces
Bogdan Raducanu created SPARK-23148: --- Summary: spark.read.csv with multiline=true gives FileNotFoundException if path contains spaces Key: SPARK-23148 URL: https://issues.apache.org/jira/browse/SPARK-23148 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Bogdan Raducanu Repro code: {code:java} spark.range(10).write.csv("/tmp/a b c/a.csv") spark.read.option("multiLine", false).csv("/tmp/a b c/a.csv").count 10 spark.read.option("multiLine", true).csv("/tmp/a b c/a.csv").count java.io.FileNotFoundException: File file:/tmp/a%20b%20c/a.csv/part-0-cf84f9b2-5fe6-4f54-a130-a1737689db00-c000.csv does not exist {code}
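The error message points at the cause: in multiLine mode the path travels as a URI, and the percent-encoded form (/tmp/a%20b%20c/...) is later handed to the filesystem without being unescaped. A minimal sketch of the missing round-trip; using Python's urllib here is illustrative, not Spark's actual fix:

```python
from urllib.parse import quote, unquote

fs_path = "/tmp/a b c/a.csv"
uri_path = quote(fs_path)  # what the URI form of the path carries
print(uri_path)            # spaces become %20

# Passing uri_path straight to the filesystem is the bug: no file named
# "/tmp/a%20b%20c/a.csv" exists on disk. Decoding first recovers the
# real path that was written:
print(unquote(uri_path) == fs_path)
```

The non-multiLine reader works because it never converts the path through its encoded URI form before opening the file.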
[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23147: Description: Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Below is the screenshot. !Screen Shot 2018-01-18 at 8.50.08 PM.png! Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code. was: Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code. > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. Below is the screenshot. > !Screen Shot 2018-01-18 at 8.50.08 PM.png! > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.
[jira] [Created] (SPARK-23147) Stage page will throw exception when there's no complete tasks
Saisai Shao created SPARK-23147: --- Summary: Stage page will throw exception when there's no complete tasks Key: SPARK-23147 URL: https://issues.apache.org/jira/browse/SPARK-23147 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.3.0 Reporter: Saisai Shao Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png Current Stage page's task table will throw an exception when there's no complete tasks, to reproduce this issue, user could submit code like: {code} sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() {code} Then open the UI and click into stage details. Deep dive into the code, found that current UI can only show the completed tasks, it is different from 2.2 code.
[jira] [Updated] (SPARK-23147) Stage page will throw exception when there's no complete tasks
[ https://issues.apache.org/jira/browse/SPARK-23147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-23147: Attachment: Screen Shot 2018-01-18 at 8.50.08 PM.png > Stage page will throw exception when there's no complete tasks > -- > > Key: SPARK-23147 > URL: https://issues.apache.org/jira/browse/SPARK-23147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Priority: Major > Attachments: Screen Shot 2018-01-18 at 8.50.08 PM.png > > > Current Stage page's task table will throw an exception when there's no > complete tasks, to reproduce this issue, user could submit code like: > {code} > sc.parallelize(1 to 20, 20).map { i => Thread.sleep(1); i }.collect() > {code} > Then open the UI and click into stage details. > Deep dive into the code, found that current UI can only show the completed > tasks, it is different from 2.2 code.