[jira] [Commented] (SPARK-19337) Documentation and examples for LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848099#comment-15848099 ]

yuhao yang commented on SPARK-19337:

I'll start work on this if no one has started.

> Documentation and examples for LinearSVC
> -----------------------------------------
>
> Key: SPARK-19337
> URL: https://issues.apache.org/jira/browse/SPARK-19337
> Project: Spark
> Issue Type: Documentation
> Components: Documentation, ML
> Reporter: Joseph K. Bradley
>
> User guide + example code for LinearSVC

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
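Since the ticket asks for user-guide example code, the objective LinearSVC fits can be sketched in a few lines. This is a minimal sketch in plain Python with hypothetical names, not the Spark API: the mean hinge loss plus L2 regularization, with labels following Spark ML's {0, 1} convention.

```python
# A minimal sketch (not Spark code) of the objective behind LinearSVC:
# mean hinge loss plus L2 regularization. Labels follow Spark ML's {0, 1}
# convention and are mapped to {-1, +1} internally.
def linear_svc_objective(w, b, data, reg_param):
    total = 0.0
    for x, y in data:
        margin = sum(wi * xi for wi, xi in zip(w, x)) + b
        sign = 1.0 if y == 1.0 else -1.0          # {0, 1} -> {-1, +1}
        total += max(0.0, 1.0 - sign * margin)    # hinge loss
    l2 = 0.5 * reg_param * sum(wi * wi for wi in w)
    return total / len(data) + l2
```

A correctly classified point with margin >= 1 contributes zero loss; misclassified points contribute linearly in their margin deficit.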
[jira] [Assigned] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19421:
Assignee: Apache Spark

> Remove numClasses and numFeatures methods in LinearSVC
> -------------------------------------------------------
>
> Key: SPARK-19421
> URL: https://issues.apache.org/jira/browse/SPARK-19421
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.2.0
> Reporter: zhengruifeng
> Assignee: Apache Spark
> Priority: Minor
>
> As suggested by [~holdenk], open another jira to track the following issue:
> Methods {{numClasses}} and {{numFeatures}} in {{LinearSVCModel}} are already
> usable by inheriting {{JavaClassificationModel}},
> we should not explicitly add them.
[jira] [Assigned] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19421:
Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC
[ https://issues.apache.org/jira/browse/SPARK-19421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848063#comment-15848063 ]

Apache Spark commented on SPARK-19421:

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16727
[jira] [Created] (SPARK-19421) Remove numClasses and numFeatures methods in LinearSVC
zhengruifeng created SPARK-19421:
Summary: Remove numClasses and numFeatures methods in LinearSVC
Key: SPARK-19421
URL: https://issues.apache.org/jira/browse/SPARK-19421
Project: Spark
Issue Type: Improvement
Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: zhengruifeng
Priority: Minor

As suggested by [~holdenk], opening another jira to track the following issue: the methods {{numClasses}} and {{numFeatures}} in {{LinearSVCModel}} are already usable by inheriting {{JavaClassificationModel}}, so we should not explicitly add them.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
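The design point can be sketched in plain Python with hypothetical stand-ins (not the real PySpark classes): properties defined on the shared base class are already visible on the subclass, so redeclaring them is redundant.

```python
# Hypothetical stand-ins for the PySpark classes in this issue: the base
# class already exposes numClasses/numFeatures, so the subclass need not
# (and per SPARK-19421 should not) redefine them.
class JavaClassificationModel:
    @property
    def numClasses(self):
        return self._call_java("numClasses")

    @property
    def numFeatures(self):
        return self._call_java("numFeatures")


class LinearSVCModel(JavaClassificationModel):
    # no explicit numClasses/numFeatures: both are inherited

    def _call_java(self, name):
        # stub standing in for the Py4J call to the underlying Scala model
        return {"numClasses": 2, "numFeatures": 4}[name]
```

Instances of the subclass answer both properties through the inherited definitions, which is why the explicit copies can be removed.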
[jira] [Assigned] (SPARK-19420) Confusing error message when using outer join on two large tables
[ https://issues.apache.org/jira/browse/SPARK-19420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19420:
Assignee: Apache Spark (was: Xiao Li)

> Confusing error message when using outer join on two large tables
> ------------------------------------------------------------------
>
> Key: SPARK-19420
> URL: https://issues.apache.org/jira/browse/SPARK-19420
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Xiao Li
> Assignee: Apache Spark
>
> When users use outer joins where both sides of the tables are too large to be
> broadcast, Spark will still select `BroadcastNestedLoopJoin`. The CROSS JOIN
> syntax is unable to cover the scenario of an outer join, but we still issue
> the following error message:
> {noformat}
> Use the CROSS JOIN syntax to allow cartesian products between these relations
> {noformat}
[jira] [Commented] (SPARK-19420) Confusing error message when using outer join on two large tables
[ https://issues.apache.org/jira/browse/SPARK-19420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848058#comment-15848058 ]

Apache Spark commented on SPARK-19420:

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16762
[jira] [Commented] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false
[ https://issues.apache.org/jira/browse/SPARK-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848057#comment-15848057 ]

Apache Spark commented on SPARK-19419:

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16762

> Unable to detect all the cases of cartesian products when
> spark.sql.crossJoin.enabled is false
> -----------------------------------------------------------
>
> Key: SPARK-19419
> URL: https://issues.apache.org/jira/browse/SPARK-19419
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Xiao Li
> Assignee: Xiao Li
>
> The existing detection is unable to cover all the cartesian product cases.
> For example,
> - Case 1) having non-equal predicates in the join conditions of an inner join.
> - Case 2) the equi-join's key columns are not sortable and both sides are not
> small enough for broadcasting.
> This PR is to move the cross-join detection back to
> `BroadcastNestedLoopJoinExec` and `CartesianProductExec`.
[jira] [Assigned] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false
[ https://issues.apache.org/jira/browse/SPARK-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19419:
Assignee: Xiao Li (was: Apache Spark)
[jira] [Assigned] (SPARK-19420) Confusing error message when using outer join on two large tables
[ https://issues.apache.org/jira/browse/SPARK-19420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19420:
Assignee: Xiao Li (was: Apache Spark)
[jira] [Assigned] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false
[ https://issues.apache.org/jira/browse/SPARK-19419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19419:
Assignee: Apache Spark (was: Xiao Li)
[jira] [Created] (SPARK-19420) Confusing error message when using outer join on two large tables
Xiao Li created SPARK-19420:
Summary: Confusing error message when using outer join on two large tables
Key: SPARK-19420
URL: https://issues.apache.org/jira/browse/SPARK-19420
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li

When users run outer joins where both sides of the join are too large to be broadcast, Spark will still select `BroadcastNestedLoopJoin`. The CROSS JOIN syntax cannot cover the outer-join scenario, but we still issue the following error message:

{noformat}
Use the CROSS JOIN syntax to allow cartesian products between these relations
{noformat}
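The confusion above comes from one message being emitted for every un-broadcastable nested-loop join. A minimal sketch of picking a message by join type, in plain Python with hypothetical names (not Spark's planner):

```python
# Hypothetical sketch, not Spark's planner: the CROSS JOIN hint only
# applies to joins a user could rewrite as a cross join (inner joins);
# an outer join that falls back to a nested-loop join needs a message
# that does not suggest an inapplicable rewrite.
def nested_loop_join_error(join_type):
    if join_type == "inner":
        return ("Use the CROSS JOIN syntax to allow cartesian products "
                "between these relations")
    return (f"Both sides of this {join_type} join are too large to "
            "broadcast, and no equi-join keys are available")
```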
[jira] [Created] (SPARK-19419) Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false
Xiao Li created SPARK-19419:
Summary: Unable to detect all the cases of cartesian products when spark.sql.crossJoin.enabled is false
Key: SPARK-19419
URL: https://issues.apache.org/jira/browse/SPARK-19419
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li
Assignee: Xiao Li

The existing detection is unable to cover all the cartesian product cases. For example:
- Case 1) non-equal predicates in the join conditions of an inner join.
- Case 2) the equi-join's key columns are not sortable and both sides are not small enough for broadcasting.

This PR is to move the cross-join detection back to `BroadcastNestedLoopJoinExec` and `CartesianProductExec`.
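The detection rule described above can be sketched roughly as follows, in plain Python with made-up parameter names (not Spark's actual detection code):

```python
# Hypothetical sketch of when a join degenerates into a cartesian-product-like
# plan, covering both cases from the issue; not Spark's real rule.
def is_cartesian_like(has_equi_keys, keys_sortable, fits_broadcast):
    if not has_equi_keys:
        return True   # Case 1: e.g. an inner join with only non-equal predicates
    if not keys_sortable and not fits_broadcast:
        return True   # Case 2: equi-keys unusable for sort-merge, no broadcast
    return False
```

Moving the check into the physical operators (`BroadcastNestedLoopJoinExec`, `CartesianProductExec`) catches both cases, because that is where these fallback plans are ultimately chosen.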
[jira] [Commented] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k
[ https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848051#comment-15848051 ]

Apache Spark commented on SPARK-19319:

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/16761

> SparkR Kmeans summary returns error when the cluster size doesn't equal to k
> -----------------------------------------------------------------------------
>
> Key: SPARK-19319
> URL: https://issues.apache.org/jira/browse/SPARK-19319
> Project: Spark
> Issue Type: Bug
> Reporter: Miao Wang
> Assignee: Miao Wang
> Fix For: 2.2.0
>
> When Kmeans uses initMode = "random" and some random seed, it is possible that
> the actual cluster size doesn't equal the configured `k`.
> In this case, summary(model) returns an error because the number of columns of
> the coefficient matrix doesn't equal k.
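The failure mode can be sketched in a few lines of plain Python (hypothetical shapes, not the SparkR internals): the summary must size its coefficient matrix from the clusters actually produced, not the configured k.

```python
# Hypothetical sketch of the bug's shape: with random init, k-means may
# return fewer centers than the requested k, so a summary that assumes
# exactly k columns breaks. Sizing from the actual centers avoids that.
def kmeans_summary(centers, k_requested):
    actual_k = len(centers)               # may be < k_requested
    return {
        "k": actual_k,
        "coefficients": [list(c) for c in centers],  # one column per real cluster
    }
```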
[jira] [Commented] (SPARK-19304) Kinesis checkpoint recovery is 10x slow
[ https://issues.apache.org/jira/browse/SPARK-19304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848042#comment-15848042 ]

Gaurav Shah commented on SPARK-19304:

[~srowen] I tried working through the code but am unsure how to write clean code in this state. `KinesisBackedBlockRDD.getPartitions` returns one partition per block, but with one partition per block we cannot batch the `getRecords` calls, since there is generally one sequence-number range per block. I tried modifying `KinesisBackedBlockRDD.getPartitions` to return one partition for validBlockIds and one partition for invalidBlockIds. That makes `KinesisBackedBlockRDDPartition` look odd, since it now accepts an array of `blockIds` while inheriting from `BlockRdd`, which is one partition per block. I tried this path and it works; it also brought the recovery time down to the same as the batch processing time. Any direction on where I can improve this code? I can open a pull request to show the differences.

> Kinesis checkpoint recovery is 10x slow
> ----------------------------------------
>
> Key: SPARK-19304
> URL: https://issues.apache.org/jira/browse/SPARK-19304
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.0
> Environment: using s3 for checkpoints, 1 executor with 19g mem and 3 cores per executor
> Reporter: Gaurav Shah
> Labels: kinesis
>
> The application runs fine initially, running batches of 1 hour, and the processing time is less than 30 minutes on average. If for some reason the application crashes and we try to restart from the checkpoint, the processing now takes forever and does not move forward. We tried the same thing at a batch interval of 1 minute: processing runs fine and takes 1.2 minutes per batch, but when we recover from the checkpoint it takes about 15 minutes per batch. After recovery the batches again process at normal speed.
> I suspect the KinesisBackedBlockRDD used for recovery is causing the slowdown.
> Stackoverflow post with more details:
> http://stackoverflow.com/questions/38390567/spark-streaming-checkpoint-recovery-is-very-very-slow
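The batching idea in the comment can be sketched like this, in plain Python with hypothetical names (not the actual KinesisBackedBlockRDD code): grouping several per-block sequence-number ranges into one partition lets recovery issue far fewer getRecords calls.

```python
# Hypothetical sketch of grouping per-block sequence ranges into fewer
# partitions, so each recovered partition can batch its getRecords calls
# instead of making one call per block.
def group_block_ranges(block_ranges, blocks_per_partition):
    """block_ranges: list of (first_seq, last_seq) tuples, one per block."""
    partitions = []
    for i in range(0, len(block_ranges), blocks_per_partition):
        partitions.append(block_ranges[i:i + blocks_per_partition])
    return partitions
```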
[jira] [Resolved] (SPARK-19319) SparkR Kmeans summary returns error when the cluster size doesn't equal to k
[ https://issues.apache.org/jira/browse/SPARK-19319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung resolved SPARK-19319.
Resolution: Fixed
Assignee: Miao Wang
Fix Version/s: 2.2.0
Target Version/s: 2.2.0
[jira] [Commented] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
[ https://issues.apache.org/jira/browse/SPARK-19418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848017#comment-15848017 ]

Suresh Avadhanula commented on SPARK-19418:

I can attach the sample code to reproduce the issue if required.
[jira] [Created] (SPARK-19418) Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
Suresh Avadhanula created SPARK-19418:
Summary: Dataset generated java code fails to compile as java.lang.Long does not accept UTF8String in constructor
Key: SPARK-19418
URL: https://issues.apache.org/jira/browse/SPARK-19418
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.1.0
Reporter: Suresh Avadhanula

I have the following in a Java Spark driver. The DealerPerson model object is:

{code:title=DealerPerson.java|borderStyle=solid}
public class DealerPerson {
    Long schemaOrgUnitId;
    List<Person> personList;
}
{code}

I populate it using a group-by as follows:

{code}
Dataset<DealerPerson> dps = persondds.groupByKey(new MapFunction<Person, Long>() {
    @Override
    public Long call(Person person) throws Exception {
        return person.getSchemaOrgUnitId();
    }
}, Encoders.LONG()).mapGroups(new MapGroupsFunction<Long, Person, DealerPerson>() {
    @Override
    public DealerPerson call(Long dp, java.util.Iterator<Person> iterator) throws Exception {
        DealerPerson retDp = new DealerPerson();
        retDp.setSchemaOrgUnitId(dp);
        ArrayList<Person> persons = new ArrayList<>();
        while (iterator.hasNext())
            persons.add(iterator.next());
        retDp.setPersons(persons);
        return retDp;
    }
}, Encoders.bean(DealerPerson.class));
{code}

The generated code throws a compile exception, since a UTF8String is passed to the java.lang.Long constructor:

{noformat}
7/01/31 20:32:28 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 5, localhost, executor driver, partition 0, PROCESS_LOCAL, 6442 bytes)
17/01/31 20:32:28 INFO Executor: Running task 0.0 in stage 5.0 (TID 5)
17/01/31 20:32:28 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 58: No applicable constructor/method found for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates are: "java.lang.Long(long)", "java.lang.Long(java.lang.String)"
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private com.xtime.spark.model.Person javaBean;
/* 010 */   private boolean resultIsNull;
/* 011 */   private UTF8String argValue;
/* 012 */   private boolean resultIsNull1;
/* 013 */   private UTF8String argValue1;
/* 014 */   private boolean resultIsNull2;
/* 015 */   private UTF8String argValue2;
/* 016 */   private boolean resultIsNull3;
/* 017 */   private UTF8String argValue3;
/* 018 */   private boolean resultIsNull4;
/* 019 */   private UTF8String argValue4;
/* 020 */
/* 021 */   public SpecificSafeProjection(Object[] references) {
/* 022 */     this.references = references;
/* 023 */     mutableRow = (InternalRow) references[references.length - 1];
/* 024 */
/* 025 */
/* 026 */
/* 027 */
/* 028 */
/* 029 */
/* 030 */
/* 031 */
/* 032 */
/* 033 */
/* 034 */
/* 035 */
/* 036 */   }
/* 037 */
/* 038 */   public void initialize(int partitionIndex) {
/* 039 */
/* 040 */   }
/* 041 */
/* 042 */
/* 043 */   private void apply_4(InternalRow i) {
/* 044 */
/* 045 */
/* 046 */     resultIsNull1 = false;
/* 047 */     if (!resultIsNull1) {
/* 048 */
/* 049 */       boolean isNull21 = i.isNullAt(17);
/* 050 */       UTF8String value21 = isNull21 ? null : (i.getUTF8String(17));
/* 051 */       resultIsNull1 = isNull21;
/* 052 */       argValue1 = value21;
/* 053 */     }
/* 054 */
/* 055 */
/* 056 */     final java.lang.Long value20 = resultIsNull1 ? null : new java.lang.Long(argValue1);
/* 057 */     javaBean.setSchemaOrgUnitId(value20);
/* 058 */
/* 059 */
/* 060 */     resultIsNull2 = false;
/* 061 */     if (!resultIsNull2) {
/* 062 */
/* 063 */       boolean isNull23 = i.isNullAt(0);
/* 064 */       UTF8String value23 = isNull23 ? null : (i.getUTF8String(0));
/* 065 */       resultIsNull2 = isNull23;
/* 066 */       argValue2 = value23;
/* 067 */     }
/* 068 */
/* 069 */
/* 070 */     final java.lang.Long value22 = resultIsNull2 ? null : new java.lang.Long(argValue2);
/* 071 */     javaBean.setPersonId(value22);
/* 072 */
/* 073 */
/* 074 */     boolean isNull25 = i.isNullAt(10);
/* 075 */     UTF8String value25 = isNull25 ? null : (i.getUTF8String(10));
/* 076 */     boolean isNull24 = true;
/* 077 */     java.lang.String value24 = null;
/* 078 */     if (!isNull25) {
/* 079 */
/* 080 */       isNull24 = false;
/* 081 */       if (!isNull24) {
/* 082 */
/* 083 */         Object funcResult8 = null;
/* 084 */         funcResult8 = value2
{noformat}
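The mismatch in the report above can be sketched independently of the codegen, in plain Python with a hypothetical helper (not Spark's fix): the row holds the column as a string-like value while the bean field is a long, so the deserializer needs an explicit conversion instead of handing the raw value to the Long constructor.

```python
# Hypothetical sketch of the type mismatch in this report: a value stored
# as a string must be converted explicitly before it can fill a Long bean
# field; passing it straight to a Long constructor is what fails here.
def to_bean_long(row_value):
    if row_value is None:
        return None                 # preserve SQL NULL
    if isinstance(row_value, int):
        return row_value
    return int(str(row_value))      # explicit string-like -> long conversion
```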
[jira] [Updated] (SPARK-19386) Bisecting k-means in SparkR documentation
[ https://issues.apache.org/jira/browse/SPARK-19386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Cheung updated SPARK-19386:
Target Version/s: 2.2.0

> Bisecting k-means in SparkR documentation
> ------------------------------------------
>
> Key: SPARK-19386
> URL: https://issues.apache.org/jira/browse/SPARK-19386
> Project: Spark
> Issue Type: Bug
> Components: ML, SparkR
> Affects Versions: 2.2.0
> Reporter: Felix Cheung
> Assignee: Miao Wang
>
> We need updates to the programming guide, example, and vignettes.
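The algorithm the guide will document can be sketched in a toy 1-D version (plain Python, not the SparkR API): start from one cluster and repeatedly bisect the largest cluster with a small 2-means step until k clusters exist.

```python
# A toy 1-D sketch of bisecting k-means (not the SparkR API): repeatedly
# split the largest cluster with a small 2-means step until k clusters.
def two_means_1d(points, iters=10):
    lo, hi = min(points), max(points)
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if abs(p - lo) <= abs(p - hi)]
        right = [p for p in points if abs(p - lo) > abs(p - hi)]
        if left:
            lo = sum(left) / len(left)
        if right:
            hi = sum(right) / len(right)
    return left, right


def bisecting_kmeans_1d(points, k):
    clusters = [sorted(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:              # cannot split a singleton
            clusters.append(biggest)
            break
        left, right = two_means_1d(biggest)
        if not left or not right:         # degenerate split, stop early
            clusters.append(biggest)
            break
        clusters.extend([left, right])
    return clusters
```

The top-down splitting order is what distinguishes bisecting k-means from plain k-means and gives its characteristic hierarchical structure.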
[jira] [Created] (SPARK-19417) spark.files.overwrite is ignored
Chris Kanich created SPARK-19417:
Summary: spark.files.overwrite is ignored
Key: SPARK-19417
URL: https://issues.apache.org/jira/browse/SPARK-19417
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.1.0
Reporter: Chris Kanich

I have not been able to get Spark to actually overwrite a file after I have changed it on the driver node, re-called addFile, and then used it on the executors again. Here's a failing test.

{code}
  test("can overwrite files when spark.files.overwrite is true") {
    val dir = Utils.createTempDir()
    val file = new File(dir, "file")
    try {
      Files.write("one", file, StandardCharsets.UTF_8)
      sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local-cluster[1,1,1024]")
        .set("spark.files.overwrite", "true"))
      sc.addFile(file.getAbsolutePath)
      def getAddedFileContents(): String = {
        sc.parallelize(Seq(0)).map { _ =>
          scala.io.Source.fromFile(SparkFiles.get("file")).mkString
        }.first()
      }
      assert(getAddedFileContents() === "one")
      Files.write("two", file, StandardCharsets.UTF_8)
      sc.addFile(file.getAbsolutePath)
      assert(getAddedFileContents() === "two")
    } finally {
      Utils.deleteRecursively(dir)
      sc.stop()
    }
  }
{code}
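The semantics the test expects can be sketched in plain Python (a hypothetical function, not Spark's file-fetching code): with the overwrite flag set, a re-added file replaces the stale local copy; without it, the cached copy wins.

```python
import os

# Hypothetical sketch of the behavior the test expects from
# spark.files.overwrite: an existing local copy is replaced only
# when the overwrite flag is true.
def fetch_file(contents, dest_path, overwrite):
    if os.path.exists(dest_path) and not overwrite:
        return False              # keep the cached copy, as without the flag
    with open(dest_path, "w") as f:
        f.write(contents)         # create, or replace the stale copy
    return True
```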
[jira] [Commented] (SPARK-19098) Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
[ https://issues.apache.org/jira/browse/SPARK-19098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847913#comment-15847913 ]

Steven Ruppert commented on SPARK-19098:

Actually, the "fix" of checkpointing the RDDs only seems to work in spark 2.0.2. Whatever is getting into kryo-serialized form is very subtle.

> Shuffled data leak/size doubling in ConnectedComponents/Pregel iterations
> ---------------------------------------------------------------------------
>
> Key: SPARK-19098
> URL: https://issues.apache.org/jira/browse/SPARK-19098
> Project: Spark
> Issue Type: Bug
> Components: GraphX
> Affects Versions: 2.0.2, 2.1.0
> Environment: Linux x64
> Cloudera CDH 5.8.0 hadoop (roughly hadoop 2.7.0)
> Spark on YARN, dynamic allocation with shuffle service
> Input/Output data on HDFS
> kryo serialization turned on
> checkpointing directory set on HDFS
> Reporter: Steven Ruppert
> Priority: Minor
> Attachments: doubling-season.png, Screen Shot 2017-01-30 at 18.36.43-fullpage.png
>
> I'm seeing a strange memory-leak-but-not-really problem in a pretty vanilla
> ConnectedComponents use, notably one that works fine with identical code on
> spark 2.0.1, but not on 2.1.0.
> I unfortunately haven't narrowed this down to a test case yet, nor do I have
> access to the original logs, so this initial report will be a little vague.
> However, this behavior as described might ring a bell to somebody.
> Roughly:
> {noformat}
> val edges: RDD[Edge[Int]] = _ // from file
> val vertices: RDD[(VertexId, Int)] = _ // from file
> val graph = Graph(vertices, edges)
> val components: RDD[(VertexId, ComponentId)] = ConnectedComponents
>   .run(graph, 10)
>   .vertices
> {noformat}
> Running this against my input of ~5B edges and ~3B vertices leads to a
> strange doubling of shuffle traffic in each round of Pregel (inside
> ConnectedComponents), increasing from the actual data size of ~50 GB, to
> 100 GB, to 200 GB, all the way to around 40 TB before I killed the job.
> The data being shuffled was apparently an RDD of ShippableVertexPartition.
> Oddly enough, only the kryo-serialized shuffled data doubled in size. The
> heap usage of the executors themselves remained stable, or at least did not
> account 1 to 1 for the 40 TB of shuffled data, for I definitely do not have
> 40 TB of RAM. Furthermore, I also have kryo reference tracking turned on
> still, so whatever is leaking somehow gets around that.
> I'll update this ticket once I have more details, unless somebody else with
> the same problem reports back first.
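The workaround discussed (checkpointing inside the iteration loop) can be sketched abstractly in plain Python with hypothetical names (not GraphX code): materializing every few rounds truncates the lineage that would otherwise grow with each Pregel iteration and be re-serialized with the shuffled data.

```python
# Hypothetical sketch of why periodic checkpointing helps: each Pregel
# round deepens the lineage shipped with the working set; a checkpoint
# materializes the state and resets that depth.
def pregel_with_checkpoints(state, step, rounds, checkpoint_every=3):
    value, lineage_depth = state
    for i in range(1, rounds + 1):
        value, lineage_depth = step(value), lineage_depth + 1
        if i % checkpoint_every == 0:
            lineage_depth = 0     # checkpoint: drop accumulated lineage
    return value, lineage_depth
```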
[jira] [Comment Edited] (SPARK-16599) java.util.NoSuchElementException: None.get at at org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343)
[ https://issues.apache.org/jira/browse/SPARK-16599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847889#comment-15847889 ] Jun Ye edited comment on SPARK-16599 at 2/1/17 2:25 AM: I got the same exception with the following code: (My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3) {code} case class MyClass(id: Int, name: String) val myDF = sparkSession.sparkContext .textFile("s3a://myS3Bucket/myTextFile.txt") .map(_.split("\t")) .map(_.map(_.trim)) .map(a => MyClass(a(0).toInt, a(1))) .toDF {code} The exception only happened at the following line: {code} .map(a => MyClass(a(0).toInt, a(1))) {code} If I removed this line, there is no exception. So I changed it to the following to bypass this exception. {code} val myDF = sparkSession.sparkContext .textFile("s3a://mys3bucket/myTextFile.txt") .map(_.split("\t")) .map(_.map(_.trim)) .map { case a: Array[String] => (a(0).toInt, a(1)) }.toDF("id", "name") {code} It works for me. was (Author: yetsun): I got the same exception with the following code: (My Spark version is 2.1.0. Scala version: 2.11.8. Hadoop version: 2.7.3) {code} case class MyClass(id: Int, name: String) val myDF = sparkSession.sparkContext .textFile("s3a://mys3bucket/myTextFile.txt") .map(_.split("\t")) .map(_.map(_.trim)) .map(a => MyClass(a(0).toInt, a(1))) .toDF {code} The exception only happened at the following line: {code} .map(a => MyClass(a(0).toInt, a(1))) {code} If I removed this line, there is no exception. So I changed it to the following to bypass this exception. {code} val myDF = sparkSession.sparkContext .textFile("s3a://mys3bucket/myTextFile.txt") .map(_.split("\t")) .map(_.map(_.trim)) .map { case a: Array[String] => (a(0).toInt, a(1)) }.toDF("id", "name") {code} It works for me. 
> java.util.NoSuchElementException: None.get at at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > -- > > Key: SPARK-16599 > URL: https://issues.apache.org/jira/browse/SPARK-16599 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: centos 6.7 spark 2.0 >Reporter: binde > > run a spark job with spark 2.0, error message > Job aborted due to stage failure: Task 0 in stage 821.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 821.0 (TID 1480, e103): > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.storage.BlockInfoManager.releaseAllLocksForTask(BlockInfoManager.scala:343) > at > org.apache.spark.storage.BlockManager.releaseAllLocksForTask(BlockManager.scala:644) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:281) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745)
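As a general note on the workaround quoted above: mapping to a plain tuple avoids serializing a REPL-defined case class in the task closure, which is a frequently reported trigger for this kind of failure. A condensed sketch (the identifiers and S3 path are the reporter's; the single-pass split-and-trim is an editorial simplification):

{code}
import sparkSession.implicits._ // needed for .toDF

val myDF = sparkSession.sparkContext
  .textFile("s3a://mys3bucket/myTextFile.txt")
  .map(_.split("\t").map(_.trim))   // split and trim in one pass
  .map(a => (a(0).toInt, a(1)))     // plain tuple instead of a case class
  .toDF("id", "name")
{code}

Moving the case class definition to the top level, outside the line that uses it, is another commonly reported way to sidestep the closure-serialization issue.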
[jira] [Resolved] (SPARK-19378) StateOperator metrics should still return the total number of rows in state even if there was no data for a trigger
[ https://issues.apache.org/jira/browse/SPARK-19378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-19378. --- Resolution: Fixed Fix Version/s: 3.0.0 2.1.1 Issue resolved by pull request 16716 [https://github.com/apache/spark/pull/16716] > StateOperator metrics should still return the total number of rows in state > even if there was no data for a trigger > --- > > Key: SPARK-19378 > URL: https://issues.apache.org/jira/browse/SPARK-19378 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.1.1, 3.0.0 > > > If you have a StreamingDataFrame with an aggregation, we report a metric > called stateOperators which consists of a list of data points per aggregation > for our query (with Spark 2.1, only one aggregation is supported). > These data points report: > - numUpdatedStateRows > - numTotalStateRows > If a trigger had no data - therefore was not fired - we return 0 data points, > however we should actually return a data point with > - numTotalStateRows: numTotalStateRows in lastExecution > - numUpdatedStateRows: 0 > This also affects eventTime statistics. We should still provide the min, max, > avg even though the data didn't change.
[jira] [Comment Edited] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847786#comment-15847786 ] Joseph K. Bradley edited comment on SPARK-12965 at 2/1/17 12:43 AM: I'd say this is both a SQL and MLlib issue, where the MLlib issue is blocked by the SQL one. * SQL: {{schema}} handles periods/quotes inconsistently relative to the rest of the Dataset API * ML: StringIndexer could avoid using schema.fieldNames and instead use an API provided by StructType for checking for the existence of a field. That said, that API needs to be added to StructType... I'm going to update this issue to be for ML only to handle fixing StringIndexer and link to a separate JIRA for the SQL issue. was (Author: josephkb): I'd say this is both a SQL and MLlib issue. I'm going to update this issue to be for ML only to handle fixing StringIndexer and link to a separate JIRA for the SQL issue. > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---++ > |a.b|a_b|a_bIndex| > +---+---++ > |foo|foo| 0.0| > |bar|bar| 1.0| > +---+---++ > {noformat} > but I think it should have another column, {{abIndex}} with the same contents > as a_bIndex.
> {code} > public class SparkMLDotColumn { > public static void main(String[] args) { > // Get the contexts > SparkConf conf = new SparkConf() > .setMaster("local[*]") > .setAppName("test") > .set("spark.ui.enabled", "false"); > JavaSparkContext sparkContext = new JavaSparkContext(conf); > SQLContext sqlContext = new SQLContext(sparkContext); > > // Create a schema with a single string column named "a.b" > StructType schema = new StructType(new StructField[] { > DataTypes.createStructField("a.b", > DataTypes.StringType, false) > }); > // Create a two-row RDD and DataFrame > List<Row> rows = Arrays.asList(RowFactory.create("foo"), > RowFactory.create("bar")); > JavaRDD<Row> rdd = sparkContext.parallelize(rows); > DataFrame df = sqlContext.createDataFrame(rdd, schema); > > df = df.withColumn("a_b", df.col("`a.b`")); > > StringIndexer indexer0 = new StringIndexer(); > indexer0.setInputCol("a_b"); > indexer0.setOutputCol("a_bIndex"); > df = indexer0.fit(df).transform(df); > > StringIndexer indexer1 = new StringIndexer(); > indexer1.setInputCol("`a.b`"); > indexer1.setOutputCol("abIndex"); > df = indexer1.fit(df).transform(df); > > df.show(); > } > } > {code}
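A hypothetical sketch of the StructType existence check the comment above asks for, resolving a possibly backtick-quoted name the way {{Dataset.col}} does ({{hasField}} is an assumed helper, not an existing API):

{code}
import org.apache.spark.sql.types.StructType

// Assumed helper: strip one level of backtick quoting before comparing
// against fieldNames, so "`a.b`" matches the literal column name "a.b".
def hasField(schema: StructType, name: String): Boolean = {
  val stripped =
    if (name.length >= 2 && name.startsWith("`") && name.endsWith("`"))
      name.substring(1, name.length - 1)
    else name
  schema.fieldNames.contains(stripped)
}
{code}

With such a helper, StringIndexer could accept {{setInputCol("`a.b`")}} consistently with {{df.col("`a.b`")}} instead of silently matching nothing.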
[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12965: -- Affects Version/s: (was: 1.6.0) 2.2.0 1.6.3 2.0.2 2.1.0
[jira] [Updated] (SPARK-12965) Indexer setInputCol() doesn't resolve column names like DataFrame.col()
[ https://issues.apache.org/jira/browse/SPARK-12965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12965: -- Component/s: (was: Spark Core) > Indexer setInputCol() doesn't resolve column names like DataFrame.col() > --- > > Key: SPARK-12965 > URL: https://issues.apache.org/jira/browse/SPARK-12965 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joshua Taylor > Attachments: SparkMLDotColumn.java > > > The setInputCol() method doesn't seem to resolve column names in the same way > that other methods do. E.g., given a DataFrame df, {{df.col("`a.b`")}} will > return a column. On a StringIndexer indexer, however, > {{indexer.setInputCol("`a.b`")}} leads to an indexer where fitting > and transforming seem to have no effect. Running the following code produces: > {noformat} > +---+---+--------+ > |a.b|a_b|a_bIndex| > +---+---+--------+ > |foo|foo| 0.0| > |bar|bar| 1.0| > +---+---+--------+ > {noformat} > but I think it should have another column, {{abIndex}}, with the same contents > as {{a_bIndex}}. 
> {code}
> public class SparkMLDotColumn {
>     public static void main(String[] args) {
>         // Get the contexts
>         SparkConf conf = new SparkConf()
>             .setMaster("local[*]")
>             .setAppName("test")
>             .set("spark.ui.enabled", "false");
>         JavaSparkContext sparkContext = new JavaSparkContext(conf);
>         SQLContext sqlContext = new SQLContext(sparkContext);
>
>         // Create a schema with a single string column named "a.b"
>         StructType schema = new StructType(new StructField[] {
>             DataTypes.createStructField("a.b", DataTypes.StringType, false)
>         });
>         // Create a two-row RDD and a DataFrame from it
>         List<Row> rows = Arrays.asList(RowFactory.create("foo"), RowFactory.create("bar"));
>         JavaRDD<Row> rdd = sparkContext.parallelize(rows);
>         DataFrame df = sqlContext.createDataFrame(rdd, schema);
>
>         // Copy the "a.b" column to a dot-free column "a_b"
>         df = df.withColumn("a_b", df.col("`a.b`"));
>
>         // Indexing the dot-free column works as expected
>         StringIndexer indexer0 = new StringIndexer();
>         indexer0.setInputCol("a_b");
>         indexer0.setOutputCol("a_bIndex");
>         df = indexer0.fit(df).transform(df);
>
>         // Indexing the backtick-quoted "a.b" column silently has no effect
>         StringIndexer indexer1 = new StringIndexer();
>         indexer1.setInputCol("`a.b`");
>         indexer1.setOutputCol("abIndex");
>         df = indexer1.fit(df).transform(df);
>
>         df.show();
>     }
> }
> {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods
Joseph K. Bradley created SPARK-19416: - Summary: Dataset.schema is inconsistent with Dataset in handling columns with periods Key: SPARK-19416 URL: https://issues.apache.org/jira/browse/SPARK-19416 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.2, 1.6.3, 2.2.0 Reporter: Joseph K. Bradley Priority: Minor
When you have a DataFrame with a column with a period in its name, the API is inconsistent about how to quote the column name. Here's a reproduction:
{code}
import org.apache.spark.sql.functions.col
val rows = Seq(
  ("foo", 1),
  ("bar", 2)
)
val df = spark.createDataFrame(rows).toDF("a.b", "id")
{code}
These methods are all consistent:
{code}
df.select("a.b")        // fails
df.select("`a.b`")      // succeeds
df.select(col("a.b"))   // fails
df.select(col("`a.b`")) // succeeds
df("a.b")               // fails
df("`a.b`")             // succeeds
{code}
But {{schema}} is inconsistent:
{code}
df.schema("a.b")   // succeeds
df.schema("`a.b`") // fails
{code}
"fails" produces error messages like:
{code}
org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input columns: [a.b, id];;
'Project ['a.b]
+- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517]
   +- LocalRelation [_1#1511, _2#1512]
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) at org.apache.spark.sql.Dataset.select(Dataset.scala:1121) at org.apache.spark.sql.Dataset.select(Dataset.scala:1139) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw$$iw.<init>(<console>:34) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw$$iw.<init>(<console>:41) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw$$iw.<init>(<console>:43) at line9667c6d14e79417280e5882aa52e0de727.$read$$iw.<init>(<console>:45) at line9667c6d14e79417280e5882aa52e0de727.$eval$.$print$lzycompute(<console>:7) at line9667c6d14e79417280e5882aa52e0de727.$eval$.$print(<console>:6)
{code}
"succeeds" produces:
{code}
org.apache.spark.sql.DataFrame = [a.b: string
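The inconsistency reported above can be modeled with two toy lookup functions. This is a hypothetical, pure-Python illustration of the shape of the bug, not Spark's internals: a StructType-style lookup matches the raw field name exactly, while an analyzer-style lookup first strips backticks and otherwise treats dots as a nested-field path, so the two disagree on both spellings.

```python
# Hypothetical stand-ins (not Spark code) for the two lookup paths.
fields = {"a.b": "string", "id": "int"}

def schema_lookup(name):
    # StructType-style: the raw name must match a field exactly.
    return fields.get(name)

def analyzer_lookup(name):
    # Analyzer-style: backticks are stripped; unquoted dots mean a path,
    # so "a.b" is not matched against the literal field name.
    if name.startswith("`") and name.endswith("`"):
        return fields.get(name[1:-1])
    return None if "." in name else fields.get(name)

print(schema_lookup("a.b"))      # succeeds
print(schema_lookup("`a.b`"))    # fails (None)
print(analyzer_lookup("`a.b`"))  # succeeds
print(analyzer_lookup("a.b"))    # fails (None)
```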
[jira] [Resolved] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precision loss
[ https://issues.apache.org/jira/browse/SPARK-19415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19415. Resolution: Duplicate Fix Version/s: 2.2.0 > Improve the implicit type conversion between numeric type and string to avoid > precision loss > > > Key: SPARK-19415 > URL: https://issues.apache.org/jira/browse/SPARK-19415 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > Fix For: 2.2.0 > > > Currently, Spark SQL will convert both numeric types and strings into > DoubleType; if the two children of an expression do not match (for example, > comparing a LongType against a StringType), this will cause precision loss in > some cases. > Some databases do a better job on this (for example, SQL Server [1]); we > should follow that. > [1] https://msdn.microsoft.com/en-us/library/ms191530.aspx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
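The precision loss described in this issue can be demonstrated with plain Python floats, which are the same IEEE-754 doubles as Spark's DoubleType. A 64-bit long can hold integers beyond the double's 53-bit significand, so once both sides of a long-vs-string comparison are cast to double, distinct values can compare equal:

```python
# Doubles carry 53 significand bits, so 2**53 + 1 cannot be represented
# exactly and rounds to 2**53 when cast.
big = 2**53 + 1
assert float(big) == float(2**53)  # precision lost: two different longs collide

# Comparing a long against a string by casting both to double (the behavior
# this JIRA criticizes) is therefore lossy:
lhs = 2**53 + 1
rhs = "9007199254740992"  # the decimal string for 2**53
print(float(lhs) == float(rhs))  # True, even though the exact values differ
```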
[jira] [Commented] (SPARK-19382) Test sparse vectors in LinearSVCSuite
[ https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847683#comment-15847683 ] Miao Wang commented on SPARK-19382: --- [~josephkb] If I understand correctly, I think we have to create separate tests for SparseVector. For example, assert(model.numFeatures === 2) in test("linear svc: default params"). In the DenseVector case, each vector has size 2, which determines model.numFeatures = summarizer.mean.size = n = instance.size = 2. However, if I create a SparseVector of size 20 with the same non-zero values as the DenseVector (i.e., 2 non-zero values and 18 zero values), model.numFeatures = 20, based on the logic above. Therefore, we should either create separate test cases for SparseVector or remove the test above. test("linearSVC comparison with R e1071 and scikit-learn") also fails in the all-SparseVector case. The other tests pass in the all-SparseVector case. I am generating a mixed test now. > Test sparse vectors in LinearSVCSuite > - > > Key: SPARK-19382 > URL: https://issues.apache.org/jira/browse/SPARK-19382 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, LinearSVCSuite does not test sparse vectors. We should. I > recommend that generateSVMInput be modified to create a mix of dense and > sparse vectors, rather than adding an additional test. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
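The point in the comment above, that numFeatures follows the declared vector size rather than the number of non-zero entries, can be sketched with toy stand-ins. These classes are hypothetical illustrations, not Spark's ML vector types:

```python
from collections import namedtuple

# Toy stand-ins (not Spark classes) for the dimensionality rule:
# a sparse vector declares its full size; a dense vector's size is its length.
DenseVector = namedtuple("DenseVector", ["values"])
SparseVector = namedtuple("SparseVector", ["size", "indices", "values"])

def num_features(vec):
    # Mirrors the idea that numFeatures = instance.size, whatever the storage.
    return vec.size if isinstance(vec, SparseVector) else len(vec.values)

dense = DenseVector([1.0, 2.0])
sparse = SparseVector(20, [0, 5], [1.0, 2.0])  # same 2 non-zeros, declared size 20

print(num_features(dense))   # 2
print(num_features(sparse))  # 20
```

This is why an assertion like assert(model.numFeatures === 2) passes for the dense data but fails once the same points are encoded as size-20 sparse vectors.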
[jira] [Created] (SPARK-19415) Improve the implicit type conversion between numeric type and string to avoid precision loss
Davies Liu created SPARK-19415: -- Summary: Improve the implicit type conversion between numeric type and string to avoid precision loss Key: SPARK-19415 URL: https://issues.apache.org/jira/browse/SPARK-19415 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Currently, Spark SQL will convert both numeric types and strings into DoubleType; if the two children of an expression do not match (for example, comparing a LongType against a StringType), this will cause precision loss in some cases. Some databases do a better job on this (for example, SQL Server [1]); we should follow that. [1] https://msdn.microsoft.com/en-us/library/ms191530.aspx -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847608#comment-15847608 ] sam elamin commented on SPARK-19414: Perfect, I'll do that, or should I say attempt it! But I'm loving the structured streaming API! Keep up the amazing work! Thanks again! > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19414. -- Resolution: Not A Bug > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847606#comment-15847606 ] Shixiong Zhu commented on SPARK-19414: -- Yes. > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847602#comment-15847602 ] sam elamin commented on SPARK-19414: [~zsxwing] Most probably, yes. Ah, OK, I think I understand what you might mean: in the source schema provider I can call BQ to get the schema and override it at that layer? > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18872) New test cases for EXISTS subquery
[ https://issues.apache.org/jira/browse/SPARK-18872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847596#comment-15847596 ] Apache Spark commented on SPARK-18872: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/16760 > New test cases for EXISTS subquery > -- > > Key: SPARK-18872 > URL: https://issues.apache.org/jira/browse/SPARK-18872 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Nattavut Sutyanyong >Assignee: Dilip Biswal > Fix For: 2.2.0 > > > This JIRA is for submitting a PR for new EXISTS/NOT EXISTS subquery test > cases. It follows the same idea as the IN subquery test cases, which contain > simple patterns and then build more complex constructs on both the parent and > subquery sides. This batch of test cases is mostly, if not entirely, positive > test cases that do not return any syntax errors or hit unsupported functionality. > We make an effort to have test cases return rows in the result set so that > they can indirectly detect incorrect-result problems. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847598#comment-15847598 ] Shixiong Zhu commented on SPARK-19414: -- [~samelamin] I know little about BigQuery. Does it provide an API to get the table schema? > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18871) New test cases for IN/NOT IN subquery
[ https://issues.apache.org/jira/browse/SPARK-18871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847592#comment-15847592 ] Apache Spark commented on SPARK-18871: -- User 'kevinyu98' has created a pull request for this issue: https://github.com/apache/spark/pull/16759 > New test cases for IN/NOT IN subquery > - > > Key: SPARK-18871 > URL: https://issues.apache.org/jira/browse/SPARK-18871 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Reporter: Nattavut Sutyanyong >Assignee: kevin yu > Fix For: 2.2.0 > > > This JIRA is open for submitting a PR for new test cases for IN/NOT IN > subquery. We plan to put approximately 100+ test cases under > `SQLQueryTestSuite`. The test cases range from IN/NOT IN subqueries with > a simple SELECT in both parent and subquery to subqueries with more complex > constructs on both sides (joins, aggregates, etc.). Test data include null > values and duplicate values. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847581#comment-15847581 ] sam elamin commented on SPARK-19414: Ah, thanks for clarifying! That's a bit of a shame; ideally I'd like to read a stream and infer the schema from it, but I guess spark.read.json isn't quite the same as spark.readStream.json, since one is under the sql namespace! My solution so far is to read one record from BigQuery, infer the schema there, then pass it down to the readStream method. It's a bit convoluted, to be honest, so if there is a cleaner or nicer way other than hardcoding the schema, I'm happy to hear it! > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847578#comment-15847578 ] Shixiong Zhu commented on SPARK-19414: -- Since you can get the DataFrame, the place that creates this DataFrame must know the schema. Right? > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847574#comment-15847574 ] Shixiong Zhu commented on SPARK-19414: -- Oh, I see: you cannot infer the schema when getting the DataFrame. The schema must be provided before running the streaming query, as SQL needs it to verify the query. > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847571#comment-15847571 ] Shixiong Zhu commented on SPARK-19414: -- [~samelamin] I would suggest that you take a look at the Kafka source: https://github.com/apache/spark/tree/master/external/kafka-0-10-sql The magic part is this file: https://github.com/apache/spark/blob/master/external/kafka-0-10-sql/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam elamin reopened SPARK-19414: > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847565#comment-15847565 ] sam elamin commented on SPARK-19414: Thanks [~zsxwing], but I don't have access to the dataframe at the StreamSourceProvider, so how can I set it at that level? Let me explain: I am calling the source first, then setting the schema from the returned dataframe. At the source level I have the dataframe, so I can override the schema, but at the StreamSourceProvider I don't know where to get it from? > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19414: - Issue Type: Question (was: Bug) > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19414. -- Resolution: Not A Bug > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Question > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19414) Inferring schema in a structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-19414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847558#comment-15847558 ] Shixiong Zhu commented on SPARK-19414: -- You also need to override `org.apache.spark.sql.sources.StreamSourceProvider.sourceSchema`. > Inferring schema in a structured streaming source > - > > Key: SPARK-19414 > URL: https://issues.apache.org/jira/browse/SPARK-19414 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Reporter: sam elamin > > Hi All > I am writing a connector to BigQuery that uses structured streaming, my > question is about schemas > I would like to be able to infer the schema from BQ rather than pass it in, > is there any way to overwrite the source schema in anything that extends > org.apache.spark.sql.execution.streaming.Source > Overriding the schema method doesn't seem to work -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
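The pattern Shixiong describes, providing the schema through the provider before the query runs rather than on the Source afterwards, can be sketched in miniature. Everything below is a hypothetical, pure-Python stand-in (the names `fetch_schema_from_bigquery`, `BigQuerySourceProvider`, `source_schema`, and `create_source` are illustrations, not Spark's Scala API):

```python
def fetch_schema_from_bigquery(table):
    # Stand-in for a call to BigQuery's table-metadata API; hardcoded here.
    return {"name": "string", "ts": "timestamp"}

class BigQuerySourceProvider:
    """Toy analogue of a StreamSourceProvider: the engine asks for the
    schema at planning time, before any source is constructed."""

    def source_schema(self, options):
        # Called first, so the query can be analyzed with a known schema.
        return fetch_schema_from_bigquery(options["table"])

    def create_source(self, options, schema):
        # The source is then built with the already-inferred schema,
        # so nothing downstream needs to override it later.
        return {"table": options["table"], "schema": schema}

provider = BigQuerySourceProvider()
schema = provider.source_schema({"table": "events"})
source = provider.create_source({"table": "events"}, schema)
print(source["schema"])
```

This mirrors why overriding only the Source's schema method does not work: the engine has already fixed the schema at the provider stage.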
[jira] [Commented] (SPARK-18085) Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847519#comment-15847519 ] Marcelo Vanzin commented on SPARK-18085: Storage tab available at this branch: https://github.com/vanzin/spark/tree/shs-ng/M4.3 > Better History Server scalability for many / large applications > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend
[ https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847473#comment-15847473 ] Shivaram Venkataraman commented on SPARK-18131: --- Hmm - this is tricky. We ran into a similar issue in SQL, and we added reader and writer objects in SQL that were registered to the method in core. See https://github.com/apache/spark/blob/ce112cec4f9bff222aa256893f94c316662a2a7e/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L39 for how we did that. We could do a similar thing in MLlib as well? cc [~mengxr] > Support returning Vector/Dense Vector from backend > -- > > Key: SPARK-18131 > URL: https://issues.apache.org/jira/browse/SPARK-18131 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > For `spark.logit`, there is a `probabilityCol`, which is a vector on the > backend (Scala side). When we do collect(select(df, "probabilityCol")), the > backend returns the Java object handle (memory address). We need to implement > a method to convert a Vector/DenseVector column to an R vector, which can be > read in SparkR. It is a follow-up JIRA to adding `spark.logit`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
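The registration idea Shivaram cites, where core exposes hooks and SQL (or MLlib) plugs in serializers for its own types, can be sketched as follows. This is a hypothetical, pure-Python model of the pattern, not Spark's SerDe code; the names `register_write_hook` and `serialize` are invented for illustration:

```python
# "Core" keeps a list of pluggable write hooks; each hook either handles a
# value (returning its serialized form) or declines by returning None.
_write_hooks = []

def register_write_hook(hook):
    _write_hooks.append(hook)

def serialize(obj):
    for hook in _write_hooks:
        out = hook(obj)
        if out is not None:
            return out
    return repr(obj)  # core's fallback for types no module claimed

# An "MLlib" module registers a handler for its vector type, exactly the way
# SQL registered its reader/writer objects with core in the linked SQLUtils.
class DenseVector:
    def __init__(self, values):
        self.values = values

register_write_hook(
    lambda o: list(o.values) if isinstance(o, DenseVector) else None
)

print(serialize(DenseVector([0.1, 0.9])))  # handled by the MLlib hook
print(serialize(42))                       # falls through to core's fallback
```

The design choice is that core never needs to know about MLlib's types; each module contributes its own converter at load time.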
[jira] (SPARK-19414) Inferring schema in a structured streaming source
sam elamin created SPARK-19414: -- Summary: Inferring schema in a structured streaming source Key: SPARK-19414 URL: https://issues.apache.org/jira/browse/SPARK-19414 Project: Spark Issue Type: Bug Components: Structured Streaming Reporter: sam elamin Hi all, I am writing a connector to BigQuery that uses structured streaming; my question is about schemas. I would like to be able to infer the schema from BQ rather than pass it in. Is there any way to override the source schema in anything that extends org.apache.spark.sql.execution.streaming.Source? Overriding the schema method doesn't seem to work. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
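The streaming Source contract declares `schema` as a method, so in principle a connector can compute it lazily on first access instead of taking it from the caller. Below is a minimal, self-contained sketch of that pattern; the `Source`/`StructType` definitions and the `BigQuerySource` class are hypothetical simplifications for illustration, not Spark's actual classes.

```scala
// Hypothetical stand-ins for Spark's streaming Source contract: these are
// simplifications for illustration, not Spark's actual classes.
case class StructField(name: String, dataType: String)
case class StructType(fields: Seq[StructField])

trait Source {
  // The schema is declared as a def, so a connector may compute it
  // rather than store a value handed in by the caller.
  def schema: StructType
}

// A connector that infers its schema on first access (e.g. from a remote
// table's metadata) instead of requiring the caller to pass one in. The
// name BigQuerySource reflects the reporter's use case, sketched here
// hypothetically.
class BigQuerySource(fetchRemoteSchema: () => StructType) extends Source {
  // A lazy val overriding the def: the remote lookup runs at most once.
  override lazy val schema: StructType = fetchRemoteSchema()
}
```

The key design point is that nothing in the trait forces the schema to be user-supplied; whether Spark's planner tolerates a schema that only becomes known at runtime is a separate question.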
[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2
[ https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847452#comment-15847452 ] Felix Cheung commented on SPARK-18864: -- SPARK-19066 LDA doesn't set optimizer correctly (This is also in 2.1.1 though) > Changes of MLlib and SparkR behavior for 2.2 > > > Key: SPARK-18864 > URL: https://issues.apache.org/jira/browse/SPARK-18864 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > This JIRA is for tracking changes of behavior within MLlib and SparkR for the > Spark 2.2 release. If any JIRAs change behavior, please list them below with > a short description of the change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2
[ https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-18864: - Comment: was deleted (was: SPARK-19291 spark.gaussianMixture supports output log-likelihood) > Changes of MLlib and SparkR behavior for 2.2 > > > Key: SPARK-18864 > URL: https://issues.apache.org/jira/browse/SPARK-18864 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > This JIRA is for tracking changes of behavior within MLlib and SparkR for the > Spark 2.2 release. If any JIRAs change behavior, please list them below with > a short description of the change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2
[ https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847449#comment-15847449 ] Felix Cheung commented on SPARK-18864: -- SPARK-19291 spark.gaussianMixture supports output log-likelihood > Changes of MLlib and SparkR behavior for 2.2 > > > Key: SPARK-18864 > URL: https://issues.apache.org/jira/browse/SPARK-18864 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > This JIRA is for tracking changes of behavior within MLlib and SparkR for the > Spark 2.2 release. If any JIRAs change behavior, please list them below with > a short description of the change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-19395) Convert coefficients in summary to matrix
[ https://issues.apache.org/jira/browse/SPARK-19395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19395. -- Resolution: Fixed Assignee: Wayne Zhang Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > Convert coefficients in summary to matrix > - > > Key: SPARK-19395 > URL: https://issues.apache.org/jira/browse/SPARK-19395 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Wayne Zhang >Assignee: Wayne Zhang > Fix For: 2.2.0 > > > The coefficients component in model summary should be 'matrix' but the > underlying structure is indeed list. This affects several models except for > 'AFTSurvivalRegressionModel' which has the correct implementation. The fix is > to first unlist the coefficients returned from the callJMethod before > converting to matrix. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2
[ https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847442#comment-15847442 ] Felix Cheung edited comment on SPARK-18864 at 1/31/17 8:24 PM: --- SPARK-19395 changes to summary format for coefficients was (Author: felixcheung): SPARK-19395 changes to summary format > Changes of MLlib and SparkR behavior for 2.2 > > > Key: SPARK-18864 > URL: https://issues.apache.org/jira/browse/SPARK-18864 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > This JIRA is for tracking changes of behavior within MLlib and SparkR for the > Spark 2.2 release. If any JIRAs change behavior, please list them below with > a short description of the change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-18864) Changes of MLlib and SparkR behavior for 2.2
[ https://issues.apache.org/jira/browse/SPARK-18864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847442#comment-15847442 ] Felix Cheung commented on SPARK-18864: -- SPARK-19395 changes to summary format > Changes of MLlib and SparkR behavior for 2.2 > > > Key: SPARK-18864 > URL: https://issues.apache.org/jira/browse/SPARK-18864 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > This JIRA is for tracking changes of behavior within MLlib and SparkR for the > Spark 2.2 release. If any JIRAs change behavior, please list them below with > a short description of the change. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-19382) Test sparse vectors in LinearSVCSuite
[ https://issues.apache.org/jira/browse/SPARK-19382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847391#comment-15847391 ] Miao Wang commented on SPARK-19382: --- I can try to submit a PR today or tomorrow. Thanks! > Test sparse vectors in LinearSVCSuite > - > > Key: SPARK-19382 > URL: https://issues.apache.org/jira/browse/SPARK-19382 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > Currently, LinearSVCSuite does not test sparse vectors. We should. I > recommend that generateSVMInput be modified to create a mix of dense and > sparse vectors, rather than adding an additional test. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-18131) Support returning Vector/Dense Vector from backend
[ https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847390#comment-15847390 ] Miao Wang commented on SPARK-18131: --- [~felixcheung][~yanboliang][~shivaram] I am trying to add the serialization utility to SerDe.scala. Inside def writeObject(dos: DataOutputStream, obj: Object, jvmObjectTracker: JVMObjectTracker): Unit, one case should be added: case v: org.apache.spark.ml.linalg.DenseVector => This file is in spark-core, so I can't import org.apache.spark.ml.linalg._ in this file because of a dependency issue. Do you have any suggestions? One possibility is to move `Vectors` from mllib-local to the core folder. I am not sure whether there are other options. Thanks! > Support returning Vector/Dense Vector from backend > -- > > Key: SPARK-18131 > URL: https://issues.apache.org/jira/browse/SPARK-18131 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > For `spark.logit`, there is a `probabilityCol`, which is a vector in the > backend (scala side). When we do collect(select(df, "probabilityCol")), > backend returns the java object handle (memory address). We need to implement > a method to convert a Vector/Dense Vector column to an R vector, which can be > read in SparkR. It is a followup JIRA of adding `spark.logit`. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
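The circular-dependency problem described above is what the registration approach (mentioned earlier in the thread, as done for SQLUtils) avoids: core keeps only a mutable hook, and the ML module installs a handler for its own types at initialization time. The following self-contained sketch shows the pattern; every name here (`SerDe`, `registerMLWriteObject`, `MLSerDeInit`, the stand-in `DenseVector`) is hypothetical, not Spark's actual SerDe API.

```scala
import java.io.DataOutputStream

// Core-side object: it knows nothing about ML types. The hook defaults to a
// handler that declines everything.
object SerDe {
  // Handler returns true if it serialized the object, false to fall through.
  private var mlWriteObject: (DataOutputStream, Object) => Boolean =
    (_, _) => false

  def registerMLWriteObject(f: (DataOutputStream, Object) => Boolean): Unit =
    mlWriteObject = f

  def writeObject(dos: DataOutputStream, obj: Object): Unit = obj match {
    case s: String => dos.writeUTF(s)
    case other =>
      // Defer to the registered ML handler instead of matching
      // org.apache.spark.ml.linalg.DenseVector here, which would create a
      // core -> mllib-local dependency.
      if (!mlWriteObject(dos, other))
        throw new IllegalArgumentException(s"Unsupported type: ${other.getClass}")
  }
}

// ML-module side: a stand-in for DenseVector, plus the registration call
// that would run when the ML module initializes.
class DenseVector(val values: Array[Double])

object MLSerDeInit {
  def init(): Unit = SerDe.registerMLWriteObject { (dos, obj) =>
    obj match {
      case v: DenseVector =>
        dos.writeInt(v.values.length)   // length prefix
        v.values.foreach(dos.writeDouble)
        true
      case _ => false
    }
  }
}
```

With this shape, core compiles without any ML import, and the ML module owns the wire format for its own types.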
[jira] (SPARK-19413) Basic mapGroupsWithState API
[ https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847356#comment-15847356 ] Apache Spark commented on SPARK-19413: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/16758 > Basic mapGroupsWithState API > > > Key: SPARK-19413 > URL: https://issues.apache.org/jira/browse/SPARK-19413 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Basic API (without timeouts) as described in the parent JIRA SPARK-19067 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-19413) Basic mapGroupsWithState API
[ https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19413: Assignee: Tathagata Das (was: Apache Spark) > Basic mapGroupsWithState API > > > Key: SPARK-19413 > URL: https://issues.apache.org/jira/browse/SPARK-19413 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > > Basic API (without timeouts) as described in the parent JIRA SPARK-19067 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-19413) Basic mapGroupsWithState API
[ https://issues.apache.org/jira/browse/SPARK-19413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19413: Assignee: Apache Spark (was: Tathagata Das) > Basic mapGroupsWithState API > > > Key: SPARK-19413 > URL: https://issues.apache.org/jira/browse/SPARK-19413 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > > Basic API (without timeouts) as described in the parent JIRA SPARK-19067 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-19067) mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState)
[ https://issues.apache.org/jira/browse/SPARK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-19067: -- Summary: mapGroupsWithState - arbitrary stateful operations with Structured Streaming (similar to DStream.mapWithState) (was: mapWithState Style API) > mapGroupsWithState - arbitrary stateful operations with Structured Streaming > (similar to DStream.mapWithState) > -- > > Key: SPARK-19067 > URL: https://issues.apache.org/jira/browse/SPARK-19067 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Reporter: Michael Armbrust >Assignee: Tathagata Das >Priority: Critical > > Right now the only way to do stateful operations is with Aggregator or > UDAF. However, this does not give users control of emission or expiration of > state, making it hard to implement things like sessionization. We should add > a more general construct (probably similar to {{DStream.mapWithState}}) to > structured streaming. Here is the design. 
> *Requirements* > - Users should be able to specify a function that can do the following > - Access the input row corresponding to a key > - Access the previous state corresponding to a key > - Optionally, update or remove the state > - Output any number of new rows (or none at all) > *Proposed API* > {code} > // New methods on KeyValueGroupedDataset > class KeyValueGroupedDataset[K, V] { > // Scala friendly > def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => U) > def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], State[S]) => Iterator[U]) > // Java friendly > def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U]) > } > // --- New Java-friendly function classes --- > public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable { > R call(K key, Iterator<V> values, State<S> state) throws Exception; > } > public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable { > Iterator<R> call(K key, Iterator<V> values, State<S> state) throws Exception; > } > // -- Wrapper class for state data -- > trait State[S] { > def exists(): Boolean > def get(): S // throws Exception if state does not exist > def getOption(): Option[S] > def update(newState: S): Unit > def remove(): Unit // exists() will be false after this > } > {code} > Key Semantics of the State class > - The state can be null. > - If state.remove() is called, then state.exists() will return false, and > getOption will return None. > - After state.update(newState) is called, then state.exists() will > return true, and getOption will return Some(...). > - None of the operations are thread-safe. This is to avoid memory barriers. 
> *Usage* > {code} > val stateFunc = (word: String, words: Iterator[String], runningCount: > State[Long]) => { > val newCount = words.size + runningCount.getOption.getOrElse(0L) > runningCount.update(newCount) > (word, newCount) > } > dataset // type > is Dataset[String] > .groupByKey[String](w => w) // generates > KeyValueGroupedDataset[String, String] > .mapGroupsWithState[Long, (String, Long)](stateFunc) // returns > Dataset[(String, Long)] > {code} > *Future Directions* > - Timeout based state expiration (that has not received data for a while) > - General expression based expiration -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
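The `State[S]` semantics listed in the design (exists/getOption/update/remove, with `get` throwing when no state exists) can be illustrated with a tiny in-memory stand-in. This is a sketch for intuition only, not Spark's implementation:

```scala
// Minimal in-memory sketch of the State[S] wrapper semantics from the
// design above; illustrative only, not Spark's implementation.
class State[S] {
  private var value: Option[S] = None
  def exists(): Boolean = value.isDefined
  // Throws if state does not exist, matching the documented contract.
  def get(): S = value.getOrElse(throw new NoSuchElementException("no state"))
  def getOption(): Option[S] = value
  def update(newState: S): Unit = value = Some(newState)
  def remove(): Unit = value = None  // exists() is false afterwards
}

// The running-count state function from the usage example, run against
// this stand-in.
def stateFunc(word: String, words: Iterator[String],
              runningCount: State[Long]): (String, Long) = {
  val newCount = words.size + runningCount.getOption().getOrElse(0L)
  runningCount.update(newCount)
  (word, newCount)
}
```

Calling `stateFunc` repeatedly for the same key accumulates the count across batches, which is exactly the per-key continuity the real API would provide via the state store.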
[jira] (SPARK-19413) Basic mapGroupsWithState API
Tathagata Das created SPARK-19413: - Summary: Basic mapGroupsWithState API Key: SPARK-19413 URL: https://issues.apache.org/jira/browse/SPARK-19413 Project: Spark Issue Type: Sub-task Components: Structured Streaming Reporter: Tathagata Das Assignee: Tathagata Das Basic API (without timeouts) as described in the parent JIRA SPARK-19067 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-10057) Faill to load class org.slf4j.impl.StaticLoggerBinder
[ https://issues.apache.org/jira/browse/SPARK-10057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847242#comment-15847242 ] Wes McKinney commented on SPARK-10057: -- I have been having a really horrible time trying to get Spark SQL to emit log messages when run from PySpark {code} $ pyspark Python 3.5.3 | packaged by conda-forge | (default, Jan 23 2017, 19:01:48) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)] on linux Type "help", "copyright", "credits" or "license" for more information. 17/01/31 12:59:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17/01/31 12:59:35 WARN Utils: Your hostname, badgerpad15 resolves to a loopback address: 127.0.0.1; using 172.31.22.23 instead (on interface wlan0) 17/01/31 12:59:35 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 2.2.0-SNAPSHOT /_/ Using Python version 3.5.3 (default, Jan 23 2017 19:01:48) SparkSession available as 'spark'. >>> df = sqlContext.read.parquet('example.parquet') SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. >>> {code} After this, any logging messages in Spark SQL that I'd like to see are suppressed. This is on vanilla Spark trunk. I'm not sure if this is the same bug, but can anyone give any advice? > Faill to load class org.slf4j.impl.StaticLoggerBinder > - > > Key: SPARK-10057 > URL: https://issues.apache.org/jira/browse/SPARK-10057 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.6.0 >Reporter: Davies Liu > > Some log messages are dropped, because it can't load the class > "org.slf4j.impl.StaticLoggerBinder" > {code} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". 
> SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-18609) [SQL] column mixup with CROSS JOIN
[ https://issues.apache.org/jira/browse/SPARK-18609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15847103#comment-15847103 ] Apache Spark commented on SPARK-18609: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/16757 > [SQL] column mixup with CROSS JOIN > -- > > Key: SPARK-18609 > URL: https://issues.apache.org/jira/browse/SPARK-18609 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Furcy Pin > > Reproduced on spark-sql v2.0.2 and on branch master. > {code} > DROP TABLE IF EXISTS p1 ; > DROP TABLE IF EXISTS p2 ; > CREATE TABLE p1 (col TIMESTAMP) ; > CREATE TABLE p2 (col TIMESTAMP) ; > set spark.sql.crossJoin.enabled = true; > -- EXPLAIN > WITH CTE AS ( > SELECT > s2.col as col > FROM p1 > CROSS JOIN ( > SELECT > e.col as col > FROM p2 E > ) s2 > ) > SELECT > T1.col as c1, > T2.col as c2 > FROM CTE T1 > CROSS JOIN CTE T2 > ; > {code} > This returns the following stacktrace : > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: col#21 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:55) > at > org.apache.spark.sql.execution.ProjectExec$$anonfun$4.apply(basicPhysicalOperators.scala:54) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:54) > at > org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153) > at > org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:244) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:218) > at > org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:40) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) > at > org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) > at > org.apache.spark.sql.execution.Spark
[jira] (SPARK-19407) defaultFS is used FileSystem.get instead of getting it from uri scheme
Title: Message Title Amit Assudani commented on SPARK-19407 Re: defaultFS is used FileSystem.get instead of getting it from uri scheme I'll do it over the weekend, you may assign it to me. I may need some logistics support as it will be my first PR. Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19412) HiveContext hive queries differ from CLI hive queries
Title: Message Title Enrique Sainz Baixauli updated an issue Spark / SPARK-19412 HiveContext hive queries differ from CLI hive queries Change By: Enrique Sainz Baixauli Hi, I've been wandering around trying to find a report related to this, but haven't found one so far. So here it is. I have found two differences between Hive queries made from SparkSQL and from Hive CLI (or beeswax or Hue): First: a query against a single partitioned table (although other, non-partitioned tables seem to work well) returns all column values in the first column, and null in the rest of them (field separator is '\t'): {code}scala> sqlContext.sql("use [database_name]")scala> val t1 = sqlContext.sql("select * from [table_name]")scala> t1.first.get(0)res9: Any = "0001 1 [...]"scala> t1.first.get(1)res10: Any = null{code} And the same for the following column indexes from 1 on. However, the same query in Hive CLI/Beeswax/Hue shows all columns filled and no null values at the end. Second: a join between two partitioned tables returns an empty DataFrame when the same query in Hive returns rows filled with values: {code}scala> val join = sqlContext.sql("select * from [table_1_name] t1 join [table_2_name] t2 on t1.id = t2.id where t2.data_timestamp_part = '1485428556075'")scala> join.first[...]java.util.NoSuchElementException: next on empty iterator{code} These two issues happen in Spark 1.5 (CDH 5.5.1) but seem to be solved in 1.6 (CDH 5.7.0). Is there a chance of backporting this fix? And is there a bug report, which I can't find, where this was reported before? Thank you all! P.S.: in this test, even if both tables are partitioned, the table in the first issue and the first table in the join, which are the same, only have one partition. That's why there is no where clause regarding partitions. Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19412) HiveContext hive queries differ from CLI hive queries
Title: Message Title Enrique Sainz Baixauli created an issue Spark / SPARK-19412 HiveContext hive queries differ from CLI hive queries Issue Type: Bug Affects Versions: 1.5.1 Assignee: Unassigned Components: SQL Created: 31/Jan/17 14:29 Priority: Major Reporter: Enrique Sainz Baixauli Hi, I've been wandering around trying to find a report related to this, but haven't found such so far. So here it is. I have found two differences between Hive queries made from SparkSQL and from Hive CLI (or beeswax or Hue): First: a query against a single partitioned table (although other, non-partitioned tables seem to work well) returns all column values in the first column, and null in the rest of them (field separator is '\t'): scala > sqlContext.sql("use [database_name]) scala> val t1 = sqlContext.sql("select * from [table_name]") scala> t1.first.get(0) res9: Any = "0001 1 [...]" scala> t1.first.get(1) res10: Any = null And the same for
[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown
Title: Message Title Apache Spark commented on SPARK-19411 Re: Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/16756 Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown
Title: Message Title Apache Spark assigned an issue to Apache Spark Spark / SPARK-19411 Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown Change By: Apache Spark Assignee: Apache Spark Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown
Title: Message Title Apache Spark assigned an issue to Unassigned Spark / SPARK-19411 Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown Change By: Apache Spark Assignee: Apache Spark Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19411) Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown
Title: Message Title Liang-Chi Hsieh created an issue Spark / SPARK-19411 Remove the metadata used to mark optional columns in merged Parquet schema for filter predicate pushdown Issue Type: Improvement Assignee: Unassigned Created: 31/Jan/17 13:46 Priority: Major Reporter: Liang-Chi Hsieh There is a metadata introduced before to mark the optional columns in merged Parquet schema for filter predicate pushdown. As we upgrade to Parquet 1.8.2 which includes the fix for the pushdown of optional columns, we don't need this metadata now. Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
Title: Message Title Hamel Ajay Kothari commented on SPARK-18278 Re: Support native submission of spark jobs to a kubernetes cluster Hi all, the pluggable scheduler sounds pretty valuable for my workflow as well. We've got our own scheduler where I work that we need to support in spark and currently that requires us to maintain an entire fork that we must keep up to date. If we could get a pluggable scheduler that would reduce our maintenance burden significantly. Has a JIRA been filed for this? I'd be happy to contribute some work on it. Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title zhengruifeng edited a comment on SPARK-19410 Re: Links to API documentation are broken Actually, except {{ml-tuning.md}} and {{ml-pipeline.md}}, all markdown files use {{}}. Only those broken links are in {{}} {code}egrep 'data\-lang=\"[a-z]+\">' docs/*md -n docs/ml-pipeline.md:209: docs/ml-pipeline.md:218: docs/ml-pipeline.md:227: docs/ml-pipeline.md:244: docs/ml-pipeline.md:251: docs/ml-pipeline.md:259: docs/ml-tuning.md:77: docs/ml-tuning.md:84: docs/ml-tuning.md:91: docs/ml-tuning.md:131:{code} Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title Apache Spark assigned an issue to Apache Spark Spark / SPARK-19410 Links to API documentation are broken Change By: Apache Spark Assignee: Apache Spark Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title Apache Spark assigned an issue to Unassigned Spark / SPARK-19410 Links to API documentation are broken Change By: Apache Spark Assignee: Apache Spark Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title Apache Spark commented on SPARK-19410 Re: Links to API documentation are broken User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/16754 Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title zhengruifeng commented on SPARK-19410 Re: Links to API documentation are broken Actually, except ml-tuning.md and ml-pipeline.md, all markdown files use . Only those broken links are in Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title zhengruifeng commented on SPARK-19410 Re: Links to API documentation are broken Sean Owen This also happens in ml-tuning.md. In TrainValidationSplit, the links for scala and java are OK, but the one for python doesn't work. After comparing the right links and the broken ones, I found that this should be caused by the div: links in are rendered correctly, but links in are broken. Thanks for pinging me. I will create a PR to fix this. Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19410) Links to API documentation are broken
Title: Message Title Sean Owen updated an issue Spark / SPARK-19410 Links to API documentation are broken Hm, sure enough. I don't see an obvious problem with the markdown. These were added in https://issues.apache.org/jira/browse/SPARK-18446 yet other apparently identical links added there seem to render correctly. zhengruifeng I'm a little stumped. Do you have any ideas? Still looking at the markdown here. Change By: Sean Owen Priority: Major Trivial Add Comment This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)
[jira] (SPARK-19409) Upgrade Parquet to 1.8.2
Reynold Xin resolved as Fixed. Spark / SPARK-19409: Upgrade Parquet to 1.8.2. Change By: Reynold Xin; Resolution: Fixed; Assignee: Dongjoon Hyun; Fix Version/s: 2.2.0; Status: In Progress → Resolved
[jira] (SPARK-19394) "assertion failed: Expected hostname" on macOS when self-assigned IP contains a percent sign
Sean Owen resolved as Not A Problem. Spark / SPARK-19394: "assertion failed: Expected hostname" on macOS when self-assigned IP contains a percent sign. I think that's the answer then. This is a corner case, and the hostname the machine is returning is not actually valid, as per RFC: https://en.wikipedia.org/wiki/Hostname Change By: Sean Owen; Resolution: Not A Problem; Status: Open → Resolved
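The RFC point above can be checked mechanically. A minimal sketch, assuming an approximation of the RFC 952/1123 label rules (this regex is illustrative, not Spark's actual validation code, and the sample hostnames are made up):

```python
import re

# Approximate RFC 952/1123 hostname check: labels of letters, digits,
# and hyphens, 1-63 chars each, not starting or ending with a hyphen.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
HOSTNAME_RE = re.compile(r"^(?:%s)(?:\.(?:%s))*$" % (LABEL, LABEL))

print(bool(HOSTNAME_RE.match("spark-master.local")))  # a valid hostname
print(bool(HOSTNAME_RE.match("169.254.17.42%en0")))   # '%' scope suffix: invalid
```

The `%en0` interface-scope suffix that macOS appends to self-assigned addresses is what trips the check: `%` is simply not a legal hostname character.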
[jira] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0
Apache Spark assigned an issue to Unassigned. Spark / SPARK-19296: Awkward changes for JdbcUtils.saveTable in Spark 2.1.0. Change By: Apache Spark; Assignee: (was: Apache Spark)
[jira] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0
Apache Spark assigned an issue to Apache Spark. Spark / SPARK-19296: Awkward changes for JdbcUtils.saveTable in Spark 2.1.0. Change By: Apache Spark; Assignee: Apache Spark
[jira] (SPARK-19296) Awkward changes for JdbcUtils.saveTable in Spark 2.1.0
Apache Spark commented on SPARK-19296, Re: Awkward changes for JdbcUtils.saveTable in Spark 2.1.0: User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16753
[jira] (SPARK-19410) Links to API documentation are broken
Aseem Bansal created an issue. Spark / SPARK-19410: Links to API documentation are broken. Issue Type: Documentation; Affects Versions: 2.1.0; Assignee: Unassigned; Components: Documentation; Created: 31/Jan/17 08:55; Priority: Major; Reporter: Aseem Bansal. I was looking at https://spark.apache.org/docs/latest/ml-pipeline.html#example-estimator-transformer-and-param and saw that the links to API documentation are broken.
[jira] (SPARK-19409) Upgrade Parquet to 1.8.2
Apache Spark assigned an issue to Unassigned. Spark / SPARK-19409: Upgrade Parquet to 1.8.2. Change By: Apache Spark; Assignee: (was: Apache Spark)
[jira] (SPARK-19409) Upgrade Parquet to 1.8.2
Apache Spark commented on SPARK-19409, Re: Upgrade Parquet to 1.8.2: User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/16751
[jira] (SPARK-19409) Upgrade Parquet to 1.8.2
Dongjoon Hyun updated an issue. Spark / SPARK-19409: Upgrade Parquet to 1.8.2. Change By: Dongjoon Hyun. Apache Parquet 1.8.2 was released officially last week, on 26 Jan. This issue aims to bump the Parquet version to 1.8.2, since it includes many fixes. https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E
[jira] (SPARK-19409) Upgrade Parquet to 1.8.2
Apache Spark assigned an issue to Apache Spark. Spark / SPARK-19409: Upgrade Parquet to 1.8.2. Change By: Apache Spark; Assignee: Apache Spark
[jira] (SPARK-19409) Upgrade Parquet to 1.8.2
Dongjoon Hyun created an issue. Spark / SPARK-19409: Upgrade Parquet to 1.8.2. Issue Type: Bug; Affects Versions: 2.1.0; Assignee: Unassigned; Components: Build; Created: 31/Jan/17 08:41; Priority: Major; Reporter: Dongjoon Hyun. Apache Parquet 1.8.2 was released officially last week, on 26 Jan. This issue aims to bump the Parquet version to 1.8.2. https://lists.apache.org/thread.html/af0c813f1419899289a336d96ec02b3bbeecaea23aa6ef69f435c142@%3Cdev.parquet.apache.org%3E
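A dependency bump like this typically lands as a one-line change to a version property in Spark's parent pom.xml. An illustrative sketch, assuming the conventional property name rather than the exact diff from the linked pull request:

```xml
<!-- Illustrative sketch of the dependency bump (property name assumed) -->
<properties>
  <parquet.version>1.8.2</parquet.version>
</properties>
```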
[jira] (SPARK-18937) Timezone support in CSV/JSON parsing
Apache Spark commented on SPARK-18937, Re: Timezone support in CSV/JSON parsing: User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/16750
[jira] (SPARK-18937) Timezone support in CSV/JSON parsing
Apache Spark assigned an issue to Unassigned. Spark / SPARK-18937: Timezone support in CSV/JSON parsing. Change By: Apache Spark; Assignee: (was: Apache Spark)
[jira] (SPARK-18937) Timezone support in CSV/JSON parsing
Apache Spark assigned an issue to Apache Spark. Spark / SPARK-18937: Timezone support in CSV/JSON parsing. Change By: Apache Spark; Assignee: Apache Spark
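The problem SPARK-18937 addresses can be illustrated outside Spark: a timestamp string in a CSV or JSON file usually carries no UTC offset, so the parser needs a configured timezone to pin it to an unambiguous instant. A minimal sketch in plain Python, with an arbitrary illustrative timezone of UTC+9:

```python
from datetime import datetime, timezone, timedelta

# A timestamp string as it might appear in a CSV file: no offset,
# so on its own it is ambiguous.
raw = "2017-01-31 08:55:00"
naive = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")

# Interpreting it under a configured timezone (here UTC+9) fixes the instant.
tz = timezone(timedelta(hours=9))
aware = naive.replace(tzinfo=tz)
print(aware.astimezone(timezone.utc).isoformat())  # 2017-01-30T23:55:00+00:00
```

A different configured timezone would yield a different UTC instant from the same string, which is exactly why the option matters for round-tripping data between clusters.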