[jira] [Commented] (SPARK-19536) Improve capability to merge SQL data types
[ https://issues.apache.org/jira/browse/SPARK-19536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860816#comment-15860816 ] Hyukjin Kwon commented on SPARK-19536: -- Thank you for kindly adding a link. > Improve capability to merge SQL data types > -- > > Key: SPARK-19536 > URL: https://issues.apache.org/jira/browse/SPARK-19536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: koert kuipers >Priority: Minor > > spark's union/merging of compatible types seems kind of weak. it works on > basic types in the top level record, but it fails for nested records, maps, > arrays, etc. > i would like to improve this. > for example i get errors like this: > {noformat} > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. StructType(StructField(_1,StringType,true), > StructField(_2,IntegerType,false)) <> > StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)) > at the first column of the second table > {noformat} > some examples that do work: > {noformat} > scala> Seq(1, 2, 3).toDF union Seq(1L, 2L, 3L).toDF > res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: bigint] > scala> Seq((1,"x"), (2,"x"), (3,"x")).toDF union Seq((1L,"x"), (2L,"x"), > (3L,"x")).toDF > res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: bigint, > _2: string] > {noformat} > what i would also expect to work but currently doesn't: > {noformat} > scala> Seq((Seq(1),"x"), (Seq(2),"x"), (Seq(3),"x")).toDF union > Seq((Seq(1L),"x"), (Seq(2L),"x"), (Seq(3L),"x")).toDF > scala> Seq((1,("x",1)), (2,("x",2)), (3,("x",3))).toDF union > Seq((1L,("x",1L)), (2L,("x",2L)), (3L,("x", 3L))).toDF > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
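The recursive widening the reporter asks for can be sketched in plain Scala. The ADT and `merge` function below are a simplified, hypothetical model, not Spark's actual `DataType` hierarchy or its type-coercion code; the point is that widening recurses into array element types and struct fields instead of stopping at the top level:

```scala
// Toy model of SQL data types (illustrative names, not Spark's API).
sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object StringType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class StructType(fields: Seq[(String, DataType)]) extends DataType

// Recursive widening: identical types merge to themselves, int widens to
// long, and arrays/structs merge element-by-element and field-by-field.
def merge(a: DataType, b: DataType): Option[DataType] = (a, b) match {
  case (x, y) if x == y => Some(x)
  case (IntegerType, LongType) | (LongType, IntegerType) => Some(LongType)
  case (ArrayType(x), ArrayType(y)) => merge(x, y).map(e => ArrayType(e))
  case (StructType(fs), StructType(gs)) if fs.map(_._1) == gs.map(_._1) =>
    val merged = fs.zip(gs).map { case ((n, x), (_, y)) => merge(x, y).map(t => n -> t) }
    if (merged.forall(_.isDefined)) Some(StructType(merged.map(_.get))) else None
  case _ => None
}
```

With this shape, the reporter's failing example — a struct of `(StringType, IntegerType)` against `(StringType, LongType)` — would merge instead of erroring.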
[jira] [Assigned] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17498: Assignee: Apache Spark > StringIndexer.setHandleInvalid should have another option 'new' > --- > > Key: SPARK-17498 > URL: https://issues.apache.org/jira/browse/SPARK-17498 > Project: Spark > Issue Type: Improvement > Components: ML > Reporter: Miroslav Balaz > Assignee: Apache Spark > > That will map an unseen label to the maximum known label + 1; IndexToString would map it back to "" or NA, if there is something like that in Spark.
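A minimal sketch of the requested 'new' option, using a hypothetical indexer rather than Spark ML's actual `StringIndexer` API: unseen labels get the index one past the largest known label, and the inverse mapping turns that index back into a placeholder such as NA.

```scala
// Hypothetical indexer illustrating the proposed handleInvalid = 'new'
// behaviour; not Spark ML code.
class ToyStringIndexer(labels: Seq[String]) {
  private val index: Map[String, Int] = labels.zipWithIndex.toMap
  private val unseenIndex: Int = labels.size // maximum known label + 1

  // Unseen labels map to the extra index instead of throwing.
  def transform(label: String): Int = index.getOrElse(label, unseenIndex)

  // The IndexToString side: the extra index round-trips to a placeholder.
  def inverse(i: Int): String = if (i == unseenIndex) "NA" else labels(i)
}
```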
[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860812#comment-15860812 ] Apache Spark commented on SPARK-17498: -- User 'VinceShieh' has created a pull request for this issue: https://github.com/apache/spark/pull/16883
[jira] [Assigned] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17498: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-19536) Improve capability to merge SQL data types
[ https://issues.apache.org/jira/browse/SPARK-19536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860748#comment-15860748 ] koert kuipers commented on SPARK-19536: --- great, that only leaves maps and nested structs then i think
[jira] [Assigned] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations
[ https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19544: Assignee: Apache Spark > Improve error message when some column types are compatible and others are > not in set/union operations > -- > > Key: SPARK-19544 > URL: https://issues.apache.org/jira/browse/SPARK-19544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > Currently, > {code} > Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF > org.apache.spark.sql.AnalysisException: Union can only be performed on tables > with the compatible column types. LongType <> IntegerType at the first column > of the second table;; > {code} > It prints that it fails due to {{LongType <> IntegerType}} whereas actually > it is {{StructType(...) <> StructType(...)}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations
[ https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860719#comment-15860719 ] Apache Spark commented on SPARK-19544: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/16882
[jira] [Assigned] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations
[ https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19544: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations
Hyukjin Kwon created SPARK-19544: Summary: Improve error message when some column types are compatible and others are not in set/union operations Key: SPARK-19544 URL: https://issues.apache.org/jira/browse/SPARK-19544 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Priority: Minor Currently, {code} Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. LongType <> IntegerType at the first column of the second table;; {code} It prints that it fails due to {{ LongType <> IntegerType}} whereas actually it is {{StructType(...) <> StructType(...)}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations
[ https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-19544: - Description: Currently, {code} Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. LongType <> IntegerType at the first column of the second table;; {code} It prints that it fails due to {{LongType <> IntegerType}} whereas actually it is {{StructType(...) <> StructType(...)}} was: Currently, {code} Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. LongType <> IntegerType at the first column of the second table;; {code} It prints that it fails due to {{ LongType <> IntegerType}} whereas actually it is {{StructType(...) <> StructType(...)}}
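The reporting fix can be illustrated with a toy model; the types and helper names below are assumptions, not Spark's analyzer code. The compatibility check still recurses, but the error message mentions the whole top-level column types rather than the innermost leaves where the recursion failed:

```scala
// Toy types for illustration only.
sealed trait DT
case object IntT extends DT
case object LongT extends DT
case object StrT extends DT
case class StructT(fields: Seq[DT]) extends DT

// Recursive compatibility check (int/long widen, structs compare fieldwise).
def compatible(a: DT, b: DT): Boolean = (a, b) match {
  case (x, y) if x == y => true
  case (IntT, LongT) | (LongT, IntT) => true
  case (StructT(fs), StructT(gs)) =>
    fs.length == gs.length && fs.zip(gs).forall { case (x, y) => compatible(x, y) }
  case _ => false
}

// Report the full top-level column types, not the leaf mismatch.
def unionError(left: Seq[DT], right: Seq[DT]): Option[String] =
  left.zip(right).zipWithIndex.collectFirst {
    case ((l, r), i) if !compatible(l, r) => s"$l <> $r at column $i"
  }
```

For the example in the issue, this would produce a message naming `StructT(...) <> StructT(...)` instead of `LongT <> IntT`.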
[jira] [Assigned] (SPARK-19543) from_json fails when the input row is empty
[ https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19543: Assignee: (was: Apache Spark) > from_json fails when the input row is empty > > > Key: SPARK-19543 > URL: https://issues.apache.org/jira/browse/SPARK-19543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Burak Yavuz > > Using from_json on a column with an empty string results in: > java.util.NoSuchElementException: head of empty list -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19543) from_json fails when the input row is empty
[ https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19543: Assignee: Apache Spark
[jira] [Commented] (SPARK-19543) from_json fails when the input row is empty
[ https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860671#comment-15860671 ] Apache Spark commented on SPARK-19543: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/16881
[jira] [Created] (SPARK-19543) from_json fails when the input row is empty
Burak Yavuz created SPARK-19543: --- Summary: from_json fails when the input row is empty Key: SPARK-19543 URL: https://issues.apache.org/jira/browse/SPARK-19543 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Burak Yavuz Using from_json on a column with an empty string results in: java.util.NoSuchElementException: head of empty list -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
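The failure mode suggests the parser produces an empty list of records for an empty row and something downstream calls `.head` on it. A minimal sketch of the defensive fix, with a stand-in parser (nothing below is Spark's actual `from_json` code):

```scala
// Stand-in for a JSON parser: splits "k=v" pairs, purely for illustration.
// An empty or blank row yields no records instead of a parse attempt.
def parseRecords(line: String): List[Map[String, String]] =
  if (line.trim.isEmpty) Nil
  else List(line.split(",").map { kv =>
    val Array(k, v) = kv.split("=", 2)
    k -> v
  }.toMap)

// The fix: pattern-match on the result instead of calling .head, so an
// empty input produces None (null column) rather than NoSuchElementException.
def fromJson(line: String): Option[Map[String, String]] =
  parseRecords(line) match {
    case Nil        => None
    case first :: _ => Some(first)
  }
```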
[jira] [Commented] (SPARK-19536) Improve capability to merge SQL data types
[ https://issues.apache.org/jira/browse/SPARK-19536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860665#comment-15860665 ] Hyukjin Kwon commented on SPARK-19536: -- Oh, for {{ArrayType}}, I filed a JIRA, SPARK-19435 and opened a PR, https://github.com/apache/spark/pull/16777
[jira] [Commented] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors
[ https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860661#comment-15860661 ] Apache Spark commented on SPARK-19542: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16880 > Delete the temp checkpoint if a query is stopped without errors > --- > > Key: SPARK-19542 > URL: https://issues.apache.org/jira/browse/SPARK-19542 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors
[ https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19542: Assignee: Apache Spark (was: Shixiong Zhu)
[jira] [Assigned] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors
[ https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19542: Assignee: Shixiong Zhu (was: Apache Spark)
[jira] [Created] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors
Shixiong Zhu created SPARK-19542: Summary: Delete the temp checkpoint if a query is stopped without errors Key: SPARK-19542 URL: https://issues.apache.org/jira/browse/SPARK-19542 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
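The intended lifecycle can be sketched as follows; `withTempCheckpoint` is a hypothetical helper, not Spark's implementation. The temporary checkpoint directory is removed only when the query body completes without throwing; on error it is kept so the query can be restarted from it.

```scala
import java.nio.file.{Files, Path}

// Delete a directory tree (sketch; ignores stream-closing niceties).
def deleteRecursively(p: Path): Unit = {
  if (Files.isDirectory(p)) Files.list(p).forEach(child => deleteRecursively(child))
  Files.delete(p)
}

// Run a query body against a temp checkpoint directory. If `run` throws,
// we never reach the delete, so the checkpoint survives for recovery;
// on a clean stop the temp checkpoint is removed.
def withTempCheckpoint[A](run: Path => A): A = {
  val dir = Files.createTempDirectory("temp-checkpoint-")
  val result = run(dir)
  deleteRecursively(dir) // clean stop: remove the temporary checkpoint
  result
}
```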
[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khamar updated SPARK-19540: - Issue Type: Improvement (was: Story) > Add ability to clone SparkSession wherein cloned session has a reference to > SharedState and an identical copy of the SessionState > - > > Key: SPARK-19540 > URL: https://issues.apache.org/jira/browse/SPARK-19540 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Kunal Khamar > > Forking a newSession() from SparkSession currently makes a new SparkSession > that does not retain SessionState (i.e. temporary tables, SQL config, > registered functions etc.) This change adds a method cloneSession() which > creates a new SparkSession with a copy of the parent's SessionState. > Subsequent changes to base session are not propagated to cloned session, > clone is independent after creation. > If the base is changed after clone has been created, say user registers new > UDF, then the new UDF will not be available inside the clone. Same goes for > configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
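The clone semantics described here can be modeled with a toy session (names are illustrative, not Spark's `SparkSession`/`SessionState`): the clone shares the `SharedState` reference but copies the session-local configuration, so later changes to the base do not propagate to the clone.

```scala
// Toy session: shared state by reference, session state by value.
class ToySession(
    val shared: AnyRef, // stand-in for SharedState, passed by reference
    initialConf: Map[String, String]) {
  private var conf: Map[String, String] = initialConf

  def set(key: String, value: String): Unit = conf += (key -> value)
  def get(key: String): Option[String] = conf.get(key)

  // cloneSession: same shared state, independent copy of session state.
  def cloneSession(): ToySession = new ToySession(shared, conf)
}
```

Because `conf` is an immutable map captured at clone time, registering a new "UDF" or config on the base afterwards is invisible to the clone, matching the behaviour described above.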
[jira] [Assigned] (SPARK-19541) High Availability support for ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19541: Assignee: Apache Spark > High Availability support for ThriftServer > -- > > Key: SPARK-19541 > URL: https://issues.apache.org/jira/browse/SPARK-19541 > Project: Spark > Issue Type: New Feature > Components: SQL > Affects Versions: 1.3.1 > Reporter: LvDeShui > Assignee: Apache Spark > > Currently, we use the Spark ThriftServer frequently, and there are many connections between clients and a single ThriftServer. When the ThriftServer goes down, the service is unavailable until the server is restarted, so we need to consider ThriftServer HA just as HiveServer HA. We want to adopt the HiveServer HA pattern for the ThriftServer: start multiple Thrift servers that register with ZooKeeper, so that a client can find a Thrift server by connecting to ZooKeeper and Beeline can get service from another Thrift server when one is down.
[jira] [Commented] (SPARK-19541) High Availability support for ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860545#comment-15860545 ] Apache Spark commented on SPARK-19541: -- User 'lvdongr' has created a pull request for this issue: https://github.com/apache/spark/pull/16879
[jira] [Assigned] (SPARK-19541) High Availability support for ThriftServer
[ https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19541: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-19539) CREATE TEMPORARY TABLE needs to avoid existing temp view
[ https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Wu updated SPARK-19539: --- Summary: CREATE TEMPORARY TABLE needs to avoid existing temp view (was: CREATE TEMPORARY TABLE need to avoid existing temp view) > CREATE TEMPORARY TABLE needs to avoid existing temp view > > > Key: SPARK-19539 > URL: https://issues.apache.org/jira/browse/SPARK-19539 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.1.0 > Reporter: Xin Wu > > Currently "CREATE TEMPORARY TABLE ..." is deprecated, and users are recommended to use "CREATE TEMPORARY VIEW ..." instead; it also does not support an "IF NOT EXISTS" clause. However, if there is an existing temporary view with the same name, issuing "CREATE TEMPORARY TABLE ..." can unintentionally replace that view.
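The guard being proposed can be sketched with a toy catalog (hypothetical API, not Spark's `SessionCatalog`): creating a temporary view fails when the name already exists, unless replacement is requested explicitly.

```scala
// Toy temp-view registry illustrating the "fail instead of silently
// replace" semantics; plans are just strings here for illustration.
class ToyTempViewCatalog {
  private var views: Map[String, String] = Map.empty

  def createTempView(name: String, plan: String, replace: Boolean = false): Unit = {
    // Refuse to clobber an existing view unless replace is explicit.
    if (views.contains(name) && !replace)
      throw new IllegalStateException(s"Temporary view '$name' already exists")
    views += (name -> plan)
  }

  def lookup(name: String): Option[String] = views.get(name)
}
```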
[jira] [Commented] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view
[ https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860512#comment-15860512 ] Apache Spark commented on SPARK-19539: -- User 'xwu0226' has created a pull request for this issue: https://github.com/apache/spark/pull/16878
[jira] [Assigned] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view
[ https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19539: Assignee: Apache Spark
[jira] [Assigned] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view
[ https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19539: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-19541) High Availability support for ThriftServer
LvDeShui created SPARK-19541: Summary: High Availability support for ThriftServer Key: SPARK-19541 URL: https://issues.apache.org/jira/browse/SPARK-19541 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.3.1 Reporter: LvDeShui Currently, we use the Spark ThriftServer frequently, and there are many connections between clients and a single ThriftServer. When the ThriftServer goes down, the service is unavailable until the server is restarted, so we need to consider ThriftServer HA just as HiveServer HA. We want to adopt the HiveServer HA pattern for the ThriftServer: start multiple Thrift servers that register with ZooKeeper, so that a client can find a Thrift server by connecting to ZooKeeper and Beeline can get service from another Thrift server when one is down.
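The ZooKeeper-based discovery pattern can be sketched with an in-memory registry standing in for ZooKeeper (everything below is illustrative; a real implementation would use ZooKeeper ephemeral nodes): servers register on startup, drop out on failure, and a client picks any live server.

```scala
// In-memory stand-in for a ZooKeeper service registry.
class ServerRegistry {
  private var servers: Set[String] = Set.empty

  def register(addr: String): Unit = servers += addr
  // With ZooKeeper ephemeral nodes this would happen automatically
  // when a server's session dies.
  def deregister(addr: String): Unit = servers -= addr

  // Client-side discovery: pick any live server, as Beeline would via ZK.
  def pickServer(): Option[String] = servers.headOption
}
```

When one Thrift server goes down and is deregistered, `pickServer` simply returns another live server, which is the failover behaviour the issue asks for.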
[jira] [Assigned] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19540: Assignee: Apache Spark > Add ability to clone SparkSession wherein cloned session has a reference to > SharedState and an identical copy of the SessionState > - > > Key: SPARK-19540 > URL: https://issues.apache.org/jira/browse/SPARK-19540 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.1.1 >Reporter: Kunal Khamar >Assignee: Apache Spark > > Forking a newSession() from SparkSession currently makes a new SparkSession > that does not retain SessionState (i.e. temporary tables, SQL config, > registered functions etc.) This change adds a method cloneSession() which > creates a new SparkSession with a copy of the parent's SessionState. > Subsequent changes to base session are not propagated to cloned session, > clone is independent after creation. > If the base is changed after clone has been created, say user registers new > UDF, then the new UDF will not be available inside the clone. Same goes for > configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19540: Assignee: (was: Apache Spark) > Add ability to clone SparkSession wherein cloned session has a reference to > SharedState and an identical copy of the SessionState > - > > Key: SPARK-19540 > URL: https://issues.apache.org/jira/browse/SPARK-19540 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.1.1 >Reporter: Kunal Khamar > > Forking a newSession() from SparkSession currently makes a new SparkSession > that does not retain SessionState (i.e. temporary tables, SQL config, > registered functions etc.) This change adds a method cloneSession() which > creates a new SparkSession with a copy of the parent's SessionState. > Subsequent changes to base session are not propagated to cloned session, > clone is independent after creation. > If the base is changed after clone has been created, say user registers new > UDF, then the new UDF will not be available inside the clone. Same goes for > configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState
[ https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860508#comment-15860508 ] Apache Spark commented on SPARK-19540: -- User 'kunalkhamar' has created a pull request for this issue: https://github.com/apache/spark/pull/16826 > Add ability to clone SparkSession wherein cloned session has a reference to > SharedState and an identical copy of the SessionState > - > > Key: SPARK-19540 > URL: https://issues.apache.org/jira/browse/SPARK-19540 > Project: Spark > Issue Type: Story > Components: SQL >Affects Versions: 2.1.1 >Reporter: Kunal Khamar > > Forking a newSession() from SparkSession currently makes a new SparkSession > that does not retain SessionState (i.e. temporary tables, SQL config, > registered functions etc.) This change adds a method cloneSession() which > creates a new SparkSession with a copy of the parent's SessionState. > Subsequent changes to base session are not propagated to cloned session, > clone is independent after creation. > If the base is changed after clone has been created, say user registers new > UDF, then the new UDF will not be available inside the clone. Same goes for > configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState
Kunal Khamar created SPARK-19540: Summary: Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState Key: SPARK-19540 URL: https://issues.apache.org/jira/browse/SPARK-19540 Project: Spark Issue Type: Story Components: SQL Affects Versions: 2.1.1 Reporter: Kunal Khamar Forking a newSession() from SparkSession currently makes a new SparkSession that does not retain SessionState (i.e. temporary tables, SQL config, registered functions, etc.). This change adds a method cloneSession() which creates a new SparkSession with a copy of the parent's SessionState. Subsequent changes to the base session are not propagated to the cloned session; the clone is independent after creation. If the base is changed after the clone has been created, say the user registers a new UDF, then the new UDF will not be available inside the clone. The same goes for configs and temp tables. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
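A sketch of the intended semantics in spark-shell terms. The cloneSession() call is the method proposed in this issue, not yet an existing API, so treat this as illustrative only:

{code:scala}
val base = org.apache.spark.sql.SparkSession.builder().master("local").getOrCreate()
base.udf.register("plusOne", (i: Int) => i + 1)

// Proposed: the clone copies the SessionState (temp views, confs, UDFs)
// and shares the SharedState (metastore, cached data) with the parent.
val cloned = base.cloneSession()

// Changes to the base *after* cloning are not propagated:
base.udf.register("timesTwo", (i: Int) => i * 2)
// cloned can resolve "plusOne" but not "timesTwo"; the same isolation
// applies to confs and temporary views registered after the clone.
{code}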
[jira] [Created] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view
Xin Wu created SPARK-19539: -- Summary: CREATE TEMPORARY TABLE need to avoid existing temp view Key: SPARK-19539 URL: https://issues.apache.org/jira/browse/SPARK-19539 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0 Reporter: Xin Wu Currently, "CREATE TEMPORARY TABLE ..." is deprecated in favor of "CREATE TEMPORARY VIEW ...", and it does not support an "IF NOT EXISTS" clause. However, if a temporary view with the same name already exists, issuing "CREATE TEMPORARY TABLE ..." can unintentionally replace that existing view. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
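A minimal illustration of the hazard described above ('/tmp/t' is a placeholder path, and the exact replacement behavior may vary by version):

{code:sql}
-- An existing temporary view:
CREATE TEMPORARY VIEW t AS SELECT 1 AS id;

-- Deprecated syntax; since IF NOT EXISTS is not supported there is no way
-- to guard this statement, and it can silently replace the temp view `t`:
CREATE TEMPORARY TABLE t (id INT) USING parquet OPTIONS (path '/tmp/t');
{code}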
[jira] [Assigned] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.
[ https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19538: Assignee: Apache Spark (was: Kay Ousterhout) > DAGScheduler and TaskSetManager can have an inconsistent view of whether a > stage is complete. > - > > Key: SPARK-19538 > URL: https://issues.apache.org/jira/browse/SPARK-19538 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Apache Spark > > The pendingPartitions in Stage tracks partitions that still need to be > computed, and is used by the DAGScheduler to determine when to mark the stage > as complete. In most cases, this variable is exactly consistent with the > tasks in the TaskSetManager (for the current version of the stage) that are > still pending. However, as discussed in SPARK-19263, these can become > inconsistent when an ShuffleMapTask for an earlier attempt of the stage > completes, in which case the DAGScheduler may think the stage has finished, > while the TaskSetManager is still waiting for some tasks to complete (see the > description in this pull request: > https://github.com/apache/spark/pull/16620). This leads to bugs like > SPARK-19263. Another problem with this behavior is that listeners can get > two StageCompleted messages: once when the DAGScheduler thinks the stage is > complete, and a second when the TaskSetManager later decides the stage is > complete. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.
[ https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860427#comment-15860427 ] Apache Spark commented on SPARK-19538: -- User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/16877 > DAGScheduler and TaskSetManager can have an inconsistent view of whether a > stage is complete. > - > > Key: SPARK-19538 > URL: https://issues.apache.org/jira/browse/SPARK-19538 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > > The pendingPartitions in Stage tracks partitions that still need to be > computed, and is used by the DAGScheduler to determine when to mark the stage > as complete. In most cases, this variable is exactly consistent with the > tasks in the TaskSetManager (for the current version of the stage) that are > still pending. However, as discussed in SPARK-19263, these can become > inconsistent when an ShuffleMapTask for an earlier attempt of the stage > completes, in which case the DAGScheduler may think the stage has finished, > while the TaskSetManager is still waiting for some tasks to complete (see the > description in this pull request: > https://github.com/apache/spark/pull/16620). This leads to bugs like > SPARK-19263. Another problem with this behavior is that listeners can get > two StageCompleted messages: once when the DAGScheduler thinks the stage is > complete, and a second when the TaskSetManager later decides the stage is > complete. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.
[ https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19538: Assignee: Kay Ousterhout (was: Apache Spark) > DAGScheduler and TaskSetManager can have an inconsistent view of whether a > stage is complete. > - > > Key: SPARK-19538 > URL: https://issues.apache.org/jira/browse/SPARK-19538 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout > > The pendingPartitions in Stage tracks partitions that still need to be > computed, and is used by the DAGScheduler to determine when to mark the stage > as complete. In most cases, this variable is exactly consistent with the > tasks in the TaskSetManager (for the current version of the stage) that are > still pending. However, as discussed in SPARK-19263, these can become > inconsistent when an ShuffleMapTask for an earlier attempt of the stage > completes, in which case the DAGScheduler may think the stage has finished, > while the TaskSetManager is still waiting for some tasks to complete (see the > description in this pull request: > https://github.com/apache/spark/pull/16620). This leads to bugs like > SPARK-19263. Another problem with this behavior is that listeners can get > two StageCompleted messages: once when the DAGScheduler thinks the stage is > complete, and a second when the TaskSetManager later decides the stage is > complete. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage
[ https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19537: Assignee: Kay Ousterhout (was: Apache Spark) > Move the pendingPartitions variable from Stage to ShuffleMapStage > - > > Key: SPARK-19537 > URL: https://issues.apache.org/jira/browse/SPARK-19537 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > This variable is only used by ShuffleMapStages, and it is confusing to have > it in the Stage class rather than the ShuffleMapStage class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage
[ https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860417#comment-15860417 ] Apache Spark commented on SPARK-19537: -- User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/16876 > Move the pendingPartitions variable from Stage to ShuffleMapStage > - > > Key: SPARK-19537 > URL: https://issues.apache.org/jira/browse/SPARK-19537 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > This variable is only used by ShuffleMapStages, and it is confusing to have > it in the Stage class rather than the ShuffleMapStage class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage
[ https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19537: Assignee: Apache Spark (was: Kay Ousterhout) > Move the pendingPartitions variable from Stage to ShuffleMapStage > - > > Key: SPARK-19537 > URL: https://issues.apache.org/jira/browse/SPARK-19537 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Apache Spark >Priority: Minor > > This variable is only used by ShuffleMapStages, and it is confusing to have > it in the Stage class rather than the ShuffleMapStage class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.
Kay Ousterhout created SPARK-19538: -- Summary: DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete. Key: SPARK-19538 URL: https://issues.apache.org/jira/browse/SPARK-19538 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout The pendingPartitions in Stage tracks partitions that still need to be computed, and is used by the DAGScheduler to determine when to mark the stage as complete. In most cases, this variable is exactly consistent with the tasks in the TaskSetManager (for the current version of the stage) that are still pending. However, as discussed in SPARK-19263, these can become inconsistent when a ShuffleMapTask for an earlier attempt of the stage completes, in which case the DAGScheduler may think the stage has finished, while the TaskSetManager is still waiting for some tasks to complete (see the description in this pull request: https://github.com/apache/spark/pull/16620). This leads to bugs like SPARK-19263. Another problem with this behavior is that listeners can get two StageCompleted messages: once when the DAGScheduler thinks the stage is complete, and a second time when the TaskSetManager later decides the stage is complete. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
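The inconsistency can be pictured with a simplified model of the two bookkeeping structures (this is not the actual scheduler code):

{code:scala}
// Stage.pendingPartitions, keyed by partition id:
val pendingPartitions = scala.collection.mutable.HashSet(0, 1)
// Pending tasks in the TaskSetManager for the *current* attempt:
val pendingTasks = scala.collection.mutable.HashSet(0, 1)

// A ShuffleMapTask for partition 0 from an *earlier* attempt completes.
// The DAGScheduler credits the stage with that output...
pendingPartitions -= 0
// ...but the current attempt's TaskSetManager is still running its own
// copy of task 0, so the two views now disagree:
assert(pendingPartitions != pendingTasks)
// Once partition 1 finishes, pendingPartitions is empty and the
// DAGScheduler may declare the stage complete before the TaskSetManager
// does; this is how listeners can see two StageCompleted events.
{code}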
[jira] [Created] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage
Kay Ousterhout created SPARK-19537: -- Summary: Move the pendingPartitions variable from Stage to ShuffleMapStage Key: SPARK-19537 URL: https://issues.apache.org/jira/browse/SPARK-19537 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor This variable is only used by ShuffleMapStages, and it is confusing to have it in the Stage class rather than the ShuffleMapStage class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
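The refactoring amounts to pushing the field down the class hierarchy (class shapes heavily simplified from the real scheduler code):

{code:scala}
import scala.collection.mutable.HashSet

// Before: the field lives in the base class, visible to ResultStage
// even though only ShuffleMapStage ever uses it.
abstract class Stage {
  val pendingPartitions = new HashSet[Int]
}

// After: declare it only in the subclass that needs it.
abstract class StageAfter
class ShuffleMapStageAfter extends StageAfter {
  val pendingPartitions = new HashSet[Int]
}
class ResultStageAfter extends StageAfter  // no pendingPartitions at all
{code}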
[jira] [Commented] (SPARK-5864) support .jar as python package
[ https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860373#comment-15860373 ] Apache Spark commented on SPARK-5864: - User 'kunalkhamar' has created a pull request for this issue: https://github.com/apache/spark/pull/16826 > support .jar as python package > -- > > Key: SPARK-5864 > URL: https://issues.apache.org/jira/browse/SPARK-5864 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.3.0 > > > Support .jar file as python package (same to .zip or .egg) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19536) Improve capability to merge SQL data types
koert kuipers created SPARK-19536: - Summary: Improve capability to merge SQL data types Key: SPARK-19536 URL: https://issues.apache.org/jira/browse/SPARK-19536 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: koert kuipers Priority: Minor spark's union/merging of compatible types seems kind of weak. it works on basic types in the top level record, but it fails for nested records, maps, arrays, etc. i would like to improve this. for example i get errors like this: {noformat} org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. StructType(StructField(_1,StringType,true), StructField(_2,IntegerType,false)) <> StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)) at the first column of the second table {noformat} some examples that do work: {noformat} scala> Seq(1, 2, 3).toDF union Seq(1L, 2L, 3L).toDF res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: bigint] scala> Seq((1,"x"), (2,"x"), (3,"x")).toDF union Seq((1L,"x"), (2L,"x"), (3L,"x")).toDF res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: bigint, _2: string] {noformat} what i would also expect to work but currently doesn't: {noformat} scala> Seq((Seq(1),"x"), (Seq(2),"x"), (Seq(3),"x")).toDF union Seq((Seq(1L),"x"), (Seq(2L),"x"), (Seq(3L),"x")).toDF scala> Seq((1,("x",1)), (2,("x",2)), (3,("x",3))).toDF union Seq((1L,("x",1L)), (2L,("x",2L)), (3L,("x", 3L))).toDF {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
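Until the analyzer learns to widen nested types, one manual workaround is to cast the nested field to the wider type before the union. A spark-shell sketch (requires import spark.implicits._; shown for the array case from the description):

{code:scala}
import org.apache.spark.sql.functions.col

val a = Seq((Seq(1), "x"), (Seq(2), "x")).toDF("_1", "_2")
val b = Seq((Seq(1L), "x"), (Seq(2L), "x")).toDF("_1", "_2")

// a union b fails: array<int> vs array<bigint>. Widening the Int side
// by hand makes both schemas identical, so the union is accepted:
val widened = a.select(col("_1").cast("array<bigint>").as("_1"), col("_2"))
widened union b
{code}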
[jira] [Closed] (SPARK-16713) Limit codegen method size to 8KB
[ https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-16713. - Resolution: Won't Fix According to the discussion on the PR, I think we can close this as WON'T FIX for now. > Limit codegen method size to 8KB > > > Key: SPARK-16713 > URL: https://issues.apache.org/jira/browse/SPARK-16713 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 2.0.0 >Reporter: Qifan Pu > > Ideally, we would want codegen methods to be less than 8KB of bytecode. > Beyond 8KB the JIT won't compile the method, which can cause performance > degradation. We have seen this for queries with a wide schema (30+ fields), > where agg_doAggregateWithKeys() can exceed 8KB. This is also a major reason > for the performance regression when we enable the fast aggregate hashmap > (such as when using VectorizedHashMapGenerator.scala). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints
[ https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860236#comment-15860236 ] Shixiong Zhu commented on SPARK-19525: -- [~rameshaaditya117] Sounds like a good idea. I thought the metadata checkpoint should be very small. How large are your checkpoint files? > Enable Compression of Spark Streaming Checkpoints > - > > Key: SPARK-19525 > URL: https://issues.apache.org/jira/browse/SPARK-19525 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Aaditya Ramesh > > In our testing, compressing partitions while writing them to checkpoints on > HDFS using snappy helped performance significantly while also reducing the > variability of the checkpointing operation. In our tests, checkpointing time > was reduced by 3X, and variability was reduced by 2X for data sets of > compressed size approximately 1 GB. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
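A sketch of the proposal: wrap the checkpoint partition output stream in Spark's configured codec (snappy by default). Note that CompressionCodec is private[spark], so this only illustrates what a patch inside Spark could do, not a user-level API:

{code:scala}
import java.io.OutputStream
import org.apache.spark.SparkEnv
import org.apache.spark.io.CompressionCodec

// Inside Spark's checkpoint write path, compress each partition before
// it is written to HDFS; the codec comes from spark.io.compression.codec.
def compressedCheckpointStream(out: OutputStream): OutputStream = {
  val codec = CompressionCodec.createCodec(SparkEnv.get.conf)
  codec.compressedOutputStream(out)
}
{code}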
[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory
[ https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860220#comment-15860220 ] Srikanth Daggumalli commented on SPARK-13510: - OOM is not an issue, but I would like to know about the following ERRORs: 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from /10.196.134.220:7337 is closed 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block shuffle_3_81_2, and will not retry (0 retries) Why are they written to the logs as ERROR? > Shuffle may throw FetchFailedException: Direct buffer memory > > > Key: SPARK-13510 > URL: https://issues.apache.org/jira/browse/SPARK-13510 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: spark-13510.diff > > > In our cluster, when I test spark-1.6.0 with a sql, it throws an exception > and fails. > {code} > 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request > for 1 blocks (915.4 MB) from 10.196.134.220:7337 > 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch > from 10.196.134.220:7337 (executor id 122) > 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 > to /10.196.134.220:7337 > 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in > connection from /10.196.134.220:7337 > java.lang.OutOfMemoryError: Direct buffer memory > at java.nio.Bits.reserveMemory(Bits.java:658) > at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) > at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) > at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) > at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > at > io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) > at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) > at > io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) > at > io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) > at > io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:744) > 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 > requests outstanding when connection from /10.196.134.220:7337 is closed > 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block > shuffle_3_81_2, and will not retry (0 retries) > {code} > The reason is that when shuffling a big block (like 1 GB), the task will > allocate the same amount of memory, so it will easily throw > "FetchFailedException: Direct buffer memory". > If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it will > throw > {code} > java.lang.OutOfMemoryError: Java heap space > at > io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607) > at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:215) > at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) > {code} > > In MapReduce shuffle, it first checks whether the block can be cached in > memory, but Spark doesn't. 
> If the block is larger than what we can cache in memory, we should write it to disk. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data
[ https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860110#comment-15860110 ] Joseph K. Bradley commented on SPARK-10802: --- Linking related issue for feature parity in DataFrame-based API. > Let ALS recommend for subset of data > > > Key: SPARK-10802 > URL: https://issues.apache.org/jira/browse/SPARK-10802 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.0 >Reporter: Tomasz Bartczak >Priority: Minor > > Currently MatrixFactorizationModel allows getting recommendations for > - a single user > - a single product > - all users > - all products > Recommendations for all users/products do a cartesian join inside. > It would be useful in some cases to get recommendations for a subset of > users/products by providing an RDD with which MatrixFactorizationModel could > do an intersection before doing the cartesian join. This would make it much > faster in situations where recommendations are needed only for a subset of > users/products, and where the subset is still too large to make it feasible > to recommend one-by-one. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
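The suggested intersection-before-cartesian could look roughly like this against the spark.mllib API (recommendForSubset is a hypothetical helper, not an existing method):

{code:scala}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

// Restrict the user factors to the requested subset *before* the
// expensive cartesian with the product factors.
def recommendForSubset(model: MatrixFactorizationModel,
                       userIds: RDD[Int], k: Int): RDD[(Int, Seq[(Int, Double)])] = {
  val subset = model.userFeatures.join(userIds.map((_, ()))).mapValues(_._1)
  // Cartesian of |subset| x |products| instead of |allUsers| x |products|:
  subset.cartesian(model.productFeatures)
    .map { case ((user, uf), (product, pf)) =>
      (user, (product, uf.zip(pf).map { case (x, y) => x * y }.sum))
    }
    .groupByKey()
    .mapValues(_.toSeq.sortBy(-_._2).take(k))
}
{code}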
[jira] [Created] (SPARK-19535) ALSModel recommendAll analogs
Joseph K. Bradley created SPARK-19535: - Summary: ALSModel recommendAll analogs Key: SPARK-19535 URL: https://issues.apache.org/jira/browse/SPARK-19535 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Assignee: Sue Ann Hong Add methods analogous to the spark.mllib MatrixFactorizationModel methods recommendProductsForUsers/UsersForProducts. The initial implementation should be very simple, using DataFrame joins. Future work can add optimizations. I recommend naming them: * recommendForAllUsers * recommendForAllItems -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
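A simple first cut of recommendForAllUsers with plain DataFrame operations, as the description suggests. The column names id/features follow ALSModel's userFactors/itemFactors; the method itself is what this issue proposes, so this is only a sketch:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Cross-join every user with every item, score with a dot product of the
// latent factors, then keep the top k items per user.
def recommendForAllUsers(userFactors: DataFrame, itemFactors: DataFrame, k: Int): DataFrame = {
  val dot = udf { (u: Seq[Float], i: Seq[Float]) =>
    u.zip(i).map { case (x, y) => x * y }.sum
  }
  val scored = userFactors.select(col("id").as("userId"), col("features").as("userF"))
    .crossJoin(itemFactors.select(col("id").as("itemId"), col("features").as("itemF")))
    .withColumn("rating", dot(col("userF"), col("itemF")))
  val byUser = Window.partitionBy("userId").orderBy(desc("rating"))
  scored.withColumn("rank", row_number().over(byUser))
    .where(col("rank") <= k)
    .select("userId", "itemId", "rating")
}
{code}

Future work can replace the cross join with the blocked top-k computation used by spark.mllib, but the join-based version is a correct baseline.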
[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB
[ https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860108#comment-15860108 ] Joseph K. Bradley commented on SPARK-13857: --- Hi all, catching up these many ALS discussions now. This work to support evaluation and tuning for recommendation is great, but I'm worried about it not being resolved in time for 2.2. I've heard a lot of requests for the plain functionality available in spark.mllib for recommendUsers/Products, so I'd recommend we just add those methods for now as a short-term solution. Let's keep working on the evaluation/tuning plans too. I'll create a JIRA for adding basic recommendUsers/Products methods. > Feature parity for ALS ML with MLLIB > > > Key: SPARK-13857 > URL: https://issues.apache.org/jira/browse/SPARK-13857 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Nick Pentreath >Assignee: Nick Pentreath > > Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods > {{recommendProducts/recommendUsers}} for recommending top K to a given user / > item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to > recommend top K across all users/items. > Additionally, SPARK-10802 is for adding the ability to do > {{recommendProductsForUsers}} for a subset of users (or vice versa). > Look at exposing or porting (as appropriate) these methods to ALS in ML. > Investigate if efficiency can be improved at the same time (see SPARK-11968). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19509) GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19509. --- Resolution: Fixed Assignee: StanZhai Fix Version/s: 2.1.1 2.0.3 > GROUPING SETS throws NullPointerException when use an empty column > -- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Assignee: StanZhai > Fix For: 2.0.3, 2.1.1 > > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at >
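A possible workaround sketch (untested, editor's note): with a single grouping set, the query is equivalent to a plain GROUP BY, which may avoid the aggregation code path that throws the NullPointerException here.

{code}
// Untested workaround sketch: "group by e grouping sets(e)" produces the
// same groups as a plain "group by e", so the GROUPING SETS path can be
// avoided entirely for this query shape.
spark.sql("select count(1) from test group by e").show()
{code}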
[jira] [Commented] (SPARK-19512) codegen for compare structs fails
[ https://issues.apache.org/jira/browse/SPARK-19512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860053#comment-15860053 ] Apache Spark commented on SPARK-19512: -- User 'bogdanrdc' has created a pull request for this issue: https://github.com/apache/spark/pull/16875 > codegen for compare structs fails > - > > Key: SPARK-19512 > URL: https://issues.apache.org/jira/browse/SPARK-19512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu > Fix For: 2.2.0 > > > This (1 struct field) > {code:java|title=1 struct field} > spark.range(10) > .selectExpr("named_struct('a', id) as col1", "named_struct('a', id+2) > as col2") > .filter("col1 = col2").count > {code} > fails with > {code} > [info] Cause: java.util.concurrent.ExecutionException: java.lang.Exception: > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 144, Column 32: Expression "range_value" is not an > rvalue > {code} > This (2 struct fields) > {code:java|title=2 struct fields} > spark.range(10) > .selectExpr("named_struct('a', id, 'b', id) as col1", > "named_struct('a',id+2, 'b',id+2) as col2") > .filter($"col1" === $"col2").count > {code} > fails with > {code} > Caused by: java.lang.IndexOutOfBoundsException: 1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
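Until the fix in the linked pull request lands, a possible workaround sketch (untested, editor's note) is to compare the struct fields individually rather than the structs themselves, which sidesteps the whole-struct comparison codegen that fails:

{code}
// Untested sketch: field-by-field comparison instead of whole-struct equality.
// Field access on a named_struct column ("col1.a") is ordinary Spark SQL.
spark.range(10)
  .selectExpr("named_struct('a', id, 'b', id) as col1",
              "named_struct('a', id+2, 'b', id+2) as col2")
  .filter("col1.a = col2.a and col1.b = col2.b").count
{code}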
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860028#comment-15860028 ] Shixiong Zhu commented on SPARK-19523: -- You can create a HiveContext before creating the StreamingContext. Then, in your functions, you can use "SQLContext.getOrCreate(rdd.sparkContext)" to get the HiveContext. Although it returns a SQLContext, the real type is actually the HiveContext. > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transforms incoming json files into a parquet table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > 
SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in the result table, like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
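A minimal sketch of the pattern suggested above (untested, editor's note; {{Spark.ssc}}, {{Record}}, and the table name are placeholders taken from the reporter's own code): create the HiveContext once on the driver before the streaming context starts, then look the context up inside {{foreachRDD}} instead of constructing a new HiveContext per batch.

{code}
// Untested sketch of the suggested fix. Create the context once, up front:
val hiveContext = new HiveContext(Spark.ssc.sparkContext)

client_log_d_stream.foreachRDD { rdd =>
  // Returns the existing context; its runtime type is the HiveContext above.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val input = rdd.map(x => Record(x._2.toString)).toDF()
  input.write.mode(SaveMode.Append).format("parquet").insertInto("temp.x_streaming")
}
{code}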
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860025#comment-15860025 ] Egor Pahomov commented on SPARK-19523: -- Probably my bad. I have {code} rdd.sparkContext {code} inside the lambda. I do not have {code} rdd.sqlContext {code}. Where can I get it? > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860023#comment-15860023 ] Egor Pahomov commented on SPARK-19524: -- I'm really confused. I expected "new" to mean the files created after the start of the streaming job, and "old" to be everything else in the folder. If we change the definition of "new", then I believe everything is consistent; it's just that I'm not sure this definition of "new" is very intuitive. I want to process everything in the folder, existing and upcoming, so I use this flag. And now it turns out that this flag has its own definition of "new". Maybe I'm not correct to call it a bug, but isn't it all very confusing for a person who does not really know how everything works inside? > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > The docs say: > newFilesOnly > Should process only new files and ignore existing files in the directory > It does not work that way. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says that it shouldn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > is not clear at all about what the code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
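For reference, the flag under discussion is the last argument of {{fileStream}}; a sketch of the call (untested, editor's note; the path and the Hadoop input types mirror the reporter's code from SPARK-19523):

{code}
// newFilesOnly = false asks the stream to also pick up files already present
// in the directory, but FileInputDStream still judges "new" against an
// internal remember window based on file modification time, which is the
// behavior being questioned above.
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/user/egor/test2",
  (path: Path) => true,   // accept every file name
  newFilesOnly = false)
{code}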
[jira] [Resolved] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner
[ https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19481. Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16825 [https://github.com/apache/spark/pull/16825] > Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in > ClosureCleaner > - > > Key: SPARK-19481 > URL: https://issues.apache.org/jira/browse/SPARK-19481 > Project: Spark > Issue Type: Test > Components: Spark Shell >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.1.1, 2.2.0 > > > org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the > tests unstable. See: > http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-10141. - Resolution: Done > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > I was using the LDAExample.scala code on an EC2 cluster with 16 workers > (r3.2xlarge), on a Wikipedia dataset: > {code} > Training set size (documents) 4534059 > Vocabulary size (terms) 1 > Training set size (tokens)895575317 > EM optimizer > 1K topics > {code} > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805) > at
[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-17975: - Assignee: Tathagata Das > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein >Assignee: Tathagata Das > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000-record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about its type in a way that causes the > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail:
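The ClassCastException pattern above can be illustrated without Spark (a hypothetical minimal sketch, editor's note; {{MyEdge}} is an invented stand-in, not the actual GraphX EdgeRDD code): because generic casts are unchecked under erasure, a collection can claim one element type while actually holding another, and the failure only surfaces when an element is used with the claimed type.

{code}
// Hypothetical sketch: the unchecked cast succeeds; the ClassCastException
// is deferred until an element is actually touched as a MyEdge.
case class MyEdge(srcId: Long, dstId: Long)

val raw: Seq[Any] = Seq((1L, 2L))          // actually holds a Tuple2
val edges = raw.asInstanceOf[Seq[MyEdge]]  // no error here (erasure, unchecked)
// edges.head.srcId                        // would throw ClassCastException at use
{code}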
[jira] [Resolved] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17975. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 2.0.3 > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > Fix For: 2.0.3, 2.1.1, 2.2.0 > > Attachments: docs.txt -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe,
[jira] [Updated] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:
[ https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14804: -- Fix Version/s: (was: 3.0.0) 2.2.0 > Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: > --- > > Key: SPARK-14804 > URL: https://issues.apache.org/jira/browse/SPARK-14804 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.1 >Reporter: SuYan >Assignee: Tathagata Das >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > {code} > graph3.vertices.checkpoint() > graph3.vertices.count() > graph3.vertices.map(_._2).count() > {code} > 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 > (TID 13, localhost): java.lang.ClassCastException: > org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to > scala.Tuple2 > at > com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:91) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > look at the code: > {code} > private[spark] def computeOrReadCheckpoint(split: Partition, context: > TaskContext): Iterator[T] = > { > if (isCheckpointedAndMaterialized) { > 
firstParent[T].iterator(split, context) > } else { > compute(split, context) > } > } > private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed > override def isCheckpointed: Boolean = { >firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed > } > {code} > For VertexRDD or EdgeRDD, the first parent is its partitionRDD, > RDD[ShippableVertexPartition[VD]]/RDD[(PartitionID, EdgePartition[ED, VD])]. > 1. We call vertexRDD.checkpoint; its partitionRDD will be checkpointed, so > VertexRDD.isCheckpointedAndMaterialized=true. > 2. Then we call vertexRDD.iterator; because isCheckpointed=true, it calls > firstParent.iterator (which is not a CheckpointRDD, but is actually the partitionRDD). > > So the returned iterator is an Iterator[ShippableVertexPartition], not the expected > Iterator[(VertexId, VD)]. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
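A simplified model of the walkthrough above (hypothetical sketch, editor's note; {{MiniRDD}} and its members are invented names, not the actual Spark source): the wrapper RDD answers {{isCheckpointed}} on behalf of its parent, so the read path short-circuits to the parent's element type instead of computing its own.

{code}
// Hypothetical sketch: if isCheckpointed delegates to the parent, the
// read path hands back the parent's elements (ShippableVertexPartition)
// instead of computing the wrapper's own (VertexId, VD) pairs.
abstract class MiniRDD[T] {
  def compute(): Iterator[T]
  def isCheckpointed: Boolean          // VertexRDD delegates this to its parent
  def readParentData(): Iterator[T]    // yields the parent's elements
  final def computeOrRead(): Iterator[T] =
    if (isCheckpointed) readParentData() else compute()
}
{code}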
[jira] [Resolved] (SPARK-16554) Spark should kill executors when they are blacklisted
[ https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-16554. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16650 [https://github.com/apache/spark/pull/16650] > Spark should kill executors when they are blacklisted > - > > Key: SPARK-16554 > URL: https://issues.apache.org/jira/browse/SPARK-16554 > Project: Spark > Issue Type: New Feature > Components: Scheduler >Reporter: Imran Rashid >Assignee: Jose Soltren > Fix For: 2.2.0 > > > SPARK-8425 will allow blacklisting faulty executors and nodes. However, > these blacklisted executors will continue to run. This is bad for a few > reasons: > (1) Even if there is faulty-hardware, if the cluster is under-utilized spark > may be able to request another executor on a different node. > (2) If there is a faulty-disk (the most common case of faulty-hardware), the > cluster manager may be able to allocate another executor on the same node, if > it can exclude the bad disk. (Yarn will do this with its disk-health > checker.) > With dynamic allocation, this may seem less critical, as a blacklisted > executor will stop running new tasks and eventually get reclaimed. However, > if there is cached data on those executors, they will not get killed till > {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} expires, which is > (effectively) infinite by default. > Users may not *always* want to kill bad executors, so this must be > configurable to some extent. At a minimum, it should be possible to enable / > disable it; perhaps the executor should be killed after it has been > blacklisted a configurable {{N}} times. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
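For reference, the behavior added by this change is gated behind configuration rather than always on. As of Spark 2.2 the relevant flags look like the following (flag names should be checked against the configuration docs for your exact version):

{noformat}
spark-submit \
  --conf spark.blacklist.enabled=true \
  --conf spark.blacklist.killBlacklistedExecutors=true \
  ...
{noformat}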
[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN
[ https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859984#comment-15859984 ] Joseph K. Bradley commented on SPARK-17975: --- Will do, thanks! > EMLDAOptimizer fails with ClassCastException on YARN > > > Key: SPARK-17975 > URL: https://issues.apache.org/jira/browse/SPARK-17975 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 > Environment: Centos 6, CDH 5.7, Java 1.7u80 >Reporter: Jeff Stein > Attachments: docs.txt > > > I'm able to reproduce the error consistently with a 2000-record text file > with each record having 1-5 terms and checkpointing enabled. It looks like > the problem was introduced with the resolution for SPARK-13355. > The EdgeRDD class seems to be lying about its type in a way that causes the > RDD.mapPartitionsWithIndex method to be unusable when it's referenced as an > RDD of Edge elements. > {code} > val spark = SparkSession.builder.appName("lda").getOrCreate() > spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") > val data: RDD[(Long, Vector)] = // snip > data.setName("data").cache() > val lda = new LDA > val optimizer = new EMLDAOptimizer > lda.setOptimizer(optimizer) > .setK(10) > .setMaxIterations(400) > .setAlpha(-1) > .setBeta(-1) > .setCheckpointInterval(7) > val ldaModel = lda.run(data) > {code} > {noformat} > 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID > 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be > cast to org.apache.spark.graphx.Edge > at > org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107) > at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332) > at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926) > at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866) > at > org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670) > at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:281) > at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at 
org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:722) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
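Until the underlying EdgeRDD issue is fixed, one way to sidestep the failing path is to disable checkpointing on the LDA run, since the reporter only sees the failure with checkpointing enabled (a sketch of the reporter's snippet with checkpointing off; in MLlib, a checkpoint interval of -1 disables checkpointing):

{code}
val lda = new LDA
lda.setOptimizer(new EMLDAOptimizer)
  .setK(10)
  .setMaxIterations(400)
  .setCheckpointInterval(-1)  // disable checkpointing to avoid the EdgeRDD cast
val ldaModel = lda.run(data)  // data: RDD[(Long, Vector)], as above
{code}

Note this trades away fault tolerance on long iterative runs, so it is a workaround rather than a fix.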
[jira] [Commented] (SPARK-10141) Number of tasks on executors still become negative after failures
[ https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859982#comment-15859982 ] Joseph K. Bradley commented on SPARK-10141: --- I'll close this if no one has seen it in Spark 2.0 or 2.1. Thanks all! > Number of tasks on executors still become negative after failures > - > > Key: SPARK-10141 > URL: https://issues.apache.org/jira/browse/SPARK-10141 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Joseph K. Bradley >Priority: Minor > Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png > > > I hit this failure when running LDA on EC2 (after I made the model size > really big). > I was using the LDAExample.scala code on an EC2 cluster with 16 workers > (r3.2xlarge), on a Wikipedia dataset: > {code} > Training set size (documents) 4534059 > Vocabulary size (terms) 1 > Training set size (tokens)895575317 > EM optimizer > 1K topics > {code} > Failure message: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in > stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 > (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to > /10.0.202.128:54740 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193) > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156) > at > org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > at 
java.util.concurrent.FutureTask.run(FutureTask.java:262) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740 > at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) > at > sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739) > at > io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224) > at > io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > ... 
1 more > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554) > at
[jira] [Resolved] (SPARK-19025) Remove SQL builder for operators
[ https://issues.apache.org/jira/browse/SPARK-19025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19025. --- Resolution: Fixed Assignee: Jiang Xingbo Fix Version/s: 2.2.0 > Remove SQL builder for operators > > > Key: SPARK-19025 > URL: https://issues.apache.org/jira/browse/SPARK-19025 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Jiang Xingbo >Assignee: Jiang Xingbo > Fix For: 2.2.0 > > > With the new approach of view resolution, we can get rid of SQL generation on > view creation, so let's remove SQL builder for operators. > Note that, since all sql generation for operators is defined in one file > (org.apache.spark.sql.catalyst.SQLBuilder), it’d be trivial to recover it in > the future. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19512) codegen for compare structs fails
[ https://issues.apache.org/jira/browse/SPARK-19512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19512. --- Resolution: Fixed Assignee: Bogdan Raducanu Fix Version/s: 2.2.0 > codegen for compare structs fails > - > > Key: SPARK-19512 > URL: https://issues.apache.org/jira/browse/SPARK-19512 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0 >Reporter: Bogdan Raducanu >Assignee: Bogdan Raducanu > Fix For: 2.2.0 > > > This (1 struct field) > {code:java|title=1 struct field} > spark.range(10) > .selectExpr("named_struct('a', id) as col1", "named_struct('a', id+2) > as col2") > .filter("col1 = col2").count > {code} > fails with > {code} > [info] Cause: java.util.concurrent.ExecutionException: java.lang.Exception: > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 144, Column 32: Expression "range_value" is not an > rvalue > {code} > This (2 struct fields) > {code:java|title=2 struct fields} > spark.range(10) > .selectExpr("named_struct('a', id, 'b', id) as col1", > "named_struct('a',id+2, 'b',id+2) as col2") > .filter($"col1" === $"col2").count > {code} > fails with > {code} > Caused by: java.lang.IndexOutOfBoundsException: 1 > at > scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65) > at scala.collection.immutable.List.apply(List.scala:84) > at > org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19514) Range is not interruptible
[ https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-19514. - Resolution: Fixed Assignee: Ala Luszczak Fix Version/s: 2.2.0 > Range is not interruptible > -- > > Key: SPARK-19514 > URL: https://issues.apache.org/jira/browse/SPARK-19514 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ala Luszczak >Assignee: Ala Luszczak > Fix For: 2.2.0 > > > Currently Range cannot be interrupted. > For example, if you start executing > spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count() > and then call > DAGScheduler.cancelStage(...) > the execution won't stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
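For context, cancellation is normally driven through job groups rather than the DAGScheduler directly; with this fix in place, a running range computation responds to a cancellation sketched like the following (setJobGroup/cancelJobGroup are standard SparkContext methods; the group id here is arbitrary):

{code}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Tag jobs started from this thread with a cancellable group id.
spark.sparkContext.setJobGroup("big-range", "cross join of two ranges", interruptOnCancel = true)
Future {
  spark.range(0, Long.MaxValue, 1).crossJoin(spark.range(0, Long.MaxValue, 1)).count()
}
// ... later, from another thread:
spark.sparkContext.cancelJobGroup("big-range")
{code}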
[jira] [Commented] (SPARK-18005) optional binary Dataframe Column throws (UTF8) is not a group while loading a Dataframe
[ https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859913#comment-15859913 ] Eric Maynard commented on SPARK-18005: -- This bug appears in the 1.6.x branch as well. > optional binary Dataframe Column throws (UTF8) is not a group while loading a > Dataframe > --- > > Key: SPARK-18005 > URL: https://issues.apache.org/jira/browse/SPARK-18005 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 >Reporter: ABHISHEK CHOUDHARY > > In some scenarios, while loading a Parquet file, Spark throws an exception: > java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not > a group > The entire DataFrame is not corrupted: I managed to load the first 20 rows of > the data, but trying to load the next one throws the error, and any operation > over the entire dataset (such as count) throws the same exception. > Full exception stack: > {quote} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in > stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 > (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains > (UTF8) is not a group > at org.apache.parquet.schema.Type.asGroupType(Type.java:202) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272) > at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at org.apache.spark.sql.types.StructType.map(StructType.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168) > at > org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192) > at > org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > 
Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151) > at > org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213) > at > org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919) > at >
[jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795 ] Christophe Préaud edited comment on SPARK-19477 at 2/9/17 5:01 PM: --- As a workaround, mapping the dataset with *identity* fixes the issue: {code:java} scala> ds.show +---+---+---+---+ | f1| f2| f3| c4| +---+---+---+---+ | a| b| c| x| | d| e| f| y| | h| i| j| z| +---+---+---+---+ scala> ds.map(identity).show +---+---+---+ | f1| f2| f3| +---+---+---+ | a| b| c| | d| e| f| | h| i| j| +---+---+---+ {code} was (Author: preaudc): As a workaround, mapping the dataset with *identity* fixes the issue: {code:scala} scala> ds.show +---+---+---+---+ | f1| f2| f3| c4| +---+---+---+---+ | a| b| c| x| | d| e| f| y| | h| i| j| z| +---+---+---+---+ scala> ds.map(identity).show +---+---+---+ | f1| f2| f3| +---+---+---+ | a| b| c| | d| e| f| | h| i| j| +---+---+---+ {code} > [SQL] Datasets created from a Dataframe with extra columns retain the extra > columns > --- > > Key: SPARK-19477 > URL: https://issues.apache.org/jira/browse/SPARK-19477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Drake > > In 1.6, when you created a Dataset from a Dataframe that had extra columns, > the columns not in the case class were dropped from the Dataset. 
> For example in 1.6, the column c4 is gone: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import sqlContext.implicits._ > import sqlContext.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: > string] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string] > scala> ds.show > +---+---+---+ > | f1| f2| f3| > +---+---+---+ > | a| b| c| > | d| e| f| > | h| i| j| > {code} > This seems to have changed in Spark 2.0 and also 2.1: > Spark 2.1.0: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import spark.implicits._ > import spark.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more > fields] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 
2 more > fields] > scala> ds.show > +---+---+---+---+ > | f1| f2| f3| c4| > +---+---+---+---+ > | a| b| c| x| > | d| e| f| y| > | h| i| j| z| > +---+---+---+---+ > scala> import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Encoders > scala> val fEncoder = Encoders.product[F] > fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: > string, f3[0]: string] > scala> fEncoder.schema == ds.schema > res2: Boolean = false > scala> ds.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true), StructField(c4,StringType,true)) > scala> fEncoder.schema > res4: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
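Besides the ds.map(identity) workaround noted in the comments, the extra column can also be dropped up front by projecting to exactly the case-class fields before converting (a sketch reusing the df and F from the report above):

{code}
// Select only the columns declared in F, then convert; c4 never
// makes it into the Dataset, so the encoder and Dataset schemas agree.
val ds = df.select("f1", "f2", "f3").as[F]
ds.show()
{code}

This keeps the schema of the Dataset identical to the encoder schema without paying for an extra map over every row.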
[jira] [Issue Comment Deleted] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christophe Préaud updated SPARK-19477: -- Comment: was deleted (was: As a workaround, mapping the dataset with *identity* fix the issue: {code:java} scala> ds.show +---+---+---+---+ | f1| f2| f3| c4| +---+---+---+---+ | a| b| c| x| | d| e| f| y| | h| i| j| z| +---+---+---+---+ scala> ds.map(identity).show +---+---+---+ | f1| f2| f3| +---+---+---+ | a| b| c| | d| e| f| | h| i| j| +---+---+---+ {code}) > [SQL] Datasets created from a Dataframe with extra columns retain the extra > columns > --- > > Key: SPARK-19477 > URL: https://issues.apache.org/jira/browse/SPARK-19477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Drake > > In 1.6, when you created a Dataset from a Dataframe that had extra columns, > the columns not in the case class were dropped from the Dataset. > For example in 1.6, the column c4 is gone: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import sqlContext.implicits._ > import sqlContext.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: > string] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string] > scala> ds.show > +---+---+---+ > | f1| f2| f3| > +---+---+---+ > | a| b| c| > | d| e| f| > | h| i| j| > {code} > This seems to have changed in Spark 2.0 and also 2.1: > Spark 2.1.0: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import spark.implicits._ > import spark.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 
2 more > fields] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more > fields] > scala> ds.show > +---+---+---+---+ > | f1| f2| f3| c4| > +---+---+---+---+ > | a| b| c| x| > | d| e| f| y| > | h| i| j| z| > +---+---+---+---+ > scala> import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Encoders > scala> val fEncoder = Encoders.product[F] > fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: > string, f3[0]: string] > scala> fEncoder.schema == ds.schema > res2: Boolean = false > scala> ds.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true), StructField(c4,StringType,true)) > scala> fEncoder.schema > res4: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795 ] Christophe Préaud edited comment on SPARK-19477 at 2/9/17 5:00 PM: --- As a workaround, mapping the dataset with *identity* fix the issue: {code:scala} scala> ds.show +---+---+---+---+ | f1| f2| f3| c4| +---+---+---+---+ | a| b| c| x| | d| e| f| y| | h| i| j| z| +---+---+---+---+ scala> ds.map(identity).show +---+---+---+ | f1| f2| f3| +---+---+---+ | a| b| c| | d| e| f| | h| i| j| +---+---+---+ {code} was (Author: preaudc): As a workaround, mapping the dataset with *identity* fix the issue: ``` scala> ds.show +---+---+---+---+ | f1| f2| f3| c4| +---+---+---+---+ | a| b| c| x| | d| e| f| y| | h| i| j| z| +---+---+---+---+ scala> ds.map(identity).show +---+---+---+ | f1| f2| f3| +---+---+---+ | a| b| c| | d| e| f| | h| i| j| +---+---+---+ ``` > [SQL] Datasets created from a Dataframe with extra columns retain the extra > columns > --- > > Key: SPARK-19477 > URL: https://issues.apache.org/jira/browse/SPARK-19477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Drake > > In 1.6, when you created a Dataset from a Dataframe that had extra columns, > the columns not in the case class were dropped from the Dataset. 
> For example in 1.6, the column c4 is gone: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import sqlContext.implicits._ > import sqlContext.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: > string] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string] > scala> ds.show > +---+---+---+ > | f1| f2| f3| > +---+---+---+ > | a| b| c| > | d| e| f| > | h| i| j| > {code} > This seems to have changed in Spark 2.0 and also 2.1: > Spark 2.1.0: > {code} > scala> case class F(f1: String, f2: String, f3:String) > defined class F > scala> import spark.implicits._ > import spark.implicits._ > scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", > "j","z")).toDF("f1", "f2", "f3", "c4") > df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more > fields] > scala> val ds = df.as[F] > ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 
2 more > fields] > scala> ds.show > +---+---+---+---+ > | f1| f2| f3| c4| > +---+---+---+---+ > | a| b| c| x| > | d| e| f| y| > | h| i| j| z| > +---+---+---+---+ > scala> import org.apache.spark.sql.Encoders > import org.apache.spark.sql.Encoders > scala> val fEncoder = Encoders.product[F] > fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: > string, f3[0]: string] > scala> fEncoder.schema == ds.schema > res2: Boolean = false > scala> ds.schema > res3: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true), StructField(c4,StringType,true)) > scala> fEncoder.schema > res4: org.apache.spark.sql.types.StructType = > StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), > StructField(f3,StringType,true)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
[ https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795 ] Christophe Préaud edited comment on SPARK-19477 at 2/9/17 4:59 PM:
---
As a workaround, mapping the dataset with *identity* fixes the issue:

```scala
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```

was (Author: preaudc):
As a workaround, mapping the dataset with *identity* fixes the issue:

```scala
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```

> [SQL] Datasets created from a Dataframe with extra columns retain the extra columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3: String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3: String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), StructField(f3,StringType,true))
> {code}

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859737#comment-15859737 ] Peter D Kirchner commented on SPARK-19348:
--
Per the above, perhaps a fix could address both thread safety and the orphaned modifications to variables in the wrapped function bodies. For instance, in Pipeline.__init__() and setParams() the fix could remove references to _input_kwargs and instead invoke setParams(stages=stages) within __init__() and _set(stages=stages) within setParams(), respectively. Parallel changes would be needed in all wrapped functions, but the resulting code would be functional, readable, and thread-safe.

A fix for Pipeline alone is insufficient, because multiple pipelines could have stages consisting of instances of the same class, e.g. LogisticRegression. All the classes using @keyword_only need to be addressed by whatever fix is decided upon.

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> ---
>
> Key: SPARK-19348
> URL: https://issues.apache.org/jira/browse/SPARK-19348
> Project: Spark
> Issue Type: Bug
> Components: ML, PySpark
> Affects Versions: 1.6.0, 2.0.0, 2.1.0
> Reporter: Vinayak Joshi
> Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate python threads, it is observed that the stages used to construct a pipeline object get corrupted, i.e. the stages supplied to a Pipeline object in one thread appear inside a different Pipeline object constructed in a different thread.
> Things work fine if construction of pyspark.ml.Pipeline objects is serialized, so this looks like a thread safety problem with pyspark.ml.Pipeline object construction.
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know whether other pipeline operations, such as pyspark.ml.pipeline.fit(), are also affected by the underlying cause of this problem. That is, whether other pipeline operations like pyspark.ml.pipeline.fit() may be performed in separate threads (on distinct pipeline objects) concurrently without any cross contamination between them.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
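The race discussed in the comment above, and one thread-safe alternative, can be sketched in plain Python. This is a hypothetical illustration of the `@keyword_only` pattern, not the actual pyspark source: `Pipeline`, `setParams`, and `_input_kwargs` here are simplified stand-ins, and storing the kwargs per instance is only one of the possible fixes (the comment proposes another, removing `_input_kwargs` entirely).

```python
import threading


def keyword_only(func):
    """Force keyword-only calls and record the kwargs on *self*.

    Storing the kwargs per instance (rather than on the function or
    class, where concurrent constructors would overwrite each other)
    means threads cannot observe one another's arguments.
    """
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s only takes keyword arguments." % func.__name__)
        self._input_kwargs = kwargs  # per-instance: no shared mutable state
        return func(self, **kwargs)
    return wrapper


class Pipeline:
    @keyword_only
    def __init__(self, stages=None):
        self.stages = None
        self.setParams(**self._input_kwargs)

    @keyword_only
    def setParams(self, stages=None):
        self.stages = self._input_kwargs.get("stages")
        return self


def build_concurrently(n):
    """Construct n pipelines in parallel threads; return each one's stages."""
    results = [None] * n

    def build(i):
        results[i] = Pipeline(stages=[i]).stages

    threads = [threading.Thread(target=build, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With per-instance storage, `build_concurrently(n)` yields `[[0], [1], ..., [n-1]]` regardless of thread interleaving; with the kwargs held at function or class level, one thread's stages could leak into another thread's pipeline, which is the corruption the report describes.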
[jira] [Deleted] (SPARK-19494) Remove java7 pom profile
[ https://issues.apache.org/jira/browse/SPARK-19494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin deleted SPARK-19494: > Remove java7 pom profile > > > Key: SPARK-19494 > URL: https://issues.apache.org/jira/browse/SPARK-19494 > Project: Spark > Issue Type: Sub-task >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19494) Remove java7 pom profile
[ https://issues.apache.org/jira/browse/SPARK-19494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19494: -- Priority: Trivial (was: Major) (I think this part is pretty trivial and covered in the base change for the parent JIRA.) > Remove java7 pom profile > > > Key: SPARK-19494 > URL: https://issues.apache.org/jira/browse/SPARK-19494 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features
[ https://issues.apache.org/jira/browse/SPARK-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19534: -- Issue Type: Sub-task (was: Task) Parent: SPARK-19493 > Convert Java tests to use lambdas, Java 8 features > -- > > Key: SPARK-19534 > URL: https://issues.apache.org/jira/browse/SPARK-19534 > Project: Spark > Issue Type: Sub-task > Components: Tests >Affects Versions: 2.2.0 >Reporter: Sean Owen >Assignee: Sean Owen > > Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a > significant sub-task in its own right. This shouldn't mean that 'old' APIs go > untested because there are no separate Java 8 APIs; it's just syntactic sugar > for calls to the same APIs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features
Sean Owen created SPARK-19534: - Summary: Convert Java tests to use lambdas, Java 8 features Key: SPARK-19534 URL: https://issues.apache.org/jira/browse/SPARK-19534 Project: Spark Issue Type: Task Components: Tests Affects Versions: 2.2.0 Reporter: Sean Owen Assignee: Sean Owen Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a significant sub-task in its own right. This shouldn't mean that 'old' APIs go untested because there are no separate Java 8 APIs; it's just syntactic sugar for calls to the same APIs. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features
Sean Owen created SPARK-19533: - Summary: Convert Java examples to use lambdas, Java 8 features Key: SPARK-19533 URL: https://issues.apache.org/jira/browse/SPARK-19533 Project: Spark Issue Type: Task Components: Examples Affects Versions: 2.2.0 Reporter: Sean Owen Assignee: Sean Owen As a subtask of the overall migration to Java 8, we can and probably should use Java 8 lambdas to simplify the Java examples. I'm marking this as a subtask in its own right because it's a pretty big change by lines. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features
[ https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19533: -- Issue Type: Sub-task (was: Task) Parent: SPARK-19493 > Convert Java examples to use lambdas, Java 8 features > - > > Key: SPARK-19533 > URL: https://issues.apache.org/jira/browse/SPARK-19533 > Project: Spark > Issue Type: Sub-task > Components: Examples >Affects Versions: 2.2.0 >Reporter: Sean Owen >Assignee: Sean Owen > > As a subtask of the overall migration to Java 8, we can and probably should > use Java 8 lambdas to simplify the Java examples. I'm marking this as a > subtask in its own right because it's a pretty big change by lines. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true
StanZhai created SPARK-19532:
Summary: [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true
Key: SPARK-19532
URL: https://issues.apache.org/jira/browse/SPARK-19532
Project: Spark
Issue Type: Bug
Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Blocker

When `spark.speculation` is set to true, the thread dump page of an Executor in the WebUI shows about 1300 threads named "DataStreamer for file /test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet" in TIMED_WAITING state.
{code}
java.lang.Object.wait(Native Method)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
{code}
Off-heap memory usage keeps growing until the Executor exits with an OOM exception. This problem occurs only when writing data to Hadoop (tasks may be killed by the Executor during writing).

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-19509: -- Priority: Major (was: Critical) > [SQL]GROUPING SETS throws NullPointerException when use an empty column > --- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at
[jira] [Updated] (SPARK-19509) GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-19509: -- Summary: GROUPING SETS throws NullPointerException when use an empty column (was: [SQL]GROUPING SETS throws NullPointerException when use an empty column) > GROUPING SETS throws NullPointerException when use an empty column > -- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at >
[jira] [Commented] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859541#comment-15859541 ] Apache Spark commented on SPARK-19509: -- User 'stanzhai' has created a pull request for this issue: https://github.com/apache/spark/pull/16874 > [SQL]GROUPING SETS throws NullPointerException when use an empty column > --- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at >
[jira] [Commented] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859528#comment-15859528 ] Apache Spark commented on SPARK-19509: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/16873 > [SQL]GROUPING SETS throws NullPointerException when use an empty column > --- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at
[jira] [Assigned] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19509: Assignee: (was: Apache Spark) > [SQL]GROUPING SETS throws NullPointerException when use an empty column > --- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Priority: Critical > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at
[jira] [Assigned] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column
[ https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19509: Assignee: Apache Spark > [SQL]GROUPING SETS throws NullPointerException when use an empty column > --- > > Key: SPARK-19509 > URL: https://issues.apache.org/jira/browse/SPARK-19509 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: StanZhai >Assignee: Apache Spark >Priority: Critical > > {code:sql|title=A simple case} > select count(1) from test group by e grouping sets(e) > {code} > {code:title=Schema of the test table} > scala> spark.sql("desc test").show() > ++-+---+ > |col_name|data_type|comment| > ++-+---+ > | e| string| null| > ++-+---+ > {code} > {code:sql|title=The column `e` is empty} > scala> spark.sql("select e from test").show() > ++ > | e| > ++ > |null| > |null| > ++ > {code} > {code:title=Exception} > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944) > at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) > at > org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113) > at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2112) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2327) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:248) > at org.apache.spark.sql.Dataset.show(Dataset.scala:636) > at org.apache.spark.sql.Dataset.show(Dataset.scala:595) > at org.apache.spark.sql.Dataset.show(Dataset.scala:604) > ... 
48 elided > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at >
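The report above can be condensed into a self-contained repro. This is a sketch only: the table name and single string column `e` follow the issue's schema, the all-null data matches the "empty column" in the report, and the NPE is claimed for Spark 2.1.0 specifically.

```scala
// Hypothetical standalone repro for SPARK-19509, following the issue's
// schema: a table with one string column `e` containing only nulls.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-19509-repro").getOrCreate()
import spark.implicits._

// Column `e` holds only nulls, matching the "empty column" in the report.
Seq[Option[String]](None, None).toDF("e").createOrReplaceTempView("test")

// Reported to throw java.lang.NullPointerException inside the generated
// aggregate code (agg_doAggregateWithKeys) on Spark 2.1.0.
spark.sql("select count(1) from test group by e grouping sets(e)").show()
```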
[jira] [Commented] (SPARK-19514) Range is not interruptible
[ https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859499#comment-15859499 ] Apache Spark commented on SPARK-19514: -- User 'ala' has created a pull request for this issue: https://github.com/apache/spark/pull/16872 > Range is not interruptible > -- > > Key: SPARK-19514 > URL: https://issues.apache.org/jira/browse/SPARK-19514 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ala Luszczak > > Currently Range cannot be interrupted. > For example, if you start executing > spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count() > and then call > DAGScheduler.cancelStage(...) > the execution won't stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19514) Range is not interruptible
[ https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19514: Assignee: Apache Spark > Range is not interruptible > -- > > Key: SPARK-19514 > URL: https://issues.apache.org/jira/browse/SPARK-19514 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ala Luszczak >Assignee: Apache Spark > > Currently Range cannot be interrupted. > For example, if you start executing > spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count() > and then call > DAGScheduler.cancelStage(...) > the execution won't stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19514) Range is not interruptible
[ https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19514: Assignee: (was: Apache Spark) > Range is not interruptible > -- > > Key: SPARK-19514 > URL: https://issues.apache.org/jira/browse/SPARK-19514 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ala Luszczak > > Currently Range cannot be interrupted. > For example, if you start executing > spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count() > and then call > DAGScheduler.cancelStage(...) > the execution won't stop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
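For context, `DAGScheduler` is internal; the usual public route to cancellation goes through job groups on `SparkContext`. The sketch below is illustrative only (the group name, sleep, and `A_LOT` constant are invented) and shows the cancellation pattern that, per the report, the range computation ignores:

```scala
// Illustrative cancellation via the public SparkContext API. Per the
// report, the underlying range computation keeps running after cancel.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-19514-sketch").getOrCreate()
val sc = spark.sparkContext
val A_LOT = 1000000000L

// Job-group properties are inherited by threads created after this call.
sc.setJobGroup("range-job", "interruptible range test", interruptOnCancel = true)
val worker = new Thread {
  override def run(): Unit =
    spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count()
}
worker.start()

Thread.sleep(5000)
sc.cancelJobGroup("range-job")  // reported not to stop the execution
```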
[jira] [Resolved] (SPARK-17874) Additional SSL port on HistoryServer should be configurable
[ https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-17874. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.2.0 > Additional SSL port on HistoryServer should be configurable > --- > > Key: SPARK-17874 > URL: https://issues.apache.org/jira/browse/SPARK-17874 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.0.1 >Reporter: Andrew Ash >Assignee: Marcelo Vanzin > Fix For: 2.2.0 > > > When turning on SSL on the HistoryServer with > {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the > [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262] > result of calculating {{spark.history.ui.port + 400}}, and sets up a > redirect from the original (http) port to the new (https) port. > {noformat} > $ netstat -nlp | grep 23714 > (Not all processes could be identified, non-owned process info > will not be shown, you would have to be root to see it all.) > tcp0 0 :::18080:::* > LISTEN 23714/java > tcp0 0 :::18480:::* > LISTEN 23714/java > {noformat} > By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one > open port to change protocol from http to https, not to have 1) additional > ports open 2) the http port remain open 3) the additional port at a value I > didn't specify. > To fix this could take one of two approaches: > Approach 1: > - one port always, which is configured with {{spark.history.ui.port}} > - the protocol on that port is http by default > - or if {{spark.ssl.historyServer.enabled=true}} then it's https > Approach 2: > - add a new configuration item {{spark.history.ui.sslPort}} which configures > the second port that starts up > In approach 1 we probably need a way to specify to Spark jobs whether the > history server has ssl or not, based on SPARK-16988 > That makes me think we should go with approach 2. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
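For reference, the configuration involved looks roughly like the following. Note that `spark.history.ui.sslPort` is only the name proposed in approach 2 above; it is not an existing setting:

```properties
# spark-defaults.conf sketch (behavior as described in the report, Spark 2.0.x)
spark.ssl.historyServer.enabled  true
spark.history.ui.port            18080
# https currently opens at the hardcoded offset port + 400 (here 18480)

# Approach 2 as proposed above -- hypothetical, not implemented:
# spark.history.ui.sslPort       18443
```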
[jira] [Created] (SPARK-19531) History server doesn't refresh jobs for long-life apps like thriftserver
Oleg Danilov created SPARK-19531: Summary: History server doesn't refresh jobs for long-life apps like thriftserver Key: SPARK-19531 URL: https://issues.apache.org/jira/browse/SPARK-19531 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Oleg Danilov If spark.history.fs.logDirectory points to HDFS, the Spark history server doesn't refresh the jobs page. This is caused by Hadoop: while a .inprogress file is being written, HDFS doesn't update the file length until the file is closed, and therefore Spark's history server is not able to detect any changes. I'm going to submit a PR to fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
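The HDFS behavior described above can be sketched with the `FileSystem` API (the path below is illustrative). While a writer still holds the `.inprogress` file open, the length reported by the NameNode lags behind the data actually written, so a poller keyed on file size sees no change:

```scala
// Sketch of the polling problem described above; the path is hypothetical.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val log = new Path("/spark-history/app-123.inprogress")  // illustrative path

// getFileStatus reflects the NameNode's view of the length, which for a
// file still open for write stays stale until the writer closes it.
val len = fs.getFileStatus(log).getLen
// A history server that re-reads only when `len` changes never notices
// the new job data, which is the symptom in this report.
```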
[jira] [Commented] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods
[ https://issues.apache.org/jira/browse/SPARK-19416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859401#comment-15859401 ] Thomas Sebastian commented on SPARK-19416: -- [~josephkb] If I understand correctly the consistent behaviour should be as follows: The statement df.schema("`a.b`") should succeed and df.schema("a.b") should fail. Please confirm. + [~jayadevan.m] > Dataset.schema is inconsistent with Dataset in handling columns with periods > > > Key: SPARK-19416 > URL: https://issues.apache.org/jira/browse/SPARK-19416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > When you have a DataFrame with a column with a period in its name, the API is > inconsistent about how to quote the column name. > Here's a reproduction: > {code} > import org.apache.spark.sql.functions.col > val rows = Seq( > ("foo", 1), > ("bar", 2) > ) > val df = spark.createDataFrame(rows).toDF("a.b", "id") > {code} > These methods are all consistent: > {code} > df.select("a.b") // fails > df.select("`a.b`") // succeeds > df.select(col("a.b")) // fails > df.select(col("`a.b`")) // succeeds > df("a.b") // fails > df("`a.b`") // succeeds > {code} > But {{schema}} is inconsistent: > {code} > df.schema("a.b") // succeeds > df.schema("`a.b`") // fails > {code} > "fails" produces error messages like: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input > columns: [a.b, id];; > 'Project ['a.b] > +- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517] >+- LocalRelation [_1#1511, _2#1512] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296) > at > org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) > at > org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57) > at > org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1121) > at
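A minimal sketch of the asymmetry, plus a workaround (not from the report) that avoids the name-resolution path entirely: `StructType` is a `Seq[StructField]`, so plain collection operations sidestep the inconsistent quoting rules.

```scala
// Run in spark-shell; `spark` is the ambient SparkSession.
val rows = Seq(("foo", 1), ("bar", 2))
val df = spark.createDataFrame(rows).toDF("a.b", "id")  // column named "a.b"

df.select("`a.b`")       // succeeds: selection requires backticks
df.schema("a.b")         // succeeds: schema lookup wants the raw name
// df.schema("`a.b`")    // fails: the backticks are taken literally here

// Workaround: treat the schema as an ordinary collection of fields.
val field = df.schema.find(_.name == "a.b")
```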
[jira] [Comment Edited] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859398#comment-15859398 ] Nattavut Sutyanyong edited comment on SPARK-18874 at 2/9/17 11:48 AM: -- Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have attached the revised version as a pdf format in this JIRA. [~rxin]: FYI. We will be submitting a PR for the code by Tuesday February 14. P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 pending test PRs waiting for review. We hope to have all the test PRs merged before the code. This way when the code is merged, we can verify by exercising the new code against those test cases. was (Author: nsyca): Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have attached the revised version as a pdf format in this JIRA. @rxin: FYI. We will be submitting a PR for the code by Tuesday February 14. P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 pending test PRs waiting for review. We hope to have all the test PRs merged before the code. This way when the code is merged, we can verify by exercising the new code against those test cases. > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subquery in Spark 2.0 (if it works, it > continues to work after this JIRA, if it does not, it won't). The performance > of subquery processing is expected to be at par with Spark 2.0. 
> The representation of the LogicalPlan after Analyzer will be different after > this JIRA that it will preserve the original positions of correlated > predicates in a subquery. This new representation is a preparation work for > the second phase of extending the support of correlated subquery to cases > Spark 2.0 does not support such as deep correlation, outer references in > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859398#comment-15859398 ] Nattavut Sutyanyong edited comment on SPARK-18874 at 2/9/17 11:49 AM: -- Thank you, [~hvanhovell], [~dkbiswal], for reviewing the design document. I have attached the revised version as a pdf format in this JIRA. [~rxin]: FYI. We will be submitting a PR for the code by Tuesday February 14. P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 pending test PRs waiting for review. We hope to have all the test PRs merged before the code. This way when the code is merged, we can verify by exercising the new code against those test cases. was (Author: nsyca): Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have attached the revised version as a pdf format in this JIRA. [~rxin]: FYI. We will be submitting a PR for the code by Tuesday February 14. P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 pending test PRs waiting for review. We hope to have all the test PRs merged before the code. This way when the code is merged, we can verify by exercising the new code against those test cases. > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subquery in Spark 2.0 (if it works, it > continues to work after this JIRA, if it does not, it won't). The performance > of subquery processing is expected to be at par with Spark 2.0. 
> The representation of the LogicalPlan after Analyzer will be different after > this JIRA that it will preserve the original positions of correlated > predicates in a subquery. This new representation is a preparation work for > the second phase of extending the support of correlated subquery to cases > Spark 2.0 does not support such as deep correlation, outer references in > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase
[ https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859398#comment-15859398 ] Nattavut Sutyanyong commented on SPARK-18874: - Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have attached the revised version as a pdf format in this JIRA. @rxin: FYI. We will be submitting a PR for the code by Tuesday February 14. P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 pending test PRs waiting for review. We hope to have all the test PRs merged before the code. This way when the code is merged, we can verify by exercising the new code against those test cases. > First phase: Deferring the correlated predicate pull up to Optimizer phase > -- > > Key: SPARK-18874 > URL: https://issues.apache.org/jira/browse/SPARK-18874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Nattavut Sutyanyong > Attachments: SPARK-18874-3.pdf > > > This JIRA implements the first phase of SPARK-18455 by deferring the > correlated predicate pull up from Analyzer to Optimizer. The goal is to > preserve the current functionality of subquery in Spark 2.0 (if it works, it > continues to work after this JIRA, if it does not, it won't). The performance > of subquery processing is expected to be at par with Spark 2.0. > The representation of the LogicalPlan after Analyzer will be different after > this JIRA that it will preserve the original positions of correlated > predicates in a subquery. This new representation is a preparation work for > the second phase of extending the support of correlated subquery to cases > Spark 2.0 does not support such as deep correlation, outer references in > SELECT clause. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
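To make "deep correlation" and "outer references in SELECT clause" concrete, here are illustrative queries of the kind the second phase targets; the tables t1(a) and t2(b, c) and the exact shapes are invented for this sketch, not taken from the JIRA or its design document.

```scala
// Outer reference inside the subquery's own SELECT list: t1.a appears in
// the scalar subquery's projection, not just in its WHERE predicate.
spark.sql("""
  SELECT (SELECT t1.a + MAX(t2.c) FROM t2 WHERE t2.b = t1.a) FROM t1
""")

// Deep correlation: the outermost column t1.a is referenced two
// subquery levels down.
spark.sql("""
  SELECT * FROM t1
  WHERE EXISTS (SELECT 1 FROM t2
                WHERE t2.b IN (SELECT c FROM t2 WHERE t2.c = t1.a))
""")
```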
[jira] [Commented] (SPARK-19493) Remove Java 7 support
[ https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859369#comment-15859369 ] Apache Spark commented on SPARK-19493: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/16871 > Remove Java 7 support > - > > Key: SPARK-19493 > URL: https://issues.apache.org/jira/browse/SPARK-19493 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to > officially remove Java 7 support in 2.2 or 2.3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19493) Remove Java 7 support
[ https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19493: Assignee: Reynold Xin (was: Apache Spark) > Remove Java 7 support > - > > Key: SPARK-19493 > URL: https://issues.apache.org/jira/browse/SPARK-19493 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to > officially remove Java 7 support in 2.2 or 2.3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19493) Remove Java 7 support
[ https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19493: Assignee: Apache Spark (was: Reynold Xin) > Remove Java 7 support > - > > Key: SPARK-19493 > URL: https://issues.apache.org/jira/browse/SPARK-19493 > Project: Spark > Issue Type: New Feature > Components: Build >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > > Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to > officially remove Java 7 support in 2.2 or 2.3. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859342#comment-15859342 ] Sean Owen commented on SPARK-19523: --- Not sure what this means -- there is one HiveContext per driver program and you use it where it's needed, including inside foreachRDD calls. If you are getting that error, you are not creating one but many. > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > 
.mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
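The fix implied by the comment above is the standard singleton pattern: construct the context once on the driver and reuse it inside foreachRDD, instead of `new HiveContext(rdd.sparkContext)` on every batch. A sketch, simplified from the report's code (the stream and Record names follow the report; the singleton object is illustrative):

```scala
// One HiveContext per driver program, created lazily and reused in every
// batch -- the pattern the comment above recommends.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.hive.HiveContext

object HiveContextSingleton {
  @transient private var instance: HiveContext = _
  def getInstance(sc: SparkContext): HiveContext = synchronized {
    if (instance == null) instance = new HiveContext(sc)
    instance
  }
}

clientLogDStream.foreachRDD { rdd =>
  val hiveContext = HiveContextSingleton.getInstance(rdd.sparkContext)
  import hiveContext.implicits._
  val input = rdd.map(x => Record(x._2.toString)).toDF()
  input.write.mode(SaveMode.Append).insertInto("temp.x_streaming")
}
```

Whether reusing the context also eliminates the leftover .hive-staging directories is a separate question, but it removes the many-contexts problem the comment points at.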
[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.
[ https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859303#comment-15859303 ] Sean Owen commented on SPARK-19524: --- Well, that is its definition of 'new' -- what do you expect instead? initialModTimeIgnoreThreshold controls some slack in the time it considers 'old'. > newFilesOnly does not work according to docs. > -- > > Key: SPARK-19524 > URL: https://issues.apache.org/jira/browse/SPARK-19524 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 >Reporter: Egor Pahomov > > The docs say: > newFilesOnly > Should process only new files and ignore existing files in the directory > It's not working. > http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files > says that it doesn't work as expected. > https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala > is not clear at all about what the code tries to do -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
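The selection rule the comment alludes to can be paraphrased as follows; this is a simplification of FileInputDStream's logic with illustrative names, not the actual source. A file is picked up when its modification time falls at or after the ignore threshold and within the current batch window; `newFilesOnly` only influences the initial threshold.

```scala
// Simplified paraphrase of FileInputDStream's file selection; names and
// shapes here are illustrative, not the real implementation.
def initialThreshold(clockTimeMs: Long, newFilesOnly: Boolean): Long =
  if (newFilesOnly) clockTimeMs  // ignore files modified before startup
  else 0L                        // consider any existing file

def isSelected(modTimeMs: Long, thresholdMs: Long, batchTimeMs: Long): Boolean =
  modTimeMs >= thresholdMs && modTimeMs <= batchTimeMs
```

So "new" means "modification time at or after the stream's start (or threshold)", not "never seen before in the directory", which is the mismatch with the reporter's reading of the docs.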
[jira] [Assigned] (SPARK-19496) to_date with format has weird behavior
[ https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19496: Assignee: (was: Apache Spark) > to_date with format has weird behavior > -- > > Key: SPARK-19496 > URL: https://issues.apache.org/jira/browse/SPARK-19496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Wenchen Fan > > Today, running > {code} > SELECT to_date('2015-07-22', 'yyyy-dd-MM') > {code} > returns `2016-10-07`, while running > {code} > SELECT to_date('2014-31-12') # default format > {code} > returns null. > This behavior is weird and we should check other systems like hive to see if > this is expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
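The `2016-10-07` result is explained by `java.text.SimpleDateFormat`, which backed `to_date` parsing in this era of Spark and is lenient by default: with a pattern of the form yyyy-dd-MM (the pattern implied by the reported result; the archived message shows it partially stripped), the input `2015-07-22` is read as day 07 of month 22, and month 22 rolls over into October of the following year. A sketch of the mechanism:

```scala
import java.text.SimpleDateFormat

val fmt = new SimpleDateFormat("yyyy-dd-MM")  // lenient by default
val parsed = fmt.parse("2015-07-22")          // year 2015, day 07, month 22

val iso = new SimpleDateFormat("yyyy-MM-dd")
iso.format(parsed)  // "2016-10-07": month 22 normalizes to Oct 2016

// With strict parsing the same input is rejected outright:
fmt.setLenient(false)
fmt.parse("2015-07-22")  // throws java.text.ParseException
```

The null from the default-format case comes from a different path (the parse failure is swallowed and surfaced as null), which is why the two behaviors look inconsistent side by side.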