[jira] [Commented] (SPARK-19536) Improve capability to merge SQL data types

2017-02-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860816#comment-15860816
 ] 

Hyukjin Kwon commented on SPARK-19536:
--

Thank you for kindly adding a link.

> Improve capability to merge SQL data types
> --
>
> Key: SPARK-19536
> URL: https://issues.apache.org/jira/browse/SPARK-19536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>Priority: Minor
>
> Spark's union/merging of compatible types is currently limited: it works on
> basic types in the top-level record, but it fails for nested records, maps,
> arrays, etc.
> I would like to improve this.
> For example, I get errors like this:
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(_1,StringType,true), 
> StructField(_2,IntegerType,false)) <> 
> StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)) 
> at the first column of the second table
> {noformat}
> Some examples that do work:
> {noformat}
> scala> Seq(1, 2, 3).toDF union Seq(1L, 2L, 3L).toDF
> res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: bigint]
> scala> Seq((1,"x"), (2,"x"), (3,"x")).toDF union Seq((1L,"x"), (2L,"x"), 
> (3L,"x")).toDF
> res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: bigint, 
> _2: string]
> {noformat}
> What I would also expect to work, but which currently doesn't:
> {noformat}
> scala> Seq((Seq(1),"x"), (Seq(2),"x"), (Seq(3),"x")).toDF union 
> Seq((Seq(1L),"x"), (Seq(2L),"x"), (Seq(3L),"x")).toDF
> scala> Seq((1,("x",1)), (2,("x",2)), (3,("x",3))).toDF union 
> Seq((1L,("x",1L)), (2L,("x",2L)), (3L,("x", 3L))).toDF
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17498:


Assignee: Apache Spark

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>Assignee: Apache Spark
>
> That would map an unseen label to the maximum known label + 1; IndexToString
> would map it back to "" or NA, if Spark has something like that.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860812#comment-15860812
 ] 

Apache Spark commented on SPARK-17498:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/16883

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That would map an unseen label to the maximum known label + 1; IndexToString
> would map it back to "" or NA, if Spark has something like that.
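For context, a minimal sketch of how handleInvalid is set today in the Spark 2.x ML API; the column names are illustrative, "error" and "skip" are the existing values, and the proposed option (called 'new' in this issue, name not final) would simply be another accepted value:

{code}
// Minimal sketch (existing API). Today handleInvalid accepts "error" (default)
// and "skip"; the proposal adds a value that keeps unseen labels by assigning
// them index = number of known labels (i.e. maximum known index + 1).
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("skip")   // the proposed option would be set here instead
{code}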



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17498:


Assignee: (was: Apache Spark)

> StringIndexer.setHandleInvalid should have another option 'new'
> ---
>
> Key: SPARK-17498
> URL: https://issues.apache.org/jira/browse/SPARK-17498
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Miroslav Balaz
>
> That would map an unseen label to the maximum known label + 1; IndexToString
> would map it back to "" or NA, if Spark has something like that.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19536) Improve capability to merge SQL data types

2017-02-09 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860748#comment-15860748
 ] 

koert kuipers commented on SPARK-19536:
---

Great, that only leaves maps and nested structs then, I think.

> Improve capability to merge SQL data types
> --
>
> Key: SPARK-19536
> URL: https://issues.apache.org/jira/browse/SPARK-19536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>Priority: Minor
>
> Spark's union/merging of compatible types is currently limited: it works on
> basic types in the top-level record, but it fails for nested records, maps,
> arrays, etc.
> I would like to improve this.
> For example, I get errors like this:
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(_1,StringType,true), 
> StructField(_2,IntegerType,false)) <> 
> StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)) 
> at the first column of the second table
> {noformat}
> Some examples that do work:
> {noformat}
> scala> Seq(1, 2, 3).toDF union Seq(1L, 2L, 3L).toDF
> res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: bigint]
> scala> Seq((1,"x"), (2,"x"), (3,"x")).toDF union Seq((1L,"x"), (2L,"x"), 
> (3L,"x")).toDF
> res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: bigint, 
> _2: string]
> {noformat}
> What I would also expect to work, but which currently doesn't:
> {noformat}
> scala> Seq((Seq(1),"x"), (Seq(2),"x"), (Seq(3),"x")).toDF union 
> Seq((Seq(1L),"x"), (Seq(2L),"x"), (Seq(3L),"x")).toDF
> scala> Seq((1,("x",1)), (2,("x",2)), (3,("x",3))).toDF union 
> Seq((1L,("x",1L)), (2L,("x",2L)), (3L,("x", 3L))).toDF
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19544:


Assignee: Apache Spark

> Improve error message when some column types are compatible and others are 
> not in set/union operations
> --
>
> Key: SPARK-19544
> URL: https://issues.apache.org/jira/browse/SPARK-19544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> Currently,
> {code}
> Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. LongType <> IntegerType at the first column 
> of the second table;;
> {code}
> It prints that it fails due to {{LongType <> IntegerType}} whereas actually 
> it is  {{StructType(...) <> StructType(...)}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860719#comment-15860719
 ] 

Apache Spark commented on SPARK-19544:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/16882

> Improve error message when some column types are compatible and others are 
> not in set/union operations
> --
>
> Key: SPARK-19544
> URL: https://issues.apache.org/jira/browse/SPARK-19544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently,
> {code}
> Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. LongType <> IntegerType at the first column 
> of the second table;;
> {code}
> It prints that it fails due to {{LongType <> IntegerType}} whereas actually 
> it is  {{StructType(...) <> StructType(...)}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19544:


Assignee: (was: Apache Spark)

> Improve error message when some column types are compatible and others are 
> not in set/union operations
> --
>
> Key: SPARK-19544
> URL: https://issues.apache.org/jira/browse/SPARK-19544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently,
> {code}
> Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. LongType <> IntegerType at the first column 
> of the second table;;
> {code}
> It prints that it fails due to {{LongType <> IntegerType}} whereas actually 
> it is  {{StructType(...) <> StructType(...)}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations

2017-02-09 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-19544:


 Summary: Improve error message when some column types are 
compatible and others are not in set/union operations
 Key: SPARK-19544
 URL: https://issues.apache.org/jira/browse/SPARK-19544
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently,

{code}
Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. LongType <> IntegerType at the first column 
of the second table;;
{code}

It prints that it fails due to {{ LongType <> IntegerType}} whereas actually it 
is  {{StructType(...) <> StructType(...)}}
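A quick way to see the real mismatch (the nested integer vs. string field) is to compare the two schemas directly; a minimal sketch using the same example data, run in spark-shell:

{code}
// Minimal sketch: printing the schema trees makes the nested mismatch obvious.
val a = Seq((1, ("a", 1))).toDF      // _1: int,    _2: struct<_1:string,_2:int>
val b = Seq((1L, ("a", "b"))).toDF   // _1: bigint, _2: struct<_1:string,_2:string>

a.schema.printTreeString()
b.schema.printTreeString()
{code}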



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19544) Improve error message when some column types are compatible and others are not in set/union operations

2017-02-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-19544:
-
Description: 
Currently,

{code}
Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. LongType <> IntegerType at the first column 
of the second table;;
{code}

It prints that it fails due to {{LongType <> IntegerType}} whereas actually it 
is  {{StructType(...) <> StructType(...)}}

  was:
Currently,

{code}
Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. LongType <> IntegerType at the first column 
of the second table;;
{code}

It prints that it fails due to {{ LongType <> IntegerType}} whereas actually it 
is  {{StructType(...) <> StructType(...)}}


> Improve error message when some column types are compatible and others are 
> not in set/union operations
> --
>
> Key: SPARK-19544
> URL: https://issues.apache.org/jira/browse/SPARK-19544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently,
> {code}
> Seq((1,("a", 1))).toDF union Seq((1L,("a", "b"))).toDF
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. LongType <> IntegerType at the first column 
> of the second table;;
> {code}
> It prints that it fails due to {{LongType <> IntegerType}} whereas actually 
> it is  {{StructType(...) <> StructType(...)}}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19543) from_json fails when the input row is empty

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19543:


Assignee: (was: Apache Spark)

> from_json fails when the input row is empty 
> 
>
> Key: SPARK-19543
> URL: https://issues.apache.org/jira/browse/SPARK-19543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> Using from_json on a column with an empty string results in: 
> java.util.NoSuchElementException: head of empty list



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19543) from_json fails when the input row is empty

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19543:


Assignee: Apache Spark

> from_json fails when the input row is empty 
> 
>
> Key: SPARK-19543
> URL: https://issues.apache.org/jira/browse/SPARK-19543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Using from_json on a column with an empty string results in: 
> java.util.NoSuchElementException: head of empty list



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19543) from_json fails when the input row is empty

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860671#comment-15860671
 ] 

Apache Spark commented on SPARK-19543:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/16881

> from_json fails when the input row is empty 
> 
>
> Key: SPARK-19543
> URL: https://issues.apache.org/jira/browse/SPARK-19543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Burak Yavuz
>
> Using from_json on a column with an empty string results in: 
> java.util.NoSuchElementException: head of empty list



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19543) from_json fails when the input row is empty

2017-02-09 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-19543:
---

 Summary: from_json fails when the input row is empty 
 Key: SPARK-19543
 URL: https://issues.apache.org/jira/browse/SPARK-19543
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Burak Yavuz


Using from_json on a column with an empty string results in: 
java.util.NoSuchElementException: head of empty list
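A minimal reproduction sketch, run in spark-shell; the schema and column name are illustrative:

{code}
// Sketch reproducing the failure described above (spark-shell, Spark 2.1).
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val schema = new StructType().add("a", IntegerType)
val df = Seq("""{"a": 1}""", "").toDF("json")   // one valid row, one empty string

df.select(from_json($"json", schema)).show()    // fails: head of empty list
{code}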




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19536) Improve capability to merge SQL data types

2017-02-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860665#comment-15860665
 ] 

Hyukjin Kwon commented on SPARK-19536:
--

Oh, for {{ArrayType}}, I filed a JIRA, SPARK-19435, and opened a PR: 
https://github.com/apache/spark/pull/16777

> Improve capability to merge SQL data types
> --
>
> Key: SPARK-19536
> URL: https://issues.apache.org/jira/browse/SPARK-19536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: koert kuipers
>Priority: Minor
>
> Spark's union/merging of compatible types is currently limited: it works on
> basic types in the top-level record, but it fails for nested records, maps,
> arrays, etc.
> I would like to improve this.
> For example, I get errors like this:
> {noformat}
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. StructType(StructField(_1,StringType,true), 
> StructField(_2,IntegerType,false)) <> 
> StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)) 
> at the first column of the second table
> {noformat}
> Some examples that do work:
> {noformat}
> scala> Seq(1, 2, 3).toDF union Seq(1L, 2L, 3L).toDF
> res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: bigint]
> scala> Seq((1,"x"), (2,"x"), (3,"x")).toDF union Seq((1L,"x"), (2L,"x"), 
> (3L,"x")).toDF
> res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: bigint, 
> _2: string]
> {noformat}
> What I would also expect to work, but which currently doesn't:
> {noformat}
> scala> Seq((Seq(1),"x"), (Seq(2),"x"), (Seq(3),"x")).toDF union 
> Seq((Seq(1L),"x"), (Seq(2L),"x"), (Seq(3L),"x")).toDF
> scala> Seq((1,("x",1)), (2,("x",2)), (3,("x",3))).toDF union 
> Seq((1L,("x",1L)), (2L,("x",2L)), (3L,("x", 3L))).toDF
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860661#comment-15860661
 ] 

Apache Spark commented on SPARK-19542:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16880

> Delete the temp checkpoint if a query is stopped without errors
> ---
>
> Key: SPARK-19542
> URL: https://issues.apache.org/jira/browse/SPARK-19542
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19542:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Delete the temp checkpoint if a query is stopped without errors
> ---
>
> Key: SPARK-19542
> URL: https://issues.apache.org/jira/browse/SPARK-19542
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19542:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Delete the temp checkpoint if a query is stopped without errors
> ---
>
> Key: SPARK-19542
> URL: https://issues.apache.org/jira/browse/SPARK-19542
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19542) Delete the temp checkpoint if a query is stopped without errors

2017-02-09 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19542:


 Summary: Delete the temp checkpoint if a query is stopped without 
errors
 Key: SPARK-19542
 URL: https://issues.apache.org/jira/browse/SPARK-19542
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState

2017-02-09 Thread Kunal Khamar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khamar updated SPARK-19540:
-
Issue Type: Improvement  (was: Story)

> Add ability to clone SparkSession wherein cloned session has a reference to 
> SharedState and an identical copy of the SessionState
> -
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Kunal Khamar
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user 
> registers a new UDF, then the new UDF will not be available inside the clone. 
> The same goes for configs and temp tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19541) High Availability support for ThriftServer

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19541:


Assignee: Apache Spark

> High Availability support for ThriftServer
> --
>
> Key: SPARK-19541
> URL: https://issues.apache.org/jira/browse/SPARK-19541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: LvDeShui
>Assignee: Apache Spark
>
> Currently, we use the Spark ThriftServer frequently, and there are many 
> connections between clients and a single ThriftServer. When the ThriftServer 
> is down, the service is unavailable until the server is restarted, so we need 
> to consider ThriftServer HA in the same way as HiveServer HA. We want to 
> adopt the HiveServer HA pattern for the ThriftServer: start multiple Thrift 
> servers that register themselves in ZooKeeper, so a client can locate a 
> Thrift server just by connecting to ZooKeeper. Beeline can then get the 
> service from another Thrift server when one goes down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19541) High Availability support for ThriftServer

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860545#comment-15860545
 ] 

Apache Spark commented on SPARK-19541:
--

User 'lvdongr' has created a pull request for this issue:
https://github.com/apache/spark/pull/16879

> High Availability support for ThriftServer
> --
>
> Key: SPARK-19541
> URL: https://issues.apache.org/jira/browse/SPARK-19541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: LvDeShui
>
> Currently, we use the Spark ThriftServer frequently, and there are many 
> connections between clients and a single ThriftServer. When the ThriftServer 
> is down, the service is unavailable until the server is restarted, so we need 
> to consider ThriftServer HA in the same way as HiveServer HA. We want to 
> adopt the HiveServer HA pattern for the ThriftServer: start multiple Thrift 
> servers that register themselves in ZooKeeper, so a client can locate a 
> Thrift server just by connecting to ZooKeeper. Beeline can then get the 
> service from another Thrift server when one goes down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19541) High Availability support for ThriftServer

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19541:


Assignee: (was: Apache Spark)

> High Availability support for ThriftServer
> --
>
> Key: SPARK-19541
> URL: https://issues.apache.org/jira/browse/SPARK-19541
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.1
>Reporter: LvDeShui
>
> Currently, we use the Spark ThriftServer frequently, and there are many 
> connections between clients and a single ThriftServer. When the ThriftServer 
> is down, the service is unavailable until the server is restarted, so we need 
> to consider ThriftServer HA in the same way as HiveServer HA. We want to 
> adopt the HiveServer HA pattern for the ThriftServer: start multiple Thrift 
> servers that register themselves in ZooKeeper, so a client can locate a 
> Thrift server just by connecting to ZooKeeper. Beeline can then get the 
> service from another Thrift server when one goes down.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19539) CREATE TEMPORARY TABLE needs to avoid existing temp view

2017-02-09 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-19539:
---
Summary: CREATE TEMPORARY TABLE needs to avoid existing temp view  (was: 
CREATE TEMPORARY TABLE need to avoid existing temp view)

> CREATE TEMPORARY TABLE needs to avoid existing temp view
> 
>
> Key: SPARK-19539
> URL: https://issues.apache.org/jira/browse/SPARK-19539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to 
> use "CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" 
> clause.  However, if there is an existing temporary view defined, it is 
> possible to unintentionally replace this existing view by issuing "CREATE 
> TEMPORARY TABLE ... " with the same table/view name. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860512#comment-15860512
 ] 

Apache Spark commented on SPARK-19539:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/16878

> CREATE TEMPORARY TABLE need to avoid existing temp view
> ---
>
> Key: SPARK-19539
> URL: https://issues.apache.org/jira/browse/SPARK-19539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to 
> use "CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" 
> clause.  However, if there is an existing temporary view defined, it is 
> possible to unintentionally replace this existing view by issuing "CREATE 
> TEMPORARY TABLE ... " with the same table/view name. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19539:


Assignee: Apache Spark

> CREATE TEMPORARY TABLE need to avoid existing temp view
> ---
>
> Key: SPARK-19539
> URL: https://issues.apache.org/jira/browse/SPARK-19539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>
> Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to 
> use "CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" 
> clause.  However, if there is an existing temporary view defined, it is 
> possible to unintentionally replace this existing view by issuing "CREATE 
> TEMPORARY TABLE ... " with the same table/view name. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19539:


Assignee: (was: Apache Spark)

> CREATE TEMPORARY TABLE need to avoid existing temp view
> ---
>
> Key: SPARK-19539
> URL: https://issues.apache.org/jira/browse/SPARK-19539
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to 
> use "CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" 
> clause.  However, if there is an existing temporary view defined, it is 
> possible to unintentionally replace this existing view by issuing "CREATE 
> TEMPORARY TABLE ... " with the same table/view name. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19541) High Availability support for ThriftServer

2017-02-09 Thread LvDeShui (JIRA)
LvDeShui created SPARK-19541:


 Summary: High Availability support for ThriftServer
 Key: SPARK-19541
 URL: https://issues.apache.org/jira/browse/SPARK-19541
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.3.1
Reporter: LvDeShui


Currently, we use the Spark ThriftServer frequently, and there are many 
connections between clients and a single ThriftServer. When the ThriftServer is 
down, the service is unavailable until the server is restarted, so we need to 
consider ThriftServer HA in the same way as HiveServer HA. We want to adopt the 
HiveServer HA pattern for the ThriftServer: start multiple Thrift servers that 
register themselves in ZooKeeper, so a client can locate a Thrift server just by 
connecting to ZooKeeper. Beeline can then get the service from another Thrift 
server when one goes down.
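
For reference, this is roughly how HiveServer2-style dynamic service discovery looks from the client side; the hosts, namespace, and credentials below are illustrative, and the Spark ThriftServer does not support this yet (which is what this issue proposes):

{code}
// Sketch of the HiveServer2-style HA connection this proposal would mirror.
// Requires the Hive JDBC driver on the classpath; all values are illustrative.
val url = "jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/default;" +
  "serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=sparkThriftServer"

val conn = java.sql.DriverManager.getConnection(url, "user", "")
val rs = conn.createStatement().executeQuery("SELECT 1")
{code}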



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19540:


Assignee: Apache Spark

> Add ability to clone SparkSession wherein cloned session has a reference to 
> SharedState and an identical copy of the SessionState
> -
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Kunal Khamar
>Assignee: Apache Spark
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user 
> registers a new UDF, then the new UDF will not be available inside the clone. 
> The same goes for configs and temp tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19540:


Assignee: (was: Apache Spark)

> Add ability to clone SparkSession wherein cloned session has a reference to 
> SharedState and an identical copy of the SessionState
> -
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Kunal Khamar
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user 
> registers a new UDF, then the new UDF will not be available inside the clone. 
> The same goes for configs and temp tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860508#comment-15860508
 ] 

Apache Spark commented on SPARK-19540:
--

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/16826

> Add ability to clone SparkSession wherein cloned session has a reference to 
> SharedState and an identical copy of the SessionState
> -
>
> Key: SPARK-19540
> URL: https://issues.apache.org/jira/browse/SPARK-19540
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Kunal Khamar
>
> Forking a newSession() from SparkSession currently makes a new SparkSession 
> that does not retain SessionState (i.e. temporary tables, SQL config, 
> registered functions, etc.). This change adds a method cloneSession() which 
> creates a new SparkSession with a copy of the parent's SessionState.
> Subsequent changes to the base session are not propagated to the cloned 
> session; the clone is independent after creation.
> If the base is changed after the clone has been created, say the user 
> registers a new UDF, then the new UDF will not be available inside the clone. 
> The same goes for configs and temp tables.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19540) Add ability to clone SparkSession wherein cloned session has a reference to SharedState and an identical copy of the SessionState

2017-02-09 Thread Kunal Khamar (JIRA)
Kunal Khamar created SPARK-19540:


 Summary: Add ability to clone SparkSession wherein cloned session 
has a reference to SharedState and an identical copy of the SessionState
 Key: SPARK-19540
 URL: https://issues.apache.org/jira/browse/SPARK-19540
 Project: Spark
  Issue Type: Story
  Components: SQL
Affects Versions: 2.1.1
Reporter: Kunal Khamar


Forking a newSession() from SparkSession currently makes a new SparkSession 
that does not retain SessionState (i.e. temporary tables, SQL config, 
registered functions, etc.). This change adds a method cloneSession() which 
creates a new SparkSession with a copy of the parent's SessionState.
Subsequent changes to the base session are not propagated to the cloned 
session; the clone is independent after creation.
If the base is changed after the clone has been created, say the user registers 
a new UDF, then the new UDF will not be available inside the clone. The same 
goes for configs and temp tables.
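
A minimal sketch of the current newSession() behavior that motivates this, run in spark-shell; the view name and config value are illustrative:

{code}
// Sketch of today's behavior: newSession() does not carry over session state.
spark.range(10).createOrReplaceTempView("t")
spark.conf.set("spark.sql.shuffle.partitions", "10")

val fresh = spark.newSession()
fresh.catalog.tableExists("t")                  // false: temp view not retained
fresh.conf.get("spark.sql.shuffle.partitions")  // back to the default, not "10"

// The proposed cloneSession() would instead start from a copy of this state,
// after which the two sessions diverge independently.
{code}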



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19539) CREATE TEMPORARY TABLE need to avoid existing temp view

2017-02-09 Thread Xin Wu (JIRA)
Xin Wu created SPARK-19539:
--

 Summary: CREATE TEMPORARY TABLE need to avoid existing temp view
 Key: SPARK-19539
 URL: https://issues.apache.org/jira/browse/SPARK-19539
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xin Wu


Current "CREATE TEMPORARY TABLE ... " is deprecated and recommend users to use 
"CREATE TEMPORARY VIEW ..." And it does not support "IF NOT EXISTS" clause.  
However, if there is an existing temporary view defined, it is possible to 
unintentionally replace this existing view by issuing "CREATE TEMPORARY TABLE 
... " with the same table/view name. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19538:


Assignee: Apache Spark  (was: Kay Ousterhout)

> DAGScheduler and TaskSetManager can have an inconsistent view of whether a 
> stage is complete.
> -
>
> Key: SPARK-19538
> URL: https://issues.apache.org/jira/browse/SPARK-19538
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>
> The pendingPartitions in Stage tracks partitions that still need to be 
> computed, and is used by the DAGScheduler to determine when to mark the stage 
> as complete.  In most cases, this variable is exactly consistent with the 
> tasks in the TaskSetManager (for the current version of the stage) that are 
> still pending.  However, as discussed in SPARK-19263, these can become 
> inconsistent when a ShuffleMapTask for an earlier attempt of the stage 
> completes, in which case the DAGScheduler may think the stage has finished, 
> while the TaskSetManager is still waiting for some tasks to complete (see the 
> description in this pull request: 
> https://github.com/apache/spark/pull/16620).  This leads to bugs like 
> SPARK-19263.  Another problem with this behavior is that listeners can get 
> two StageCompleted messages: once when the DAGScheduler thinks the stage is 
> complete, and a second when the TaskSetManager later decides the stage is 
> complete.  We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860427#comment-15860427
 ] 

Apache Spark commented on SPARK-19538:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/16877

> DAGScheduler and TaskSetManager can have an inconsistent view of whether a 
> stage is complete.
> -
>
> Key: SPARK-19538
> URL: https://issues.apache.org/jira/browse/SPARK-19538
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>
> The pendingPartitions in Stage tracks partitions that still need to be 
> computed, and is used by the DAGScheduler to determine when to mark the stage 
> as complete.  In most cases, this variable is exactly consistent with the 
> tasks in the TaskSetManager (for the current version of the stage) that are 
> still pending.  However, as discussed in SPARK-19263, these can become 
> inconsistent when a ShuffleMapTask for an earlier attempt of the stage 
> completes, in which case the DAGScheduler may think the stage has finished, 
> while the TaskSetManager is still waiting for some tasks to complete (see the 
> description in this pull request: 
> https://github.com/apache/spark/pull/16620).  This leads to bugs like 
> SPARK-19263.  Another problem with this behavior is that listeners can get 
> two StageCompleted messages: once when the DAGScheduler thinks the stage is 
> complete, and a second when the TaskSetManager later decides the stage is 
> complete.  We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19538:


Assignee: Kay Ousterhout  (was: Apache Spark)

> DAGScheduler and TaskSetManager can have an inconsistent view of whether a 
> stage is complete.
> -
>
> Key: SPARK-19538
> URL: https://issues.apache.org/jira/browse/SPARK-19538
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>
> The pendingPartitions in Stage tracks partitions that still need to be 
> computed, and is used by the DAGScheduler to determine when to mark the stage 
> as complete.  In most cases, this variable is exactly consistent with the 
> tasks in the TaskSetManager (for the current version of the stage) that are 
> still pending.  However, as discussed in SPARK-19263, these can become 
> inconsistent when a ShuffleMapTask for an earlier attempt of the stage 
> completes, in which case the DAGScheduler may think the stage has finished, 
> while the TaskSetManager is still waiting for some tasks to complete (see the 
> description in this pull request: 
> https://github.com/apache/spark/pull/16620).  This leads to bugs like 
> SPARK-19263.  Another problem with this behavior is that listeners can get 
> two StageCompleted messages: once when the DAGScheduler thinks the stage is 
> complete, and a second when the TaskSetManager later decides the stage is 
> complete.  We should fix this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19537:


Assignee: Kay Ousterhout  (was: Apache Spark)

> Move the pendingPartitions variable from Stage to ShuffleMapStage
> -
>
> Key: SPARK-19537
> URL: https://issues.apache.org/jira/browse/SPARK-19537
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> This variable is only used by ShuffleMapStages, and it is confusing to have 
> it in the Stage class rather than the ShuffleMapStage class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860417#comment-15860417
 ] 

Apache Spark commented on SPARK-19537:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/16876

> Move the pendingPartitions variable from Stage to ShuffleMapStage
> -
>
> Key: SPARK-19537
> URL: https://issues.apache.org/jira/browse/SPARK-19537
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
>
> This variable is only used by ShuffleMapStages, and it is confusing to have 
> it in the Stage class rather than the ShuffleMapStage class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19537:


Assignee: Apache Spark  (was: Kay Ousterhout)

> Move the pendingPartitions variable from Stage to ShuffleMapStage
> -
>
> Key: SPARK-19537
> URL: https://issues.apache.org/jira/browse/SPARK-19537
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Kay Ousterhout
>Assignee: Apache Spark
>Priority: Minor
>
> This variable is only used by ShuffleMapStages, and it is confusing to have 
> it in the Stage class rather than the ShuffleMapStage class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.

2017-02-09 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19538:
--

 Summary: DAGScheduler and TaskSetManager can have an inconsistent 
view of whether a stage is complete.
 Key: SPARK-19538
 URL: https://issues.apache.org/jira/browse/SPARK-19538
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout


The pendingPartitions in Stage tracks partitions that still need to be 
computed, and is used by the DAGScheduler to determine when to mark the stage 
as complete.  In most cases, this variable is exactly consistent with the tasks 
in the TaskSetManager (for the current version of the stage) that are still 
pending.  However, as discussed in SPARK-19263, these can become inconsistent 
when a ShuffleMapTask for an earlier attempt of the stage completes, in which 
case the DAGScheduler may think the stage has finished, while the 
TaskSetManager is still waiting for some tasks to complete (see the description 
in this pull request: https://github.com/apache/spark/pull/16620).  This leads 
to bugs like SPARK-19263.  Another problem with this behavior is that listeners 
can get two StageCompleted messages: once when the DAGScheduler thinks the 
stage is complete, and a second when the TaskSetManager later decides the stage 
is complete.  We should fix this.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage

2017-02-09 Thread Kay Ousterhout (JIRA)
Kay Ousterhout created SPARK-19537:
--

 Summary: Move the pendingPartitions variable from Stage to 
ShuffleMapStage
 Key: SPARK-19537
 URL: https://issues.apache.org/jira/browse/SPARK-19537
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.1.0
Reporter: Kay Ousterhout
Assignee: Kay Ousterhout
Priority: Minor


This variable is only used by ShuffleMapStages, and it is confusing to have it 
in the Stage class rather than the ShuffleMapStage class.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5864) support .jar as python package

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860373#comment-15860373
 ] 

Apache Spark commented on SPARK-5864:
-

User 'kunalkhamar' has created a pull request for this issue:
https://github.com/apache/spark/pull/16826

> support .jar as python package
> --
>
> Key: SPARK-5864
> URL: https://issues.apache.org/jira/browse/SPARK-5864
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.3.0
>
>
> Support a .jar file as a Python package (same as .zip or .egg)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19536) Improve capability to merge SQL data types

2017-02-09 Thread koert kuipers (JIRA)
koert kuipers created SPARK-19536:
-

 Summary: Improve capability to merge SQL data types
 Key: SPARK-19536
 URL: https://issues.apache.org/jira/browse/SPARK-19536
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: koert kuipers
Priority: Minor


Spark's union/merging of compatible types is currently limited: it works on basic 
types in the top-level record, but it fails for nested records, maps, arrays, 
etc.

I would like to improve this.

For example, I get errors like this:
{noformat}
org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. StructType(StructField(_1,StringType,true), 
StructField(_2,IntegerType,false)) <> 
StructType(StructField(_1,StringType,true), StructField(_2,LongType,false)) at 
the first column of the second table
{noformat}
Some examples that do work:
{noformat}
scala> Seq(1, 2, 3).toDF union Seq(1L, 2L, 3L).toDF
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: bigint]

scala> Seq((1,"x"), (2,"x"), (3,"x")).toDF union Seq((1L,"x"), (2L,"x"), 
(3L,"x")).toDF
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [_1: bigint, _2: 
string]
{noformat}
What I would also expect to work, but which currently doesn't:
{noformat}
scala> Seq((Seq(1),"x"), (Seq(2),"x"), (Seq(3),"x")).toDF union 
Seq((Seq(1L),"x"), (Seq(2L),"x"), (Seq(3L),"x")).toDF

scala> Seq((1,("x",1)), (2,("x",2)), (3,("x",3))).toDF union Seq((1L,("x",1L)), 
(2L,("x",2L)), (3L,("x", 3L))).toDF
{noformat}
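
Until this is improved, one workaround sketch is to widen the nested or array element types explicitly before the union; run in spark-shell, names and values are illustrative:

{code}
// Workaround sketch: cast the narrower side to the wider type before union.
import org.apache.spark.sql.functions.col

val left  = Seq((Seq(1), "x"), (Seq(2), "x")).toDF("_1", "_2")    // _1: array<int>
val right = Seq((Seq(1L), "y"), (Seq(2L), "y")).toDF("_1", "_2")  // _1: array<bigint>

val widened = left.select(col("_1").cast("array<bigint>").as("_1"), col("_2"))
widened union right    // works once both sides carry array<bigint>
{code}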



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16713) Limit codegen method size to 8KB

2017-02-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-16713.
-
Resolution: Won't Fix

According to the discussion on the PR, I think we can close this as WON'T FIX 
for now.

> Limit codegen method size to 8KB
> 
>
> Key: SPARK-16713
> URL: https://issues.apache.org/jira/browse/SPARK-16713
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0
>Reporter: Qifan Pu
>
> Ideally, we would want codegen methods to be less than 8KB of bytecode. 
> Beyond 8KB the JIT won't compile them, which can cause performance 
> degradation. We have seen this for queries with a wide schema (30+ fields), 
> where agg_doAggregateWithKeys() can be more than 8KB. This is also a major 
> reason for performance regression when we enable the fast aggregate hashmap 
> (such as using VectorizedHashMapGenerator.scala).
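
A hedged way to check whether a query's generated methods are getting large is to print the generated code; debugCodegen below is Spark's developer/debugging helper, and the query itself is illustrative:

{code}
// Sketch: dump the generated Java source so oversized methods can be spotted.
import org.apache.spark.sql.execution.debug._

val df = spark.range(1000000L)
  .selectExpr("id", "id % 100 AS k")
  .groupBy("k")
  .count()

df.debugCodegen()   // prints each whole-stage-codegen subtree's generated source
{code}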



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19525) Enable Compression of Spark Streaming Checkpoints

2017-02-09 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860236#comment-15860236
 ] 

Shixiong Zhu commented on SPARK-19525:
--

[~rameshaaditya117] Sounds like a good idea. I thought the metadata checkpoint 
should be very small. How large are your checkpoint files?

> Enable Compression of Spark Streaming Checkpoints
> -
>
> Key: SPARK-19525
> URL: https://issues.apache.org/jira/browse/SPARK-19525
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Aaditya Ramesh
>
> In our testing, compressing partitions while writing them to checkpoints on 
> HDFS using snappy helped performance significantly while also reducing the 
> variability of the checkpointing operation. In our tests, checkpointing time 
> was reduced by 3X, and variability was reduced by 2X for data sets of 
> compressed size approximately 1 GB.
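A minimal sketch of the idea, assuming snappy-java and the Hadoop FileSystem API are on the classpath; this is illustrative only, not the proposed patch, and the paths are made up:
{code}
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.xerial.snappy.{SnappyInputStream, SnappyOutputStream}

// Default filesystem from core-site.xml (assumed to be HDFS here).
val fs = FileSystem.get(new Configuration())

// Write serialized checkpoint bytes through a Snappy-compressed stream.
def writeCompressed(path: String, bytes: Array[Byte]): Unit = {
  val out = new SnappyOutputStream(fs.create(new Path(path)))
  try out.write(bytes) finally out.close()
}

// Read them back, decompressing on the fly.
def readCompressed(path: String): Array[Byte] = {
  val in = new SnappyInputStream(fs.open(new Path(path)))
  try IOUtils.toByteArray(in) finally in.close()
}
{code}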



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13510) Shuffle may throw FetchFailedException: Direct buffer memory

2017-02-09 Thread Srikanth Daggumalli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860220#comment-15860220
 ] 

Srikanth Daggumalli commented on SPARK-13510:
-

The OOM is not an issue for us, but I would like to know about the following ERRORs:

16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 requests 
outstanding when connection from /10.196.134.220:7337 is closed
16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
shuffle_3_81_2, and will not retry (0 retries)

Why are they written to the logs at ERROR level?


> Shuffle may throw FetchFailedException: Direct buffer memory
> 
>
> Key: SPARK-13510
> URL: https://issues.apache.org/jira/browse/SPARK-13510
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Hong Shen
> Attachments: spark-13510.diff
>
>
> In our cluster, when I test spark-1.6.0 with a sql, it throw exception and 
> failed.
> {code}
> 16/02/17 15:36:03 INFO storage.ShuffleBlockFetcherIterator: Sending request 
> for 1 blocks (915.4 MB) from 10.196.134.220:7337
> 16/02/17 15:36:03 INFO shuffle.ExternalShuffleClient: External shuffle fetch 
> from 10.196.134.220:7337 (executor id 122)
> 16/02/17 15:36:03 INFO client.TransportClient: Sending fetch chunk request 0 
> to /10.196.134.220:7337
> 16/02/17 15:36:36 WARN server.TransportChannelHandler: Exception in 
> connection from /10.196.134.220:7337
> java.lang.OutOfMemoryError: Direct buffer memory
>   at java.nio.Bits.reserveMemory(Bits.java:658)
>   at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123)
>   at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
>   at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
>   at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
>   at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
>   at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
>   at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
>   at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at java.lang.Thread.run(Thread.java:744)
> 16/02/17 15:36:36 ERROR client.TransportResponseHandler: Still have 1 
> requests outstanding when connection from /10.196.134.220:7337 is closed
> 16/02/17 15:36:36 ERROR shuffle.RetryingBlockFetcher: Failed to fetch block 
> shuffle_3_81_2, and will not retry (0 retries)
> {code}
>   The reason is that when shuffling a big block (like 1 GB), the task will 
> allocate the same amount of memory, so it easily throws "FetchFailedException: 
> Direct buffer memory".
>   If I add -Dio.netty.noUnsafe=true to spark.executor.extraJavaOptions, it 
> throws 
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> io.netty.buffer.PoolArena$HeapArena.newUnpooledChunk(PoolArena.java:607)
> at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:237)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:215)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
> {code}
>   
>   In the MapReduce shuffle, it first checks whether the block can be cached 
> in memory, but Spark doesn't. 
>   If the block is larger than what we can cache in memory, we should write it 
> to disk.
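As a hedged aside (not from the original report), a few knobs that existed around Spark 1.6 can reduce direct-buffer pressure on the fetch path, though none of them splits a single oversized block, which is the core problem described above; the values are arbitrary examples:
{code}
import org.apache.spark.SparkConf

// Illustrative mitigations only; they do not fix the underlying issue.
val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "24m")        // smaller in-flight fetch budget
  .set("spark.shuffle.io.preferDirectBufs", "false")  // let Netty prefer heap buffers
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
{code}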



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860110#comment-15860110
 ] 

Joseph K. Bradley commented on SPARK-10802:
---

Linking the related issue for feature parity in the DataFrame-based API.


> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows getting recommendations for:
> - a single user 
> - a single product 
> - all users
> - all products
> Recommendations for all users/products do a cartesian join internally.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before the cartesian join. This would make it much faster 
> in situations where recommendations are needed only for a subset of 
> users/products, and where the subset is still too large to make one-by-one 
> recommendation feasible.
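A rough sketch of what the requested behaviour could look like today, purely illustrative: the helper name is made up, and it assumes the product factors fit in memory on each executor.
{code}
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Hypothetical helper: top-k products for only the given users, by joining the
// subset with the user factors and scoring against broadcast product factors.
def recommendProductsForUserSubset(model: MatrixFactorizationModel,
                                   users: RDD[Int],
                                   k: Int): RDD[(Int, Array[Rating])] = {
  val sc = users.sparkContext
  val products = sc.broadcast(model.productFeatures.collect())
  model.userFeatures
    .join(users.map(u => (u, ())))                 // intersect with the requested users
    .map { case (userId, (uf, _)) =>
      val top = products.value
        .map { case (productId, pf) =>
          val score = uf.zip(pf).map { case (a, b) => a * b }.sum
          Rating(userId, productId, score)
        }
        .sortBy(-_.rating)
        .take(k)
      (userId, top)
    }
}
{code}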



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19535) ALSModel recommendAll analogs

2017-02-09 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-19535:
-

 Summary: ALSModel recommendAll analogs
 Key: SPARK-19535
 URL: https://issues.apache.org/jira/browse/SPARK-19535
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley
Assignee: Sue Ann Hong


Add methods analogous to the spark.mllib MatrixFactorizationModel methods 
recommendProductsForUsers/UsersForProducts.

The initial implementation should be very simple, using DataFrame joins.  
Future work can add optimizations.

I recommend naming them:
* recommendForAllUsers
* recommendForAllItems
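A rough sketch of the simple join-based approach, purely illustrative and not the eventual implementation; it assumes the "id"/"features" column names exposed by ALSModel.userFactors and ALSModel.itemFactors:
{code}
import org.apache.spark.ml.recommendation.ALSModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Cross join the factor tables, score with a dot-product UDF, and keep the
// top-k items per user with a window function.
def recommendForAllUsers(model: ALSModel, k: Int): DataFrame = {
  val dot = udf { (u: Seq[Float], v: Seq[Float]) =>
    var s = 0.0f
    var i = 0
    while (i < u.length) { s += u(i) * v(i); i += 1 }
    s
  }
  val users = model.userFactors.select(col("id").as("userId"), col("features").as("uf"))
  val items = model.itemFactors.select(col("id").as("itemId"), col("features").as("itf"))
  users.crossJoin(items)
    .withColumn("rating", dot(col("uf"), col("itf")))
    .withColumn("rank",
      row_number().over(Window.partitionBy("userId").orderBy(col("rating").desc)))
    .filter(col("rank") <= k)
    .select("userId", "itemId", "rating")
}
{code}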




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860108#comment-15860108
 ] 

Joseph K. Bradley commented on SPARK-13857:
---

Hi all, catching up on the many ALS discussions now.  This work to support 
evaluation and tuning for recommendation is great, but I'm worried about it not 
being resolved in time for 2.2.  I've heard a lot of requests for the plain 
functionality available in spark.mllib for recommendUsers/Products, so I'd 
recommend we just add those methods for now as a short-term solution.  Let's 
keep working on the evaluation/tuning plans too.  I'll create a JIRA for adding 
basic recommendUsers/Products methods.

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19509) GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19509.
---
   Resolution: Fixed
 Assignee: StanZhai
Fix Version/s: 2.1.1
   2.0.3

> GROUPING SETS throws NullPointerException when use an empty column
> --
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: StanZhai
> Fix For: 2.0.3, 2.1.1
>
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> 

[jira] [Commented] (SPARK-19512) codegen for compare structs fails

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860053#comment-15860053
 ] 

Apache Spark commented on SPARK-19512:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/16875

> codegen for compare structs fails
> -
>
> Key: SPARK-19512
> URL: https://issues.apache.org/jira/browse/SPARK-19512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.2.0
>
>
> This (1 struct field)
> {code:java|title=1 struct field}
> spark.range(10)
>   .selectExpr("named_struct('a', id) as col1", "named_struct('a', id+2) 
> as col2")
>   .filter("col1 = col2").count
> {code}
> fails with
> {code}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 144, Column 32: Expression "range_value" is not an 
> rvalue
> {code}
> This (2 struct fields)
> {code:java|title=2 struct fields}
> spark.range(10)
> .selectExpr("named_struct('a', id, 'b', id) as col1", 
> "named_struct('a',id+2, 'b',id+2) as col2")
> .filter($"col1" === $"col2").count
> {code}
> fails with 
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: 1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory

2017-02-09 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860028#comment-15860028
 ] 

Shixiong Zhu commented on SPARK-19523:
--

You can create a HiveContext before creating the StreamingContext. Then, in your 
functions, you can use "SQLContext.getOrCreate(rdd.sparkContext)" to get the 
HiveContext. Although it returns a SQLContext, the real runtime type is the 
HiveContext.
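A minimal sketch of that pattern, illustrative only; the app name, input path, and batch interval are made up:
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("streaming-hive-sketch"))
// Create the HiveContext once, before the StreamingContext.
val hive = new HiveContext(sc)
val ssc = new StreamingContext(sc, Seconds(10))

val lines = ssc.textFileStream("/tmp/streaming-input")
lines.foreachRDD { rdd =>
  // Reuse the context created above instead of constructing a new HiveContext per batch.
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  rdd.toDF("line").createOrReplaceTempView("lines")
}

ssc.start()
ssc.awaitTermination()
{code}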

> Spark streaming+ insert into table leaves bunch of trash in table directory
> ---
>
> Key: SPARK-19523
> URL: https://issues.apache.org/jira/browse/SPARK-19523
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>Priority: Minor
>
> I have very simple code, which transforms incoming JSON files into a Parquet table:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.sql.SaveMode
> object Client_log {
>   def main(args: Array[String]): Unit = {
> val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * 
> from temp.x_streaming where year=2015 and month=12 and day=1").dtypes
> var columns = resultCols.filter(x => 
> !Commons.stopColumns.contains(x._1)).map({ case (name, types) => {
>   s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as 
> ${Commons.mapType(types)}) as $name"""
> }
> })
> columns ++= List("'streaming' as sourcefrom")
> def f(path:Path): Boolean = {
>   true
> }
> val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, 
> TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false)
> client_log_d_stream.foreachRDD(rdd => {
>   val localHiveContext = new HiveContext(rdd.sparkContext)
>   import localHiveContext.implicits._
>   var input = rdd.map(x => Record(x._2.toString)).toDF()
>   input = input.selectExpr(columns: _*)
>   input =
> SmallOperators.populate(input, resultCols)
>   input
> .write
> .mode(SaveMode.Append)
> .format("parquet")
> .insertInto("temp.x_streaming")
> })
> Spark.ssc.start()
> Spark.ssc.awaitTermination()
>   }
>   case class Record(s: String)
> }
> {code}
> This code generates a lot of trash directories in the result table, like:
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory

2017-02-09 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860025#comment-15860025
 ] 

Egor Pahomov commented on SPARK-19523:
--

Probably my bad. I have {code} rdd.sparkContext {code} inside the lambda. I do 
not have {code} rdd.sqlContext {code}. Where can I get it? 

> Spark streaming+ insert into table leaves bunch of trash in table directory
> ---
>
> Key: SPARK-19523
> URL: https://issues.apache.org/jira/browse/SPARK-19523
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>Priority: Minor
>
> I have very simple code, which transforms incoming JSON files into a Parquet table:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.sql.SaveMode
> object Client_log {
>   def main(args: Array[String]): Unit = {
> val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * 
> from temp.x_streaming where year=2015 and month=12 and day=1").dtypes
> var columns = resultCols.filter(x => 
> !Commons.stopColumns.contains(x._1)).map({ case (name, types) => {
>   s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as 
> ${Commons.mapType(types)}) as $name"""
> }
> })
> columns ++= List("'streaming' as sourcefrom")
> def f(path:Path): Boolean = {
>   true
> }
> val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, 
> TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false)
> client_log_d_stream.foreachRDD(rdd => {
>   val localHiveContext = new HiveContext(rdd.sparkContext)
>   import localHiveContext.implicits._
>   var input = rdd.map(x => Record(x._2.toString)).toDF()
>   input = input.selectExpr(columns: _*)
>   input =
> SmallOperators.populate(input, resultCols)
>   input
> .write
> .mode(SaveMode.Append)
> .format("parquet")
> .insertInto("temp.x_streaming")
> })
> Spark.ssc.start()
> Spark.ssc.awaitTermination()
>   }
>   case class Record(s: String)
> }
> {code}
> This code generates a lot of trash directories in the result table, like:
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-09 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15860023#comment-15860023
 ] 

Egor Pahomov commented on SPARK-19524:
--

I'm really confused. I expected "new" to mean files created after the start of 
the streaming job, and "old" to be everything else in the folder. If we change 
the definition of "new", then I suppose everything is consistent; I'm just not 
sure that this definition of "new" is very intuitive. I want to process 
everything in the folder, existing and upcoming, so I use this flag, and it 
turns out the flag has its own definition of "new". Maybe I'm not correct to 
call it a bug, but isn't it all very confusing for a person who does not really 
know how everything works inside? 

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it does not work the way one would expect. 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not at all clear about what the code is trying to do.
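For reference, a minimal sketch of the call being discussed; the app name, directory, and batch interval are just examples. Per the Stack Overflow thread above, even with newFilesOnly = false only files modified within a bounded "remember" window are picked up, which is the behaviour the reporter found surprising:
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("file-stream-sketch"), Seconds(30))

// newFilesOnly = false does not mean "process the whole directory history".
val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/data/incoming", (path: Path) => true, newFilesOnly = false)

stream.map(_._2.toString).print()
ssc.start()
ssc.awaitTermination()
{code}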



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19481) Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in ClosureCleaner

2017-02-09 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-19481.

   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16825
[https://github.com/apache/spark/pull/16825]

> Fix flaky test: o.a.s.repl.ReplSuite should clone and clean line object in 
> ClosureCleaner
> -
>
> Key: SPARK-19481
> URL: https://issues.apache.org/jira/browse/SPARK-19481
> Project: Spark
>  Issue Type: Test
>  Components: Spark Shell
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.1.1, 2.2.0
>
>
> org.apache.spark.repl.cancelOnInterrupt leaks a SparkContext and makes the 
> tests unstable. See:
> http://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.repl.ReplSuite_name=should+clone+and+clean+line+object+in+ClosureCleaner



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10141) Number of tasks on executors still become negative after failures

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10141.
-
Resolution: Done

> Number of tasks on executors still become negative after failures
> -
>
> Key: SPARK-10141
> URL: https://issues.apache.org/jira/browse/SPARK-10141
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png
>
>
> I hit this failure when running LDA on EC2 (after I made the model size 
> really big).
> I was using the LDAExample.scala code on an EC2 cluster with 16 workers 
> (r3.2xlarge), on a Wikipedia dataset:
> {code}
> Training set size (documents) 4534059
> Vocabulary size (terms)   1
> Training set size (tokens)895575317
> EM optimizer
> 1K topics
> {code}
> Failure message:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in 
> stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 
> (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to 
> /10.0.202.128:54740
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   ... 1 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1805)
>   at 

[jira] [Assigned] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17975:
-

Assignee: Tathagata Das

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
>Assignee: Tathagata Das
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it is referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: 

[jira] [Resolved] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17975.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it is referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, 

[jira] [Updated] (SPARK-14804) Graph vertexRDD/EdgeRDD checkpoint results ClassCastException:

2017-02-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14804:
--
Fix Version/s: (was: 3.0.0)
   2.2.0

> Graph vertexRDD/EdgeRDD checkpoint results ClassCastException: 
> ---
>
> Key: SPARK-14804
> URL: https://issues.apache.org/jira/browse/SPARK-14804
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: SuYan
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> {code}
> graph3.vertices.checkpoint()
> graph3.vertices.count()
> graph3.vertices.map(_._2).count()
> {code}
> 16/04/21 21:04:43 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 4.0 
> (TID 13, localhost): java.lang.ClassCastException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition cannot be cast to 
> scala.Tuple2
>   at 
> com.xiaomi.infra.codelab.spark.Graph2$$anonfun$main$1.apply(Graph2.scala:80)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1597)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1161)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1863)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:91)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> look at the code:
> {code}
>   private[spark] def computeOrReadCheckpoint(split: Partition, context: 
> TaskContext): Iterator[T] =
>   {
> if (isCheckpointedAndMaterialized) {
>   firstParent[T].iterator(split, context)
> } else {
>   compute(split, context)
> }
>   }
>  private[spark] def isCheckpointedAndMaterialized: Boolean = isCheckpointed
>  override def isCheckpointed: Boolean = {
>firstParent[(PartitionID, EdgePartition[ED, VD])].isCheckpointed
>  }
> {code}
> For VertexRDD or EdgeRDD, the first parent is its partitionRDD: 
> RDD[ShippableVertexPartition[VD]] / RDD[(PartitionID, EdgePartition[ED, VD])]
> 1. We call vertexRDD.checkpoint; its partitionRDD will be checkpointed, so 
> VertexRDD.isCheckpointedAndMaterialized = true.
> 2. Then we call vertexRDD.iterator; because isCheckpointed = true, it calls 
> firstParent.iterator (which is not a CheckpointRDD, but actually the partitionRDD). 
>  
> So the returned iterator is an Iterator[ShippableVertexPartition], not the 
> expected Iterator[(VertexId, VD)].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16554) Spark should kill executors when they are blacklisted

2017-02-09 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-16554.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16650
[https://github.com/apache/spark/pull/16650]

> Spark should kill executors when they are blacklisted
> -
>
> Key: SPARK-16554
> URL: https://issues.apache.org/jira/browse/SPARK-16554
> Project: Spark
>  Issue Type: New Feature
>  Components: Scheduler
>Reporter: Imran Rashid
>Assignee: Jose Soltren
> Fix For: 2.2.0
>
>
> SPARK-8425 will allow blacklisting faulty executors and nodes.  However, 
> these blacklisted executors will continue to run.  This is bad for a few 
> reasons:
> (1) Even if there is faulty-hardware, if the cluster is under-utilized spark 
> may be able to request another executor on a different node.
> (2) If there is a faulty-disk (the most common case of faulty-hardware), the 
> cluster manager may be able to allocate another executor on the same node, if 
> it can exclude the bad disk.  (Yarn will do this with its disk-health 
> checker.)
> With dynamic allocation, this may seem less critical, as a blacklisted 
> executor will stop running new tasks and eventually get reclaimed.  However, 
> if there is cached data on those executors, they will not get killed till 
> {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} expires, which is 
> (effectively) infinite by default.
> Users may not *always* want to kill bad executors, so this must be 
> configurable to some extent.  At a minimum, it should be possible to enable / 
> disable it; perhaps the executor should be killed after it has been 
> blacklisted a configurable {{N}} times.
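As a hedged illustration of the end state only: spark.blacklist.enabled and the dynamic-allocation timeout mentioned above already exist, while the exact name of the kill switch was settled in the PR, so treat it as an assumption:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.enabled", "true")
  // Assumed name for the kill behaviour added by this issue:
  .set("spark.blacklist.killBlacklistedExecutors", "true")
  // Mentioned above: cached executors are otherwise kept (effectively) forever.
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1h")
{code}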



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17975) EMLDAOptimizer fails with ClassCastException on YARN

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859984#comment-15859984
 ] 

Joseph K. Bradley commented on SPARK-17975:
---

Will do, thanks!

> EMLDAOptimizer fails with ClassCastException on YARN
> 
>
> Key: SPARK-17975
> URL: https://issues.apache.org/jira/browse/SPARK-17975
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
> Environment: Centos 6, CDH 5.7, Java 1.7u80
>Reporter: Jeff Stein
> Attachments: docs.txt
>
>
> I'm able to reproduce the error consistently with a 2000 record text file 
> with each record having 1-5 terms and checkpointing enabled. It looks like 
> the problem was introduced with the resolution for SPARK-13355.
> The EdgeRDD class seems to be lying about its type in a way that causes the 
> RDD.mapPartitionsWithIndex method to be unusable when it is referenced as an 
> RDD of Edge elements.
> {code}
> val spark = SparkSession.builder.appName("lda").getOrCreate()
> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints")
> val data: RDD[(Long, Vector)] = // snip
> data.setName("data").cache()
> val lda = new LDA
> val optimizer = new EMLDAOptimizer
> lda.setOptimizer(optimizer)
>   .setK(10)
>   .setMaxIterations(400)
>   .setAlpha(-1)
>   .setBeta(-1)
>   .setCheckpointInterval(7)
> val ldaModel = lda.run(data)
> {code}
> {noformat}
> 16/10/16 23:53:54 WARN TaskSetManager: Lost task 3.0 in stage 348.0 (TID 
> 1225, server2.domain): java.lang.ClassCastException: scala.Tuple2 cannot be 
> cast to org.apache.spark.graphx.Edge
>   at 
> org.apache.spark.graphx.EdgeRDD$$anonfun$1$$anonfun$apply$1.apply(EdgeRDD.scala:107)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
>   at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$25.apply(RDD.scala:820)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:332)
>   at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:330)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:935)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:926)
>   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
>   at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:926)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:670)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:281)
>   at org.apache.spark.graphx.EdgeRDD.compute(EdgeRDD.scala:50)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:722)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-10141) Number of tasks on executors still become negative after failures

2017-02-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859982#comment-15859982
 ] 

Joseph K. Bradley commented on SPARK-10141:
---

I'll close this if no one has seen it in Spark 2.0 or 2.1.  Thanks all!

> Number of tasks on executors still become negative after failures
> -
>
> Key: SPARK-10141
> URL: https://issues.apache.org/jira/browse/SPARK-10141
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Attachments: Screen Shot 2015-08-20 at 3.14.49 PM.png
>
>
> I hit this failure when running LDA on EC2 (after I made the model size 
> really big).
> I was using the LDAExample.scala code on an EC2 cluster with 16 workers 
> (r3.2xlarge), on a Wikipedia dataset:
> {code}
> Training set size (documents) 4534059
> Vocabulary size (terms)   1
> Training set size (tokens)895575317
> EM optimizer
> 1K topics
> {code}
> Failure message:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 55 in 
> stage 22.0 failed 4 times, most recent failure: Lost task 55.3 in stage 22.0 
> (TID 2881, 10.0.202.128): java.io.IOException: Failed to connect to 
> /10.0.202.128:54740
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
>   at 
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
>   at 
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused: /10.0.202.128:54740
>   at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>   at 
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
>   at 
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>   at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   ... 1 more
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1267)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1255)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1254)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:684)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:684)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1480)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1442)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1431)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:554)
>   at 

[jira] [Resolved] (SPARK-19025) Remove SQL builder for operators

2017-02-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19025.
---
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.2.0

> Remove SQL builder for operators
> 
>
> Key: SPARK-19025
> URL: https://issues.apache.org/jira/browse/SPARK-19025
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
> Fix For: 2.2.0
>
>
> With the new approach to view resolution, we can get rid of SQL generation on 
> view creation, so let's remove the SQL builder for operators.
> Note that, since all SQL generation for operators is defined in one file 
> (org.apache.spark.sql.catalyst.SQLBuilder), it’d be trivial to recover it in 
> the future.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19512) codegen for compare structs fails

2017-02-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19512.
---
   Resolution: Fixed
 Assignee: Bogdan Raducanu
Fix Version/s: 2.2.0

> codegen for compare structs fails
> -
>
> Key: SPARK-19512
> URL: https://issues.apache.org/jira/browse/SPARK-19512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Bogdan Raducanu
>Assignee: Bogdan Raducanu
> Fix For: 2.2.0
>
>
> This (1 struct field)
> {code:java|title=1 struct field}
> spark.range(10)
>   .selectExpr("named_struct('a', id) as col1", "named_struct('a', id+2) 
> as col2")
>   .filter("col1 = col2").count
> {code}
> fails with
> {code}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 144, Column 32: Expression "range_value" is not an 
> rvalue
> {code}
> This (2 struct fields)
> {code:java|title=2 struct fields}
> spark.range(10)
> .selectExpr("named_struct('a', id, 'b', id) as col1", 
> "named_struct('a',id+2, 'b',id+2) as col2")
> .filter($"col1" === $"col2").count
> {code}
> fails with 
> {code}
> Caused by: java.lang.IndexOutOfBoundsException: 1
>   at 
> scala.collection.LinearSeqOptimized$class.apply(LinearSeqOptimized.scala:65)
>   at scala.collection.immutable.List.apply(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.BoundReference.doGenCode(BoundAttribute.scala:64)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19514) Range is not interruptible

2017-02-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-19514.
-
   Resolution: Fixed
 Assignee: Ala Luszczak
Fix Version/s: 2.2.0

> Range is not interruptible
> --
>
> Key: SPARK-19514
> URL: https://issues.apache.org/jira/browse/SPARK-19514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Assignee: Ala Luszczak
> Fix For: 2.2.0
>
>
> Currently Range cannot be interrupted.
> For example, if you start executing
> spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count()
> and then call
> DAGScheduler.cancelStage(...)
> the execution won't stop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18005) optional binary Dataframe Column throws (UTF8) is not a group while loading a Dataframe

2017-02-09 Thread Eric Maynard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859913#comment-15859913
 ] 

Eric Maynard commented on SPARK-18005:
--

This bug appears in the 1.6.x branch as well.

> optional binary Dataframe Column throws (UTF8) is not a group while loading a 
> Dataframe
> ---
>
> Key: SPARK-18005
> URL: https://issues.apache.org/jira/browse/SPARK-18005
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
>Reporter: ABHISHEK CHOUDHARY
>
> In some scenarios, while loading a Parquet file, Spark throws an exception 
> such as:
> java.lang.ClassCastException: optional binary CertificateChains (UTF8) is not 
> a group
> The entire DataFrame is not corrupted: the first 20 rows of the data load 
> fine, but trying to load the next row throws the error, and any operation 
> over the entire dataset (such as count) throws the same exception.
> Full Exception Stack -
> {quote}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
> stage 594.0 failed 4 times, most recent failure: Lost task 2.3 in stage 594.0 
> (TID 6726, ): java.lang.ClassCastException: optional binary CertificateChains 
> (UTF8) is not a group
>   at org.apache.parquet.schema.Type.asGroupType(Type.java:202)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.org$apache$spark$sql$execution$datasources$parquet$ParquetReadSupport$$clipParquetType(ParquetReadSupport.scala:122)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1$$anonfun$apply$1.apply(ParquetReadSupport.scala:272)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:272)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$$anonfun$clipParquetGroupFields$1.apply(ParquetReadSupport.scala:269)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetGroupFields(ParquetReadSupport.scala:269)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport$.clipParquetSchema(ParquetReadSupport.scala:111)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetReadSupport.init(ParquetReadSupport.scala:67)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:168)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:192)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:377)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:339)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:116)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.columnar.InMemoryRelation$$anonfun$3$$anon$1.hasNext(InMemoryRelation.scala:151)
>   at 
> org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:213)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
>   at 
> 

[jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795
 ] 

Christophe Préaud edited comment on SPARK-19477 at 2/9/17 5:01 PM:
---

As a workaround, mapping the dataset with *identity* fixes the issue:

{code:java}
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
{code}


was (Author: preaudc):
As a workaround, mapping the dataset with *identity* fixes the issue:

{code:scala}
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
{code}

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe Préaud updated SPARK-19477:
--
Comment: was deleted

(was: As a workaround, mapping the dataset with *identity* fixes the issue:

{code:java}
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
{code})

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795
 ] 

Christophe Préaud edited comment on SPARK-19477 at 2/9/17 5:00 PM:
---

As a workaround, mapping the dataset with *identity* fixes the issue:

{code:scala}
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
{code}


was (Author: preaudc):
As a workaround, mapping the dataset with *identity* fixes the issue:

```
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795
 ] 

Christophe Préaud edited comment on SPARK-19477 at 2/9/17 4:59 PM:
---

As a workaround, mapping the dataset with *identity* fixes the issue:

```
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```


was (Author: preaudc):
As a workaround, mapping the dataset with *identity* fixes the issue:

```scala
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19477) [SQL] Datasets created from a Dataframe with extra columns retain the extra columns

2017-02-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-19477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859795#comment-15859795
 ] 

Christophe Préaud commented on SPARK-19477:
---

As a workaround, mapping the dataset with *identity* fixes the issue:

```scala
scala> ds.show
+---+---+---+---+
| f1| f2| f3| c4|
+---+---+---+---+
|  a|  b|  c|  x|
|  d|  e|  f|  y|
|  h|  i|  j|  z|
+---+---+---+---+

scala> ds.map(identity).show
+---+---+---+
| f1| f2| f3|
+---+---+---+
|  a|  b|  c|
|  d|  e|  f|
|  h|  i|  j|
+---+---+---+
```

> [SQL] Datasets created from a Dataframe with extra columns retain the extra 
> columns
> ---
>
> Key: SPARK-19477
> URL: https://issues.apache.org/jira/browse/SPARK-19477
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Don Drake
>
> In 1.6, when you created a Dataset from a Dataframe that had extra columns, 
> the columns not in the case class were dropped from the Dataset.
> For example in 1.6, the column c4 is gone:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import sqlContext.implicits._
> import sqlContext.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string, f3: string, c4: 
> string]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string, f3: string]
> scala> ds.show
> +---+---+---+
> | f1| f2| f3|
> +---+---+---+
> |  a|  b|  c|
> |  d|  e|  f|
> |  h|  i|  j|
> {code}
> This seems to have changed in Spark 2.0 and also 2.1:
> Spark 2.1.0:
> {code}
> scala> case class F(f1: String, f2: String, f3:String)
> defined class F
> scala> import spark.implicits._
> import spark.implicits._
> scala> val df = Seq(("a","b","c","x"), ("d", "e", "f","y"), ("h", "i", 
> "j","z")).toDF("f1", "f2", "f3", "c4")
> df: org.apache.spark.sql.DataFrame = [f1: string, f2: string ... 2 more 
> fields]
> scala> val ds = df.as[F]
> ds: org.apache.spark.sql.Dataset[F] = [f1: string, f2: string ... 2 more 
> fields]
> scala> ds.show
> +---+---+---+---+
> | f1| f2| f3| c4|
> +---+---+---+---+
> |  a|  b|  c|  x|
> |  d|  e|  f|  y|
> |  h|  i|  j|  z|
> +---+---+---+---+
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> val fEncoder = Encoders.product[F]
> fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: 
> string, f3[0]: string]
> scala> fEncoder.schema == ds.schema
> res2: Boolean = false
> scala> ds.schema
> res3: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true), StructField(c4,StringType,true))
> scala> fEncoder.schema
> res4: org.apache.spark.sql.types.StructType = 
> StructType(StructField(f1,StringType,true), StructField(f2,StringType,true), 
> StructField(f3,StringType,true))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use

2017-02-09 Thread Peter D Kirchner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859737#comment-15859737
 ] 

Peter D Kirchner commented on SPARK-19348:
--

Per the above, perhaps a fix could address both thread safety and the orphaned 
modifications to variables in the wrapped function bodies.  For instance, in 
Pipeline.__init__() and setParams() the fix could remove references to 
_input_kwargs and instead invoke setParams(stages=stages) within __init__() and 
_set(stages=stages) within setParams(), respectively.  Parallel changes would 
be needed in all wrapped functions, but the resulting code would be functional, 
readable, and thread-safe.

A fix for Pipeline alone is insufficient, because multiple pipelines could have 
stages consisting of instances of the same class, e.g. LogisticRegression.  All 
the classes using @keyword_only need to be addressed by whatever fix is decided 
upon.
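
For illustration, here is a minimal sketch of the shape of the change described 
in the first paragraph above; the class below is a toy stand-in written for 
this comment, not the actual pyspark.ml source:

{code}
# Toy sketch only: __init__ and setParams forward their arguments explicitly
# instead of stashing them in a shared _input_kwargs attribute, so concurrent
# construction in different threads cannot see each other's keyword arguments.

class Pipeline(object):                      # stand-in for pyspark.ml.Pipeline
    def __init__(self, stages=None):
        self.stages = []
        self.setParams(stages=stages)        # explicit forwarding, no _input_kwargs

    def setParams(self, stages=None):
        self._set(stages=stages)             # explicit forwarding, no _input_kwargs
        return self

    def _set(self, **kwargs):                # stand-in for Params._set
        for name, value in kwargs.items():
            if value is not None:
                setattr(self, name, value)

# Each thread builds its own Pipeline; no module- or class-level state is shared.
p1 = Pipeline(stages=["tokenizer", "logreg"])
p2 = Pipeline(stages=["assembler"])
assert p1.stages == ["tokenizer", "logreg"]
assert p2.stages == ["assembler"]
{code}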

> pyspark.ml.Pipeline gets corrupted under multi threaded use
> ---
>
> Key: SPARK-19348
> URL: https://issues.apache.org/jira/browse/SPARK-19348
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Vinayak Joshi
> Attachments: pyspark_pipeline_threads.py
>
>
> When pyspark.ml.Pipeline objects are constructed concurrently in separate 
> python threads, it is observed that the stages used to construct a pipeline 
> object get corrupted, i.e. the stages supplied to a Pipeline object in one 
> thread appear inside a different Pipeline object constructed in a different 
> thread. 
> Things work fine if construction of pyspark.ml.Pipeline objects is 
> serialized, so this looks like a thread safety problem with 
> pyspark.ml.Pipeline object construction. 
> Confirmed that the problem exists with Spark 1.6.x as well as 2.x.
> While the corruption of the Pipeline stages is easily caught, we need to know 
> whether other pipeline operations, such as pyspark.ml.pipeline.fit(), are 
> also affected by the underlying cause of this problem. That is, whether 
> other pipeline operations like pyspark.ml.pipeline.fit() may be performed 
> in separate threads (on distinct pipeline objects) concurrently without any 
> cross-contamination between them.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-19494) Remove java7 pom profile

2017-02-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin deleted SPARK-19494:



> Remove java7 pom profile
> 
>
> Key: SPARK-19494
> URL: https://issues.apache.org/jira/browse/SPARK-19494
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19494) Remove java7 pom profile

2017-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19494:
--
Priority: Trivial  (was: Major)

(I think this part is pretty trivial and covered in the base change for the 
parent JIRA.)

> Remove java7 pom profile
> 
>
> Key: SPARK-19494
> URL: https://issues.apache.org/jira/browse/SPARK-19494
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features

2017-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19534:
--
Issue Type: Sub-task  (was: Task)
Parent: SPARK-19493

> Convert Java tests to use lambdas, Java 8 features
> --
>
> Key: SPARK-19534
> URL: https://issues.apache.org/jira/browse/SPARK-19534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a 
> significant sub-task in its own right. This shouldn't mean that 'old' APIs go 
> untested because there are no separate Java 8 APIs; it's just syntactic sugar 
> for calls to the same APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19534) Convert Java tests to use lambdas, Java 8 features

2017-02-09 Thread Sean Owen (JIRA)
Sean Owen created SPARK-19534:
-

 Summary: Convert Java tests to use lambdas, Java 8 features
 Key: SPARK-19534
 URL: https://issues.apache.org/jira/browse/SPARK-19534
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 2.2.0
Reporter: Sean Owen
Assignee: Sean Owen


Likewise, Java tests can be simplified by use of Java 8 lambdas. This is a 
significant sub-task in its own right. This shouldn't mean that 'old' APIs go 
untested because there are no separate Java 8 APIs; it's just syntactic sugar 
for calls to the same APIs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-09 Thread Sean Owen (JIRA)
Sean Owen created SPARK-19533:
-

 Summary: Convert Java examples to use lambdas, Java 8 features
 Key: SPARK-19533
 URL: https://issues.apache.org/jira/browse/SPARK-19533
 Project: Spark
  Issue Type: Task
  Components: Examples
Affects Versions: 2.2.0
Reporter: Sean Owen
Assignee: Sean Owen


As a subtask of the overall migration to Java 8, we can and probably should use 
Java 8 lambdas to simplify the Java examples. I'm marking this as a subtask in 
its own right because it's a pretty big change by lines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19533) Convert Java examples to use lambdas, Java 8 features

2017-02-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-19533:
--
Issue Type: Sub-task  (was: Task)
Parent: SPARK-19493

> Convert Java examples to use lambdas, Java 8 features
> -
>
> Key: SPARK-19533
> URL: https://issues.apache.org/jira/browse/SPARK-19533
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>
> As a subtask of the overall migration to Java 8, we can and probably should 
> use Java 8 lambdas to simplify the Java examples. I'm marking this as a 
> subtask in its own right because it's a pretty big change by lines.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19532) [Core]`DataStreamer for file` threads of DFSOutputStream leak if set `spark.speculation` to true

2017-02-09 Thread StanZhai (JIRA)
StanZhai created SPARK-19532:


 Summary: [Core]`DataStreamer for file` threads of DFSOutputStream 
leak if set `spark.speculation` to true
 Key: SPARK-19532
 URL: https://issues.apache.org/jira/browse/SPARK-19532
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.1.0
Reporter: StanZhai
Priority: Blocker


When `spark.speculation` is set to true, from the thread dump page of an 
Executor in the WebUI, I found about 1300 threads named "DataStreamer for file 
/test/data/test_temp/_temporary/0/_temporary/attempt_20170207172435_80750_m_69_1/part-00069-690407af-0900-46b1-9590-a6d6c696fe68.snappy.parquet"
 in TIMED_WAITING state.

{code}
java.lang.Object.wait(Native Method)
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:564)
{code}

Off-heap memory usage keeps growing until the Executor exits with an OOM exception.

This problem occurs only when writing data to Hadoop (tasks may be killed by 
the Executor during writing).
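
For reference, a minimal PySpark sketch of the configuration involved (the app 
name and output path are placeholders, not taken from this report):

{code}
from pyspark.sql import SparkSession

# Enable speculative execution before any jobs run, then write to HDFS,
# which is the combination described above.
spark = (SparkSession.builder
         .appName("speculation-leak-repro")        # placeholder app name
         .config("spark.speculation", "true")
         .getOrCreate())

df = spark.range(0, 1000000)
df.write.mode("overwrite").parquet("hdfs:///tmp/speculation_repro")  # placeholder path
{code}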




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-19509:
--
Priority: Major  (was: Critical)

> [SQL]GROUPING SETS throws NullPointerException when use an empty column
> ---
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at 

[jira] [Updated] (SPARK-19509) GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-19509:
--
Summary: GROUPING SETS throws NullPointerException when use an empty column 
 (was: [SQL]GROUPING SETS throws NullPointerException when use an empty column)

> GROUPING SETS throws NullPointerException when use an empty column
> --
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> 

[jira] [Commented] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859541#comment-15859541
 ] 

Apache Spark commented on SPARK-19509:
--

User 'stanzhai' has created a pull request for this issue:
https://github.com/apache/spark/pull/16874

> [SQL]GROUPING SETS throws NullPointerException when use an empty column
> ---
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> 

[jira] [Commented] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859528#comment-15859528
 ] 

Apache Spark commented on SPARK-19509:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/16873

> [SQL]GROUPING SETS throws NullPointerException when use an empty column
> ---
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 

[jira] [Assigned] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when use an empty column

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19509:


Assignee: (was: Apache Spark)

> [SQL]GROUPING SETS throws NullPointerException when use an empty column
> ---
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Priority: Critical
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at 

[jira] [Assigned] (SPARK-19509) [SQL]GROUPING SETS throws NullPointerException when using an empty column

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19509:


Assignee: Apache Spark

> [SQL]GROUPING SETS throws NullPointerException when using an empty column
> ---
>
> Key: SPARK-19509
> URL: https://issues.apache.org/jira/browse/SPARK-19509
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Apache Spark
>Priority: Critical
>
> {code:sql|title=A simple case}
> select count(1) from test group by e grouping sets(e)
> {code}
> {code:title=Schema of the test table}
> scala> spark.sql("desc test").show()
> ++-+---+
> |col_name|data_type|comment|
> ++-+---+
> |   e|   string|   null|
> ++-+---+
> {code}
> {code:sql|title=The column `e` is empty}
> scala> spark.sql("select e from test").show()
> ++
> |   e|
> ++
> |null|
> |null|
> ++
> {code}
> {code:title=Exception}
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
>   at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
>   at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2371)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
>   at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2765)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2370)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2377)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2113)
>   at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2795)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:636)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:595)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:604)
>   ... 48 elided
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> 

[jira] [Commented] (SPARK-19514) Range is not interruptible

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859499#comment-15859499
 ] 

Apache Spark commented on SPARK-19514:
--

User 'ala' has created a pull request for this issue:
https://github.com/apache/spark/pull/16872

> Range is not interruptible
> --
>
> Key: SPARK-19514
> URL: https://issues.apache.org/jira/browse/SPARK-19514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>
> Currently Range cannot be interrupted.
> For example, if you start executing
> spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count()
> and then call
> DAGScheduler.cancelStage(...)
> the execution won't stop.
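
For illustration, a rough spark-shell sketch of the behaviour being reported. The 
value of A_LOT is made up, and cancellation is requested here through the public 
SparkContext API rather than through DAGScheduler directly:
{code}
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical size, just large enough that the job runs for a while.
val A_LOT = 1000000000L

// Run the expensive count on another thread so the shell stays responsive.
val pending = Future {
  spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count()
}

Thread.sleep(5000)

// Ask Spark to cancel everything; per this ticket, the already-running Range
// tasks are not expected to stop promptly.
spark.sparkContext.cancelAllJobs()
{code}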



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19514) Range is not interruptible

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19514:


Assignee: Apache Spark

> Range is not interruptible
> --
>
> Key: SPARK-19514
> URL: https://issues.apache.org/jira/browse/SPARK-19514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>Assignee: Apache Spark
>
> Currently Range cannot be interrupted.
> For example, if you start executing
> spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count()
> and then call
> DAGScheduler.cancelStage(...)
> the execution won't stop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19514) Range is not interruptible

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19514:


Assignee: (was: Apache Spark)

> Range is not interruptible
> --
>
> Key: SPARK-19514
> URL: https://issues.apache.org/jira/browse/SPARK-19514
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ala Luszczak
>
> Currently Range cannot be interrupted.
> For example, if you start executing
> spark.range(0, A_LOT, 1).crossJoin(spark.range(0, A_LOT, 1)).count()
> and then call
> DAGScheduler.cancelStage(...)
> the execution won't stop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17874) Additional SSL port on HistoryServer should be configurable

2017-02-09 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-17874.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.2.0

> Additional SSL port on HistoryServer should be configurable
> ---
>
> Key: SPARK-17874
> URL: https://issues.apache.org/jira/browse/SPARK-17874
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.0.1
>Reporter: Andrew Ash
>Assignee: Marcelo Vanzin
> Fix For: 2.2.0
>
>
> When turning on SSL on the HistoryServer with 
> {{spark.ssl.historyServer.enabled=true}} this opens up a second port, at the 
> [hardcoded|https://github.com/apache/spark/blob/v2.0.1/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L262]
>  result of calculating {{spark.history.ui.port + 400}}, and sets up a 
> redirect from the original (http) port to the new (https) port.
> {noformat}
> $ netstat -nlp | grep 23714
> (Not all processes could be identified, non-owned process info
>  will not be shown, you would have to be root to see it all.)
> tcp0  0 :::18080:::*
> LISTEN  23714/java
> tcp0  0 :::18480:::*
> LISTEN  23714/java
> {noformat}
> By enabling {{spark.ssl.historyServer.enabled}} I would have expected the one 
> open port to change protocol from http to https, not to end up with 1) an 
> additional port open, 2) the http port still open, and 3) the additional port 
> at a value I didn't specify.
> A fix could take one of two approaches:
> Approach 1:
> - one port always, which is configured with {{spark.history.ui.port}}
> - the protocol on that port is http by default
> - or, if {{spark.ssl.historyServer.enabled=true}}, it's https
> Approach 2:
> - add a new configuration item {{spark.history.ui.sslPort}} which configures 
> the second port that starts up
> In approach 1 we would probably need a way to tell Spark jobs whether the 
> history server uses SSL or not, based on SPARK-16988.
> That makes me think we should go with approach 2.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19531) History server doesn't refresh jobs for long-life apps like thriftserver

2017-02-09 Thread Oleg Danilov (JIRA)
Oleg Danilov created SPARK-19531:


 Summary: History server doesn't refresh jobs for long-life apps 
like thriftserver
 Key: SPARK-19531
 URL: https://issues.apache.org/jira/browse/SPARK-19531
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Oleg Danilov


If spark.history.fs.logDirectory points to HDFS, then the Spark history server 
doesn't refresh the jobs page. This is caused by Hadoop: while writing to the 
.inprogress file, Hadoop doesn't update the file length until close, and therefore 
Spark's history server is not able to detect any changes.

I'm going to submit a PR to fix this.
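
For reference, a small sketch of what the history server effectively sees. The log 
directory below is hypothetical; the point is that the length reported for an 
.inprogress file does not move until the file is closed, so polling it shows no 
change for long-running applications:
{code}
import org.apache.hadoop.fs.Path

// Hypothetical event log directory; substitute your spark.history.fs.logDirectory.
val logDir = new Path("hdfs:///spark-history")
val fs = logDir.getFileSystem(spark.sparkContext.hadoopConfiguration)

// For a long-lived app (e.g. the thrift server), the reported length of the
// .inprogress file stays frozen between calls, so nothing appears to change.
fs.listStatus(logDir)
  .filter(_.getPath.getName.endsWith(".inprogress"))
  .foreach(s => println(s"${s.getPath.getName}: ${s.getLen} bytes"))
{code}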



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19416) Dataset.schema is inconsistent with Dataset in handling columns with periods

2017-02-09 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859401#comment-15859401
 ] 

Thomas Sebastian commented on SPARK-19416:
--

[~josephkb]
If I understand correctly, the consistent behaviour should be as follows:
the statement df.schema("`a.b`") should succeed and
df.schema("a.b") should fail.
Please confirm.

+ [~jayadevan.m]
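
A sketch of the check being discussed (assuming Spark 2.1 in spark-shell), using 
Try so the succeed/fail pattern from the description below is visible side by side:
{code}
import scala.util.Try

val df = Seq(("foo", 1), ("bar", 2)).toDF("a.b", "id")

// Today: schema() accepts the raw name and rejects the back-quoted one...
println(Try(df.schema("a.b")).isSuccess)    // true
println(Try(df.schema("`a.b`")).isSuccess)  // false

// ...which is the opposite of select(), where only the back-quoted form works.
println(Try(df.select("`a.b`")).isSuccess)  // true
println(Try(df.select("a.b")).isSuccess)    // false
{code}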

> Dataset.schema is inconsistent with Dataset in handling columns with periods
> 
>
> Key: SPARK-19416
> URL: https://issues.apache.org/jira/browse/SPARK-19416
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> When you have a DataFrame with a column with a period in its name, the API is 
> inconsistent about how to quote the column name.
> Here's a reproduction:
> {code}
> import org.apache.spark.sql.functions.col
> val rows = Seq(
>   ("foo", 1),
>   ("bar", 2)
> )
> val df = spark.createDataFrame(rows).toDF("a.b", "id")
> {code}
> These methods are all consistent:
> {code}
> df.select("a.b") // fails
> df.select("`a.b`") // succeeds
> df.select(col("a.b")) // fails
> df.select(col("`a.b`")) // succeeds
> df("a.b") // fails
> df("`a.b`") // succeeds
> {code}
> But {{schema}} is inconsistent:
> {code}
> df.schema("a.b") // succeeds
> df.schema("`a.b`") // fails
> {code}
> "fails" produces error messages like:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`a.b`' given input 
> columns: [a.b, id];;
> 'Project ['a.b]
> +- Project [_1#1511 AS a.b#1516, _2#1512 AS id#1517]
>+- LocalRelation [_1#1511, _2#1512]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:77)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:310)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:309)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:282)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:292)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:296)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:296)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:301)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:74)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:128)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:48)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>   at 
> org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2822)
>   at org.apache.spark.sql.Dataset.select(Dataset.scala:1121)
>   at 

[jira] [Comment Edited] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase

2017-02-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859398#comment-15859398
 ] 

Nattavut Sutyanyong edited comment on SPARK-18874 at 2/9/17 11:48 AM:
--

Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have 
attached the revised version in PDF format to this JIRA. [~rxin]: FYI.

We will be submitting a PR for the code by Tuesday February 14.

P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 
pending test PRs waiting for review. We hope to have all the test PRs merged 
before the code. This way when the code is merged, we can verify by exercising 
the new code against those test cases.


was (Author: nsyca):
Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have 
attached the revised version in PDF format to this JIRA. @rxin: FYI.

We will be submitting a PR for the code by Tuesday February 14.

P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 
pending test PRs waiting for review. We hope to have all the test PRs merged 
before the code. This way when the code is merged, we can verify by exercising 
the new code against those test cases.

> First phase: Deferring the correlated predicate pull up to Optimizer phase
> --
>
> Key: SPARK-18874
> URL: https://issues.apache.org/jira/browse/SPARK-18874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18874-3.pdf
>
>
> This JIRA implements the first phase of SPARK-18455 by deferring the 
> correlated predicate pull-up from Analyzer to Optimizer. The goal is to 
> preserve the current subquery functionality of Spark 2.0 (if it works, it 
> continues to work after this JIRA; if it does not, it won't). The performance 
> of subquery processing is expected to be on par with Spark 2.0.
> The representation of the LogicalPlan after Analyzer will be different after 
> this JIRA in that it will preserve the original positions of correlated 
> predicates in a subquery. This new representation is preparation work for 
> the second phase of extending correlated subquery support to cases 
> Spark 2.0 does not handle, such as deep correlation and outer references in 
> the SELECT clause.
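
For a concrete picture of what is being preserved versus extended, a small 
hypothetical example (tables t1 and t2 are made up for illustration; assuming 
Spark 2.1 in spark-shell):
{code}
// Hypothetical tables, only to give the query a shape; not part of this JIRA.
Seq((1, 10), (2, 20)).toDF("a", "b").createOrReplaceTempView("t1")
Seq((1, 100), (1, 200), (2, 300)).toDF("a", "c").createOrReplaceTempView("t2")

// A correlated scalar subquery: t1.a is an outer reference inside the subquery's
// WHERE clause -- a shape Spark 2.0 already handles and that must keep working.
// An outer reference in the subquery's own SELECT list, or correlation nested
// more than one level deep, is the kind of case targeted by the second phase.
spark.sql("""
  SELECT a, b,
         (SELECT max(c) FROM t2 WHERE t2.a = t1.a) AS max_c
  FROM t1
""").show()
{code}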



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase

2017-02-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859398#comment-15859398
 ] 

Nattavut Sutyanyong edited comment on SPARK-18874 at 2/9/17 11:49 AM:
--

Thank you, [~hvanhovell], [~dkbiswal], for reviewing the design document. I 
have attached the revised version in PDF format to this JIRA. [~rxin]: FYI.

We will be submitting a PR for the code by Tuesday February 14.

P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 
pending test PRs waiting for review. We hope to have all the test PRs merged 
before the code. This way when the code is merged, we can verify by exercising 
the new code against those test cases.


was (Author: nsyca):
Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have 
attached the revised version in PDF format to this JIRA. [~rxin]: FYI.

We will be submitting a PR for the code by Tuesday February 14.

P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 
pending test PRs waiting for review. We hope to have all the test PRs merged 
before the code. This way when the code is merged, we can verify by exercising 
the new code against those test cases.

> First phase: Deferring the correlated predicate pull up to Optimizer phase
> --
>
> Key: SPARK-18874
> URL: https://issues.apache.org/jira/browse/SPARK-18874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18874-3.pdf
>
>
> This JIRA implements the first phase of SPARK-18455 by deferring the 
> correlated predicate pull-up from Analyzer to Optimizer. The goal is to 
> preserve the current subquery functionality of Spark 2.0 (if it works, it 
> continues to work after this JIRA; if it does not, it won't). The performance 
> of subquery processing is expected to be on par with Spark 2.0.
> The representation of the LogicalPlan after Analyzer will be different after 
> this JIRA in that it will preserve the original positions of correlated 
> predicates in a subquery. This new representation is preparation work for 
> the second phase of extending correlated subquery support to cases 
> Spark 2.0 does not handle, such as deep correlation and outer references in 
> the SELECT clause.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18874) First phase: Deferring the correlated predicate pull up to Optimizer phase

2017-02-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859398#comment-15859398
 ] 

Nattavut Sutyanyong commented on SPARK-18874:
-

Thank you, [~hvanhovell][~dkbiswal], for reviewing the design document. I have 
attached the revised version in PDF format to this JIRA. @rxin: FYI.

We will be submitting a PR for the code by Tuesday February 14.

P.S. We will submit the last test PR today (Feb 9). As of now, there are 4 
pending test PRs waiting for review. We hope to have all the test PRs merged 
before the code. This way when the code is merged, we can verify by exercising 
the new code against those test cases.

> First phase: Deferring the correlated predicate pull up to Optimizer phase
> --
>
> Key: SPARK-18874
> URL: https://issues.apache.org/jira/browse/SPARK-18874
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nattavut Sutyanyong
> Attachments: SPARK-18874-3.pdf
>
>
> This JIRA implements the first phase of SPARK-18455 by deferring the 
> correlated predicate pull-up from Analyzer to Optimizer. The goal is to 
> preserve the current subquery functionality of Spark 2.0 (if it works, it 
> continues to work after this JIRA; if it does not, it won't). The performance 
> of subquery processing is expected to be on par with Spark 2.0.
> The representation of the LogicalPlan after Analyzer will be different after 
> this JIRA in that it will preserve the original positions of correlated 
> predicates in a subquery. This new representation is preparation work for 
> the second phase of extending correlated subquery support to cases 
> Spark 2.0 does not handle, such as deep correlation and outer references in 
> the SELECT clause.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19493) Remove Java 7 support

2017-02-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859369#comment-15859369
 ] 

Apache Spark commented on SPARK-19493:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/16871

> Remove Java 7 support
> -
>
> Key: SPARK-19493
> URL: https://issues.apache.org/jira/browse/SPARK-19493
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to 
> officially remove Java 7 support in 2.2 or 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19493) Remove Java 7 support

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19493:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove Java 7 support
> -
>
> Key: SPARK-19493
> URL: https://issues.apache.org/jira/browse/SPARK-19493
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to 
> officially remove Java 7 support in 2.2 or 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19493) Remove Java 7 support

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19493:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove Java 7 support
> -
>
> Key: SPARK-19493
> URL: https://issues.apache.org/jira/browse/SPARK-19493
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Spark deprecated Java 7 support in 2.0, and the goal of the ticket is to 
> officially remove Java 7 support in 2.2 or 2.3.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory

2017-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859342#comment-15859342
 ] 

Sean Owen commented on SPARK-19523:
---

Not sure what this means -- there should be one HiveContext per driver program, and 
you use it wherever it's needed, including inside foreachRDD calls. If you are 
getting that error, you are creating not one but many.
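
A stripped-down sketch of that pattern in the Spark 2.x API (SparkSession spelling; 
the input directory and batch interval are made up, and the JSON-to-columns mapping 
from the report is omitted). SparkSession.builder().getOrCreate() inside foreachRDD 
returns the already-created session instead of building a new context per batch:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One session with Hive support for the whole driver program.
val spark = SparkSession.builder().appName("client-log").enableHiveSupport().getOrCreate()
val ssc = new StreamingContext(spark.sparkContext, Seconds(30))

// Hypothetical input directory standing in for the JSON file stream in the report.
val jsonLines = ssc.textFileStream("/user/egor/test2")

jsonLines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Re-fetch the existing session instead of constructing a HiveContext per batch.
    val session = SparkSession.builder().getOrCreate()
    session.read.json(rdd)
      .write
      .mode("append")
      .insertInto("temp.x_streaming")
  }
}

ssc.start()
ssc.awaitTermination()
{code}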

> Spark streaming+ insert into table leaves bunch of trash in table directory
> ---
>
> Key: SPARK-19523
> URL: https://issues.apache.org/jira/browse/SPARK-19523
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>Priority: Minor
>
> I have very simple code, which transforms incoming JSON files into a Parquet table:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.sql.SaveMode
> object Client_log {
>   def main(args: Array[String]): Unit = {
> val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * 
> from temp.x_streaming where year=2015 and month=12 and day=1").dtypes
> var columns = resultCols.filter(x => 
> !Commons.stopColumns.contains(x._1)).map({ case (name, types) => {
>   s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as 
> ${Commons.mapType(types)}) as $name"""
> }
> })
> columns ++= List("'streaming' as sourcefrom")
> def f(path:Path): Boolean = {
>   true
> }
> val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, 
> TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false)
> client_log_d_stream.foreachRDD(rdd => {
>   val localHiveContext = new HiveContext(rdd.sparkContext)
>   import localHiveContext.implicits._
>   var input = rdd.map(x => Record(x._2.toString)).toDF()
>   input = input.selectExpr(columns: _*)
>   input =
> SmallOperators.populate(input, resultCols)
>   input
> .write
> .mode(SaveMode.Append)
> .format("parquet")
> .insertInto("temp.x_streaming")
> })
> Spark.ssc.start()
> Spark.ssc.awaitTermination()
>   }
>   case class Record(s: String)
> }
> {code}
> This code generates a lot of trash directories in the result table, like:
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19524) newFilesOnly does not work according to docs.

2017-02-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15859303#comment-15859303
 ] 

Sean Owen commented on SPARK-19524:
---

Well, that is its definition of 'new' -- what do you expect instead? 
initialModTimeIgnoreThreshold controls some slack in the time it considers 
'old'.
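
For reference, a sketch of the call under discussion (hypothetical directory and 
batch interval; Spark 2.x streaming API). Per the comment above, newFilesOnly = 
false only reaches back as far as the internal modification-time threshold; it is 
not an unconditional replay of everything in the directory:
{code}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("file-stream-demo"), Seconds(30))

// newFilesOnly = false also picks up files already present in the directory,
// but only back to FileInputDStream's modification-time threshold.
val files = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "/user/egor/test2",          // hypothetical input directory
  (_: Path) => true,           // accept every file name
  newFilesOnly = false)

files.map(_._2.toString).print()
ssc.start()
ssc.awaitTermination()
{code}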

> newFilesOnly does not work according to docs. 
> --
>
> Key: SPARK-19524
> URL: https://issues.apache.org/jira/browse/SPARK-19524
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> The docs say:
> newFilesOnly
> Should process only new files and ignore existing files in the directory
> It's not working. 
> http://stackoverflow.com/questions/29852249/how-spark-streaming-identifies-new-files
>  says that it doesn't work the way one would expect, and 
> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala
>  is not at all clear about what the code is trying to do.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19496) to_date with format has weird behavior

2017-02-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19496:


Assignee: (was: Apache Spark)

> to_date with format has weird behavior
> --
>
> Key: SPARK-19496
> URL: https://issues.apache.org/jira/browse/SPARK-19496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Wenchen Fan
>
> Today, running
> {code}
> SELECT to_date('2015-07-22', 'yyyy-dd-MM')
> {code}
> results in `2016-10-07`, while running
> {code}
> SELECT to_date('2014-31-12')   -- default format
> {code}
> returns null.
> This behavior is weird and we should check other systems like Hive to see if 
> this is expected.
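
The two statements side by side in spark-shell, as described above; the rollover 
arithmetic is the likely explanation for the surprising first result:
{code}
// 'yyyy-dd-MM' reads 22 as the month, and the lenient parser rolls it forward
// (month 22 of 2015 = October 2016), hence 2016-10-07 instead of an error or null.
spark.sql("SELECT to_date('2015-07-22', 'yyyy-dd-MM')").show()

// The default-format path returns null for the out-of-range value instead.
spark.sql("SELECT to_date('2014-31-12')").show()
{code}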



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


