[jira] [Commented] (SPARK-15615) Support for creating a dataframe from JSON in Dataset[String]

2017-02-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884558#comment-15884558
 ] 

Apache Spark commented on SPARK-15615:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17071

> Support for creating a dataframe from JSON in Dataset[String] 
> --
>
> Key: SPARK-15615
> URL: https://issues.apache.org/jira/browse/SPARK-15615
> Project: Spark
>  Issue Type: Bug
>Reporter: PJ Fanning
>Assignee: PJ Fanning
> Fix For: 2.2.0
>
>
> We should deprecate the json(rdd: RDD[String]) method in DataFrameReader and 
> support json(ds: Dataset[String]) instead.
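
A minimal sketch of the proposed usage, assuming a SparkSession named {{spark}} (illustrative only; the exact API surface is whatever the PR settles on):

{code}
import spark.implicits._

// today: spark.read.json(rdd: RDD[String])  -- to be deprecated
// proposed: feed the JSON lines in as a Dataset[String]
val jsonLines = Seq("""{"name":"a","age":1}""", """{"name":"b","age":2}""").toDS()
val df = spark.read.json(jsonLines)
df.show()
{code}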






[jira] [Updated] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2017-02-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14772:
--
Fix Version/s: 2.2.0

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
> Fix For: 2.1.1, 2.2.0
>
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.
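
For reference, a hedged sketch of the Scala-side semantics the Python copy should match (class names from org.apache.spark.ml; the assertions are illustrative):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()
val copied = lr.copy(ParamMap.empty)

assert(copied.uid == lr.uid)            // the uid is carried over by copy
assert(!copied.isSet(copied.maxIter))   // maxIter stays a *default*, not an explicitly set param
{code}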






[jira] [Resolved] (SPARK-14772) Python ML Params.copy treats uid, paramMaps differently than Scala

2017-02-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14772.
---
   Resolution: Fixed
Fix Version/s: (was: 2.2.0)
   2.1.1

Issue resolved by pull request 17048
[https://github.com/apache/spark/pull/17048]

> Python ML Params.copy treats uid, paramMaps differently than Scala
> --
>
> Key: SPARK-14772
> URL: https://issues.apache.org/jira/browse/SPARK-14772
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Bryan Cutler
> Fix For: 2.1.1
>
>
> In PySpark, {{ml.param.Params.copy}} does not quite match the Scala 
> implementation:
> * It does not copy the UID
> * It does not respect the difference between defaultParamMap and paramMap.  
> This is an issue with {{_copyValues}}.






[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884487#comment-15884487
 ] 

Ji Yan commented on SPARK-19740:


The problem is that, when running Spark on Mesos, there is no way to run the 
Spark executor as a non-root user.

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with the Docker containerizer, the Spark executors 
> are always launched via the 'docker run' command without the --user option, 
> which results in the executors running as root. Mesos has a way to pass 
> arbitrary parameters to Docker; Spark could use that to expose setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816
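
A hedged sketch of what exposing the user through Mesos' arbitrary-parameter support could look like; the property name below is illustrative rather than a committed API:

{code}
import org.apache.spark.SparkConf

// hypothetical: forward "--user 1000" to 'docker run' via a docker-parameters style config
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.parameters", "user=1000")
{code}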






[jira] [Commented] (SPARK-14393) values generated by non-deterministic functions shouldn't change after coalesce or union

2017-02-25 Thread Everett Anderson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884486#comment-15884486
 ] 

Everett Anderson commented on SPARK-14393:
--

Appears this also happens in 2.0.2.

Thanks for fixing this!

I disagree that the monotonically_increasing_id function should ever be allowed 
to create duplicate values in a table, given that its documentation says "The 
generated ID is guaranteed to be monotonically increasing and unique, but not 
consecutive."

We were certainly relying on it to produce unique values!

> values generated by non-deterministic functions shouldn't change after 
> coalesce or union
> 
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.0.1
>Reporter: Jason Piper
>Assignee: Xiangrui Meng
>Priority: Blocker
>  Labels: correctness, releasenotes
> Fix For: 2.1.0
>
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                25769803776|
> |                51539607552|
> |                77309411328|
> |               103079215104|
> |               128849018880|
> |               163208757248|
> |               188978561024|
> |               214748364800|
> |               240518168576|
> |               266287972352|
> +---------------------------+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> |                          0|
> +---------------------------+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---------------------------+
> |monotonicallyincreasingid()|
> +---------------------------+
> |                          0|
> |                          1|
> |                          0|
> |                          0|
> |                          1|
> |                          2|
> |                          3|
> |                          0|
> |                          1|
> |                          2|
> +---------------------------+
> {code}
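
For context, the generated id is (roughly) the partition id shifted into the upper bits plus a per-partition row counter in the lower 33 bits, which is why re-evaluating the expression after a coalesce can reuse offsets. A small sketch of that layout, inferred from the observed values above:

{code}
// id ≈ (partitionId.toLong << 33) + rowIndexWithinPartition
val firstIdOfPartition3 = 3L << 33   // 25769803776, matching the first row in the table above
{code}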






[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884447#comment-15884447
 ] 

Sunitha Kambhampati commented on SPARK-19602:
-

I have made the changes to support three-part column names. To aid the review 
and to reduce the diff, the test scenarios are separated out into PR 17067.

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 
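
A minimal sketch of the alias workaround mentioned above (illustrative table names):

{code}
// instead of: select db1.t1.i1 from db1.t1, db2.t1
spark.sql("select a.i1 from db1.t1 a, db2.t1 b")
{code}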






[jira] [Assigned] (SPARK-19709) CSV datasource fails to read empty file

2017-02-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19709:


Assignee: Apache Spark

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> I just did {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.






[jira] [Assigned] (SPARK-19709) CSV datasource fails to read empty file

2017-02-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19709:


Assignee: (was: Apache Spark)

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> I just did {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.






[jira] [Commented] (SPARK-19709) CSV datasource fails to read empty file

2017-02-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884446#comment-15884446
 ] 

Apache Spark commented on SPARK-19709:
--

User 'wojtek-szymanski' has created a pull request for this issue:
https://github.com/apache/spark/pull/17068

> CSV datasource fails to read empty file
> ---
>
> Key: SPARK-19709
> URL: https://issues.apache.org/jira/browse/SPARK-19709
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> I just did {{touch a}} and then ran the code below:
> {code}
> scala> spark.read.csv("a")
> java.util.NoSuchElementException: next on empty iterator
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
>   at scala.collection.Iterator$$anon$2.next(Iterator.scala:37)
>   at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.
> {code}
> It seems we should produce an empty dataframe consistently with 
> `spark.read.json("a")`.






[jira] [Assigned] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19602:


Assignee: (was: Apache Spark)

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884443#comment-15884443
 ] 

Apache Spark commented on SPARK-19602:
--

User 'skambha' has created a pull request for this issue:
https://github.com/apache/spark/pull/17067

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Assigned] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19602:


Assignee: Apache Spark

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
>Assignee: Apache Spark
> Attachments: Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: (was: Design_ColResolution_JIRA19602.docx)

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: Design_ColResolution_JIRA19602.pdf

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.docx, 
> Design_ColResolution_JIRA19602.pdf
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: Design_ColResolution_JIRA19602.docx

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.docx
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db>.<table>.<column>)

2017-02-25 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: (was: Design_ColResolution_JIRA19602.docx)

> Unable to query using the fully qualified column name of the form 
> (<db>.<table>.<column>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
> Most other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema>.<table>.<column>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in the example only 
> db2.t1 has an i1 column. The query would then still succeed instead of throwing 
> an error.  
> One way to avoid confusion would be to explicitly specify the fully qualified 
> name db1.t1.i1, e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 






[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884432#comment-15884432
 ] 

Sean Owen commented on SPARK-19740:
---

What is the bug or Spark problem here?
(We use pull requests, not links to branches)

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with the Docker containerizer, the Spark executors 
> are always launched via the 'docker run' command without the --user option, 
> which results in the executors running as root. Mesos has a way to pass 
> arbitrary parameters to Docker; Spark could use that to expose setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816






[jira] [Commented] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884426#comment-15884426
 ] 

Ji Yan commented on SPARK-19740:


proposed change: 
https://github.com/yanji84/spark/commit/4f8368ea727e5689e96794884b8d1baf3eccb5d5

> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with the Docker containerizer, the Spark executors 
> are always launched via the 'docker run' command without the --user option, 
> which results in the executors running as root. Mesos has a way to pass 
> arbitrary parameters to Docker; Spark could use that to expose setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816






[jira] [Updated] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ji Yan updated SPARK-19740:
---
Description: 
When running Spark on Mesos with the Docker containerizer, the Spark executors are 
always launched via the 'docker run' command without the --user option, which 
results in the executors running as root. Mesos has a way to pass arbitrary 
parameters to Docker; Spark could use that to expose setting the user.

Background on Mesos arbitrary parameter support: 
https://issues.apache.org/jira/browse/MESOS-1816


  was:When running Spark on Mesos with docker containerizer, the spark 
executors are always launched with 'docker run' command without specifying 
--user option, which always results in spark executors running as root. Mesos 
has a way to support arbitrary parameters. Spark could use that to expose 
setting user


> Spark executor always runs as root when running on mesos
> 
>
> Key: SPARK-19740
> URL: https://issues.apache.org/jira/browse/SPARK-19740
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.1.0
>Reporter: Ji Yan
>
> When running Spark on Mesos with the Docker containerizer, the Spark executors 
> are always launched via the 'docker run' command without the --user option, 
> which results in the executors running as root. Mesos has a way to pass 
> arbitrary parameters to Docker; Spark could use that to expose setting the user.
> Background on Mesos arbitrary parameter support: 
> https://issues.apache.org/jira/browse/MESOS-1816






[jira] [Created] (SPARK-19740) Spark executor always runs as root when running on mesos

2017-02-25 Thread Ji Yan (JIRA)
Ji Yan created SPARK-19740:
--

 Summary: Spark executor always runs as root when running on mesos
 Key: SPARK-19740
 URL: https://issues.apache.org/jira/browse/SPARK-19740
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.1.0
Reporter: Ji Yan


When running Spark on Mesos with the Docker containerizer, the Spark executors are 
always launched via the 'docker run' command without the --user option, which 
results in the executors running as root. Mesos has a way to pass arbitrary 
parameters to Docker; Spark could use that to expose setting the user.






[jira] [Resolved] (SPARK-15288) Mesos dispatcher should handle gracefully when any thread gets UncaughtException

2017-02-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15288.
---
   Resolution: Fixed
 Assignee: Devaraj K
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/13072

> Mesos dispatcher should handle gracefully when any thread gets 
> UncaughtException
> 
>
> Key: SPARK-15288
> URL: https://issues.apache.org/jira/browse/SPARK-15288
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Mesos
>Reporter: Devaraj K
>Assignee: Devaraj K
>Priority: Minor
> Fix For: 2.2.0
>
>
> If any thread of the Mesos dispatcher hits an uncaught exception, that thread 
> terminates and the dispatcher process keeps running without functioning 
> properly. 
> I think we need to handle the uncaught exception and shut down the Mesos 
> dispatcher.
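
A hedged sketch of the kind of handling the description suggests (illustrative only, not the actual dispatcher code):

{code}
// install a default handler so an uncaught exception in any thread shuts the
// dispatcher down instead of leaving it half-alive
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    System.err.println(s"Uncaught exception in thread ${t.getName}, shutting down dispatcher")
    e.printStackTrace()
    System.exit(1)
  }
})
{code}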






[jira] [Resolved] (SPARK-19673) ThriftServer default app name is changed wrong

2017-02-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19673.
---
   Resolution: Fixed
Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/17010

> ThriftServer default app name is changed wrong
> --
>
> Key: SPARK-19673
> URL: https://issues.apache.org/jira/browse/SPARK-19673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: LvDongrong
>Assignee: LvDongrong
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In Spark 1.x, the default name of the ThriftServer is SparkSQL:localHostName, 
> while in Spark 2.x the default name is changed to the class name of 
> HiveThriftServer2 (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2), 
> which is not appropriate.






[jira] [Assigned] (SPARK-19673) ThriftServer default app name is changed wrong

2017-02-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-19673:
-

Assignee: LvDongrong
Priority: Trivial  (was: Major)

> ThriftServer default app name is changed wrong
> --
>
> Key: SPARK-19673
> URL: https://issues.apache.org/jira/browse/SPARK-19673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: LvDongrong
>Assignee: LvDongrong
>Priority: Trivial
> Fix For: 2.2.0
>
>
> In Spark 1.x, the default name of the ThriftServer is SparkSQL:localHostName, 
> while in Spark 2.x the default name is changed to the class name of 
> HiveThriftServer2 (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2), 
> which is not appropriate.






[jira] [Created] (SPARK-19739) SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to propagate full set of AWS env vars

2017-02-25 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-19739:
--

 Summary: SparkHadoopUtil.appendS3AndSparkHadoopConfigurations to 
propagate full set of AWS env vars
 Key: SPARK-19739
 URL: https://issues.apache.org/jira/browse/SPARK-19739
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.0
Reporter: Steve Loughran
Priority: Minor


{{SparkHadoopUtil.appendS3AndSparkHadoopConfigurations()}} propagates the AWS 
access key and secret key to the s3n and s3a config options, getting the secrets 
from the user to the cluster, if set.

AWS also supports session authentication (env var {{AWS_SESSION_TOKEN}}) and 
region endpoints ({{AWS_DEFAULT_REGION}}), the latter being critical if you want 
to address V4-auth-only endpoints like Frankfurt and Seoul. 

These env vars should be picked up and passed down to s3a too. It is only a few 
lines of code, though impossible to test unless the existing code is refactored 
to take the env vars as a Map[String, String], allowing a test suite to set the 
values in its own map.

Side issue: what if only half the env vars are set and users are trying to 
understand why auth is failing? It may be good to build up a string identifying 
which env vars had their value propagated, and log that at debug level, while not 
logging the values themselves, obviously.
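
A hedged sketch of the propagation, assuming a Hadoop Configuration named hadoopConf and the usual s3a property names (verify against the s3a documentation):

{code}
// illustrative: pick up the extra AWS env vars and forward them to s3a
sys.env.get("AWS_SESSION_TOKEN").foreach(hadoopConf.set("fs.s3a.session.token", _))
sys.env.get("AWS_DEFAULT_REGION").foreach { region =>
  // s3a takes an endpoint rather than a bare region name
  hadoopConf.set("fs.s3a.endpoint", s"s3.$region.amazonaws.com")
}
{code}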






[jira] [Commented] (SPARK-19725) different parquet dependency in spark2.0.x and Hive2.x cause failure of HoS when using parquet file format

2017-02-25 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884294#comment-15884294
 ] 

Sean Owen commented on SPARK-19725:
---

We wouldn't update this dependency in a maintenance release of 2.0.x or 2.1.x.

> different parquet dependency in spark2.0.x and Hive2.x cause failure of HoS 
> when using parquet file format
> --
>
> Key: SPARK-19725
> URL: https://issues.apache.org/jira/browse/SPARK-19725
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hive2.2
> hadoop2.7.1
>Reporter: KaiXu
>
> the parquet version in Hive 2.x is 1.8.1 while in Spark 2.0.x it is 1.7.0, so 
> running HoS (Hive on Spark) queries with the parquet file format hits jar 
> conflict problems:
> Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd
> Job failed with java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> FAILED: Execution Error, return code 3 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask. 
> java.util.concurrent.ExecutionException: Exception thrown by job
> at 
> org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
> at 
> org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in 
> stage 1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing 
> row: java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
> at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
> at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
> at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39)
> at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115)
> at 
> org.apache.hadoop.hive.ql.io.HiveFileForma

[jira] [Updated] (SPARK-19733) ALS performs unnecessary casting on item and user ids

2017-02-25 Thread Vasilis Vryniotis (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vasilis Vryniotis updated SPARK-19733:
--
Affects Version/s: (was: 1.6.3)

> ALS performs unnecessary casting on item and user ids
> -
>
> Key: SPARK-19733
> URL: https://issues.apache.org/jira/browse/SPARK-19733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Vasilis Vryniotis
>
> ALS performs unnecessary casting of the user and item ids (to double). I 
> believe this is because the protected checkedCast() method requires a double 
> input. This can be avoided by refactoring the checkedCast method.
> Issue resolved by pull request 17059:
> https://github.com/apache/spark/pull/17059
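
A hedged sketch of the kind of refactor described (illustrative, not the code from the pull request):

{code}
// validate the id without first casting the whole column to double
def checkedCast(n: Any): Int = n match {
  case i: Int => i                               // common case: no cast at all
  case l: Long if l.isValidInt => l.toInt
  case d: Double if d == d.toInt => d.toInt      // whole-valued double
  case other =>
    throw new IllegalArgumentException(s"ALS id $other does not fit in an Int")
}
{code}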






[jira] [Resolved] (SPARK-14222) Cross-publish jackson-module-scala for Scala 2.12

2017-02-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14222.
---
Resolution: Done

jackson-module-scala is cross-published for 2.12 from version 2.7.9 and 2.8.+ 
onwards, which I think may be sufficient to call this part done.

> Cross-publish jackson-module-scala for Scala 2.12
> -
>
> Key: SPARK-14222
> URL: https://issues.apache.org/jira/browse/SPARK-14222
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In order to build Spark against Scala 2.12, we need to either remove our 
> jackson-module-scala dependency or cross-publish Jackson for Scala 2.12. 
> Personally, I'd prefer to remove it because I don't think we make extensive 
> use of it and because I'm not a huge fan of the implicit mapping between case 
> classes and JSON wire formats (the extra verbosity required by other 
> approaches is a feature, IMO, rather than a bug because it makes it much 
> harder to accidentally break wire compatibility).






[jira] [Resolved] (SPARK-14221) Cross-publish Chill for Scala 2.12

2017-02-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-14221.
---
Resolution: Done

Chill 0.8.2 and onwards are cross-published for 2.12, so I think that resolves 
this, as we're on 0.8.0 already.

> Cross-publish Chill for Scala 2.12
> --
>
> Key: SPARK-14221
> URL: https://issues.apache.org/jira/browse/SPARK-14221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We need to cross-publish Chill in order to build against Scala 2.12.
> Upstream issue: https://github.com/twitter/chill/issues/252
> I tried building and testing {{chill-scala}} against 2.12.0-M3 and ran into 
> multiple failed tests due to issues with Java 8 lambda serialization (similar 
> to https://github.com/EsotericSoftware/kryo/issues/215), so this task will be 
> slightly more involved than just bumping the dependencies in the Chill build.






[jira] [Created] (SPARK-19738) Consider adding error handler to DataStreamWriter

2017-02-25 Thread Jayesh lalwani (JIRA)
Jayesh lalwani created SPARK-19738:
--

 Summary: Consider adding error handler to DataStreamWriter
 Key: SPARK-19738
 URL: https://issues.apache.org/jira/browse/SPARK-19738
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Jayesh lalwani


For Structured Streaming implementations, it is important that applications stay 
always on. However, right now, errors stop the driver. In some cases, this is not 
desirable behavior. For example, I have the following application:

{code}
import org.apache.spark.sql.types._
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = 
spark.readStream.schema(userSchema).csv("s3://bucket/jayesh/streamingerror/")
csvDF.writeStream.format("console").start()
{code}

I send the following input to it 

{quote}
1,Iron man
2,SUperman
{quote}

Obviously, the data is bad. This causes the executor to throw an exception that 
propagates to the driver, which promptly shuts down. The driver is running in 
supervised mode, and it gets restarted. The application reads the same bad input 
and shuts down again. This goes on ad infinitum. This behavior is desirable in 
cases where the error is recoverable. For example, if the executor cannot talk to 
the database, we want the application to keep trying the same input again and 
again until the database recovers. However, in some cases this behavior is 
undesirable. We do not want this to happen when the input is bad. We want to put 
the bad record in some sort of dead letter queue. Or maybe we want to kill the 
driver only when the number of errors has crossed a certain threshold. Or maybe 
we want to email someone.

Proposal:

Add an error handler to the data stream. When the executor fails, it should call 
the error handler and pass it the exception. The error handler could swallow the 
exception, transform it, or update counts in an accumulator, etc.

 {code}
import org.apache.spark.sql.types._
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = 
spark.readStream.schema(userSchema).csv("s3://bucket/jayesh/streamingerror/")
csvDF.writeStream.format("console").errorhandler("com.jayesh.ErrorHandler").start()
{code}
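
A hedged sketch of what such a handler contract might look like; the trait and method names below are made up for illustration and are not an existing Spark API:

{code}
trait StreamingErrorHandler extends Serializable {
  /** Called when processing fails; return true to swallow the error and continue,
    * false to fail the query as happens today. */
  def onError(e: Throwable): Boolean
}

class DeadLetterHandler extends StreamingErrorHandler {
  override def onError(e: Throwable): Boolean = {
    // e.g. push the offending record/exception to a dead letter queue, bump a counter, ...
    true
  }
}
{code}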






[jira] [Comment Edited] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884230#comment-15884230
 ] 

Steve Loughran edited comment on SPARK-19715 at 2/25/17 1:24 PM:
-

OK. I'd recommend going with {{Path.getURI.getPath()}} to get the full path, 
though there's always the risk of >1 s3a bucket referring to the same objects.

Some filesystems (HDFS) have checksums you can ask for, though S3a doesn't, 
yet: HADOOP-13282 has discussed serving up etags, primarily to aid distcp 
updates. If added, you could use that as the differentiator, or at least to 
identify changed files. Patches welcome, [with 
tests|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/testing.md]

To be ruthless, it may have been simpler for the user just to edit the 
fs.s3n.impl binding to point to S3AFileSystem.class & then left the URLs the 
same


was (Author: ste...@apache.org):
OK. I'd recommend going twith Path.getURI.getPath() to get the full path, 
though there's the always the risk of >1 s3a bucket referring to the same 
objects

Some filesystems (HDFS, file:) have checksums you can ask for, though S3a 
doesn't, yet: HADOOP-13282 has discussed serving up etags, primarily to aid 
distcp updates. If added, you could use that as the differentiator, or at least 
to identify changed files

To be ruthless, it may have been simpler for the user just to edit the 
fs.s3n.impl binding to point to S3AFileSystem.class & then left the URLs the 
same

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false negatives 
> in the case where the path has changed in a cosmetic way (e.g. changing s3n to 
> s3a).  We should add an option {{fileNameOnly}} that causes the new-file check 
> to be based only on the filename (but still store the whole path in the log).
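
A hedged sketch of the proposed comparison (illustrative; assumes Hadoop's Path):

{code}
import org.apache.hadoop.fs.Path

// with fileNameOnly=true, "seen before?" keys on the file name rather than the full URI,
// so s3n://bucket/a/b/part-00000 and s3a://bucket/a/b/part-00000 count as the same file
def seenKey(path: String, fileNameOnly: Boolean): String =
  if (fileNameOnly) new Path(path).getName else path
{code}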






[jira] [Commented] (SPARK-19715) Option to Strip Paths in FileSource

2017-02-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884230#comment-15884230
 ] 

Steve Loughran commented on SPARK-19715:


OK. I'd recommend going with Path.getURI.getPath() to get the full path, though 
there's always the risk of >1 s3a bucket referring to the same objects.

Some filesystems (HDFS, file:) have checksums you can ask for, though S3a 
doesn't, yet: HADOOP-13282 has discussed serving up etags, primarily to aid 
distcp updates. If added, you could use that as the differentiator, or at least 
to identify changed files

To be ruthless, it may have been simpler for the user just to edit the 
fs.s3n.impl binding to point to S3AFileSystem.class & then left the URLs the 
same

> Option to Strip Paths in FileSource
> ---
>
> Key: SPARK-19715
> URL: https://issues.apache.org/jira/browse/SPARK-19715
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>
> Today, we compare the whole path when deciding if a file is new in the 
> FileSource for structured streaming.  However, this can cause false negatives 
> in the case where the path has changed in a cosmetic way (e.g. changing s3n to 
> s3a).  We should add an option {{fileNameOnly}} that causes the new-file check 
> to be based only on the filename (but still store the whole path in the log).






[jira] [Updated] (SPARK-19725) different parquet dependency in spark2.0.x and Hive2.x cause failure of HoS when using parquet file format

2017-02-25 Thread KaiXu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXu updated SPARK-19725:
--
Description: 
the parquet version in Hive 2.x is 1.8.1 while in Spark 2.0.x it is 1.7.0, so 
running HoS (Hive on Spark) queries with the parquet file format hits jar 
conflict problems:

Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd
Job failed with java.lang.NoSuchMethodError: 
org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
FAILED: Execution Error, return code 3 from 
org.apache.hadoop.hive.ql.exec.spark.SparkTask. 
java.util.concurrent.ExecutionException: Exception thrown by job
at 
org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
at org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362)
at 
org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 
1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing row: 
java.lang.NoSuchMethodError: 
org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
at 
org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
at 
org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at 
org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
at 
org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
at 
org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodError: 
org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
at 
org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100)
at 
org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56)
at 
org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50)
at 
org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39)
at 
org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115)
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:286)
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:271)
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketForFileIdx(FileSinkOperator.java:609)
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.createBucketFiles(FileSinkOperator.java:553)
at 
org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:664)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorFileSinkOperator.process(VectorFileSinkOperator.java:101)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorSelectOperator.process(VectorSelectOperator.java:137)

  was:
the parquet version in hive2.x is 

[jira] [Commented] (SPARK-19725) different parquet dependency in spark2.x and Hive2.x cause failure of HoS when using parquet file format

2017-02-25 Thread KaiXu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884208#comment-15884208
 ] 

KaiXu commented on SPARK-19725:
---

Hive supports Spark 2.x through HIVE-14029; Spark support for Hive 2 
(SPARK-13446) is on the way.
Yes, maybe I should change the title to spark2.0.x.

> different parquet dependency in spark2.x and Hive2.x cause failure of HoS 
> when using parquet file format
> 
>
> Key: SPARK-19725
> URL: https://issues.apache.org/jira/browse/SPARK-19725
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.2
> Environment: spark2.0.2
> hive2.2
> hadoop2.7.1
>Reporter: KaiXu
>
> the parquet version in Hive 2.x is 1.8.1 while in Spark 2.x it is 1.7.0, so 
> running HoS (Hive on Spark) queries with the parquet file format hits jar 
> conflict problems:
> Starting Spark Job = d1f6825c-48ea-45b8-9614-4266f2d1f0bd
> Job failed with java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> FAILED: Execution Error, return code 3 from 
> org.apache.hadoop.hive.ql.exec.spark.SparkTask. 
> java.util.concurrent.ExecutionException: Exception thrown by job
> at 
> org.apache.spark.JavaFutureActionWrapper.getImpl(FutureAction.scala:272)
> at 
> org.apache.spark.JavaFutureActionWrapper.get(FutureAction.scala:277)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:362)
> at 
> org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:323)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in 
> stage 1.0 (TID 9, hsx-node7): java.lang.RuntimeException: Error processing 
> row: java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> at 
> org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:48)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveMapFunctionResultList.processNextRecord(HiveMapFunctionResultList.java:27)
> at 
> org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList.hasNext(HiveBaseFunctionResultList.java:85)
> at 
> scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
> at 
> org.apache.spark.rdd.AsyncRDDActions$$anonfun$foreachAsync$1$$anonfun$apply$12.apply(AsyncRDDActions.scala:127)
> at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
> at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1976)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.parquet.schema.Types$PrimitiveBuilder.length(I)Lorg/apache/parquet/schema/Types$BasePrimitiveBuilder;
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:100)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertType(HiveSchemaConverter.java:56)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convertTypes(HiveSchemaConverter.java:50)
> at 
> org.apache.hadoop.hive.ql.io.parquet.convert.HiveSchemaConverter.convert(HiveSchemaConverter.java:39)
> at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat.getHiveRecordWriter(MapredParquetOutputFormat.java:115)
>