[jira] [Assigned] (SPARK-13769) Java Doc needs update in SparkSubmit.scala

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13769:


Assignee: Apache Spark

> Java Doc needs update in SparkSubmit.scala
> --
>
> Key: SPARK-13769
> URL: https://issues.apache.org/jira/browse/SPARK-13769
> Project: Spark
>  Issue Type: Bug
>Reporter: Ahmed Kamal
>Assignee: Apache Spark
>Priority: Minor
>
> The java doc here 
> (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
> needs to be updated from "The latter two operations are currently supported 
> only for standalone cluster mode." to "The latter two operations are 
> currently supported only for standalone and mesos cluster mode."






[jira] [Assigned] (SPARK-13769) Java Doc needs update in SparkSubmit.scala

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13769:


Assignee: (was: Apache Spark)

> Java Doc needs update in SparkSubmit.scala
> --
>
> Key: SPARK-13769
> URL: https://issues.apache.org/jira/browse/SPARK-13769
> Project: Spark
>  Issue Type: Bug
>Reporter: Ahmed Kamal
>Priority: Minor
>
> The java doc here 
> (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
> needs to be updated from "The latter two operations are currently supported 
> only for standalone cluster mode." to "The latter two operations are 
> currently supported only for standalone and mesos cluster mode."






[jira] [Commented] (SPARK-13769) Java Doc needs update in SparkSubmit.scala

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186695#comment-15186695
 ] 

Apache Spark commented on SPARK-13769:
--

User 'AhmedKamal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11600

> Java Doc needs update in SparkSubmit.scala
> --
>
> Key: SPARK-13769
> URL: https://issues.apache.org/jira/browse/SPARK-13769
> Project: Spark
>  Issue Type: Bug
>Reporter: Ahmed Kamal
>Priority: Minor
>
> The java doc here 
> (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
> needs to be updated from "The latter two operations are currently supported 
> only for standalone cluster mode." to "The latter two operations are 
> currently supported only for standalone and mesos cluster mode."






[jira] [Commented] (SPARK-13769) Java Doc needs update in SparkSubmit.scala

2016-03-08 Thread Ahmed Kamal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186689#comment-15186689
 ] 

Ahmed Kamal commented on SPARK-13769:
-

I have created a pull request to fix this issue

https://github.com/apache/spark/pull/11600

> Java Doc needs update in SparkSubmit.scala
> --
>
> Key: SPARK-13769
> URL: https://issues.apache.org/jira/browse/SPARK-13769
> Project: Spark
>  Issue Type: Bug
>Reporter: Ahmed Kamal
>Priority: Minor
>
> The java doc here 
> (https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
> needs to be updated from "The latter two operations are currently supported 
> only for standalone cluster mode." to "The latter two operations are 
> currently supported only for standalone and mesos cluster mode."






[jira] [Comment Edited] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186679#comment-15186679
 ] 

Xiao Li edited comment on SPARK-13393 at 3/9/16 7:38 AM:
-

Yeah, I agree. CC [~marmbrus][~rxin]. We need to make a decision about it. 2.0 
is a good time to make such a change, if desired


was (Author: smilegator):
Yeah, I agree. CC [~marmbrus][~rxin]. They need to make a decision about it. 
2.0 is a good time to make such a change, if desired

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Created] (SPARK-13770) Document the ML feature Interaction

2016-03-08 Thread Abbass Marouni (JIRA)
Abbass Marouni created SPARK-13770:
--

 Summary: Document the ML feature Interaction
 Key: SPARK-13770
 URL: https://issues.apache.org/jira/browse/SPARK-13770
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.6.0
Reporter: Abbass Marouni
Priority: Minor


The ML feature Interaction 
(http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/Interaction.html)
 is not included in the documentation of ML features. It'd be nice to provide a 
working example and some documentation.
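
For illustration, a minimal Scala sketch of what such a working example could look like (assuming a spark-shell style sqlContext and the setInputCols/setOutputCol API that Interaction exposes in 1.6; the column names and data here are made up):

{code}
import org.apache.spark.ml.feature.{Interaction, VectorAssembler}

// Hypothetical toy data: an id column plus two numeric features.
val df = sqlContext.createDataFrame(Seq(
  (1, 2.0, 3.0),
  (2, 4.0, 5.0)
)).toDF("id", "x", "y")

// Assemble the numeric columns into a vector, then interact it with "id".
val assembler = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("vec")

val interaction = new Interaction()
  .setInputCols(Array("id", "vec"))
  .setOutputCol("interacted")

interaction.transform(assembler.transform(df)).show()
{code}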






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186679#comment-15186679
 ] 

Xiao Li commented on SPARK-13393:
-

Yeah, I agree. CC [~marmbrus][~rxin]. They need to make a decision about it. 
2.0 is a good time to make such a change, if desired

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186674#comment-15186674
 ] 

Adrian Wang commented on SPARK-13393:
-

That's the case where we should throw an exception.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186673#comment-15186673
 ] 

Adrian Wang commented on SPARK-13393:
-

See my updated comment. That's not reasonable.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Comment Edited] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186660#comment-15186660
 ] 

Adrian Wang edited comment on SPARK-13393 at 3/9/16 7:31 AM:
-

How do you resolve it? Both sides are `df`, so we can resolve df("key") to 
single side, which leads to a Cartesian join (4 output rows); or we can resolve 
to both sides (2 output rows). We are not able to tell what the user meant to.
The current design would not throw any exception because we assume same cols in 
condition are from different sides, as I have declared. I don't think that's a 
decent way.


was (Author: adrian-wang):
How do you resolve it? Both sides are `df`, so we can resolve df("key") to 
single side, which leads to a Cartesian join (4 output rows); or we can resolve 
to both sides (2 output rows). We are not able to tell what the user meant to.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186665#comment-15186665
 ] 

Xiao Li commented on SPARK-13393:
-

Try it. You will see that it works.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186661#comment-15186661
 ] 

Adrian Wang commented on SPARK-13393:
-

How do you resolve it? Both sides are `df`, so we can resolve df("key") to 
single side, which leads to a Cartesian join (4 output rows); or we can resolve 
to both sides (2 output rows). We are not able to tell what the user meant to.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186660#comment-15186660
 ] 

Adrian Wang commented on SPARK-13393:
-

How do you resolve it? Both sides are `df`, so we can resolve df("key") to 
single side, which leads to a Cartesian join (4 output rows); or we can resolve 
to both sides (2 output rows). We are not able to tell what the user meant to.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Issue Comment Deleted] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang updated SPARK-13393:

Comment: was deleted

(was: How do you resolve it? Both sides are `df`, so we can resolve df("key") 
to single side, which leads to a Cartesian join (4 output rows); or we can 
resolve to both sides (2 output rows). We are not able to tell what the user 
meant to.)

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Comment Edited] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186655#comment-15186655
 ] 

Xiao Li edited comment on SPARK-13393 at 3/9/16 7:25 AM:
-

This is not right. We can resolve it. For example,  

{code}
val df = Seq((1, "1"), (2, "2")).toDF("key", "value")
df.join(df, df("key") === df("key"))
{code}



was (Author: smilegator):
This is not right. We can resolve it. For example,  

{code}
val df = Seq((1, "1"), (2, "2")).toDF("key", "value")
df.join(df, df("key") === df("key")
{code}


> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186655#comment-15186655
 ] 

Xiao Li commented on SPARK-13393:
-

This is not right. We can resolve it. For example,  

{code}
val df = Seq((1, "1"), (2, "2")).toDF("key", "value")
df.join(df, df("key") === df("key")
{code}


> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186652#comment-15186652
 ] 

Adrian Wang commented on SPARK-13393:
-

In your example, df1("name") and df2("name") are exactly the same, so it's easy 
to throw an explicit exception telling the user not to join two identical 
dataframes without an alias. We can do the same for this issue too.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Created] (SPARK-13769) Java Doc needs update in SparkSubmit.scala

2016-03-08 Thread Ahmed Kamal (JIRA)
Ahmed Kamal created SPARK-13769:
---

 Summary: Java Doc needs update in SparkSubmit.scala
 Key: SPARK-13769
 URL: https://issues.apache.org/jira/browse/SPARK-13769
 Project: Spark
  Issue Type: Bug
Reporter: Ahmed Kamal
Priority: Minor


The java doc here 
(https://github.com/apache/spark/blob/e97fc7f176f8bf501c9b3afd8410014e3b0e1602/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L51)
needs to be updated from "The latter two operations are currently supported 
only for standalone cluster mode." to "The latter two operations are currently 
supported only for standalone and mesos cluster mode."
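
For illustration only, a rough sketch of how the corrected Scaladoc could read (the surrounding comment wording is assumed here, not quoted from the linked file):

{code}
/**
 * Whether to submit, kill, or request the status of an application.
 * The latter two operations are currently supported only for standalone
 * and mesos cluster mode.
 */
{code}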






[jira] [Commented] (SPARK-13600) Use approxQuantile from DataFrame stats in QuantileDiscretizer

2016-03-08 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186648#comment-15186648
 ] 

Nick Pentreath commented on SPARK-13600:


Thanks, that's fine

> Use approxQuantile from DataFrame stats in QuantileDiscretizer
> --
>
> Key: SPARK-13600
> URL: https://issues.apache.org/jira/browse/SPARK-13600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Oliver Pierson
>Assignee: Oliver Pierson
>
> For consistency and code reuse, QuantileDiscretizer should use approxQuantile 
> to find splits in the data rather than implement its own method.
> Additionally, making this change should remedy a bug where 
> QuantileDiscretizer fails to calculate the correct splits in certain 
> circumstances, resulting in an incorrect number of buckets/bins.
> E.g.
> val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
> val discretizer = new 
> QuantileDiscretizer().setInputCol("x").setOutputCol("y").setNumBuckets(5)
> discretizer.fit(df).getSplits
> gives:
> Array(-Infinity, 2.0, 4.0, 6.0, 8.0, 10.0, Infinity)
> which corresponds to 6 buckets (not 5).
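
For reference, a minimal sketch of the DataFrame-stats call the description refers to (assuming spark-shell implicits, as in the snippet above; the probabilities and relative error are illustrative):

{code}
// Approximate interior split points for 5 buckets on column "x".
val df = sc.parallelize(1.0 to 10.0 by 1.0).map(Tuple1.apply).toDF("x")
val splits = df.stat.approxQuantile("x", Array(0.2, 0.4, 0.6, 0.8), 0.001)
// QuantileDiscretizer would then add -Infinity and +Infinity as the outer
// bounds, giving exactly 5 buckets.
{code}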






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186644#comment-15186644
 ] 

Xiao Li commented on SPARK-13393:
-

Fundamentally, they are the same issue, and we need a solution that resolves it 
at the root. To users, this error is very confusing.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186642#comment-15186642
 ] 

Adrian Wang commented on SPARK-13393:
-

This is another issue; here we are talking about `varadha` and `df`, which are 
obviously different dataframes. For exactly the same dataframe, I think 
aliasing is still necessary.

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Created] (SPARK-13768) Set hive conf failed use --hiveconf when beeline connect to thriftserver

2016-03-08 Thread Weizhong (JIRA)
Weizhong created SPARK-13768:


 Summary: Set hive conf failed use --hiveconf when beeline connect 
to thriftserver
 Key: SPARK-13768
 URL: https://issues.apache.org/jira/browse/SPARK-13768
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Weizhong
Priority: Minor


1. Start thriftserver
2. ./bin/beeline -u '...' --hiveconf hive.exec.max.dynamic.partitions=1
3. set hive.exec.max.dynamic.partitions;  --  returns the default value 1000, 
not 1






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186629#comment-15186629
 ] 

Xiao Li commented on SPARK-13393:
-

: ) I do not understand how it works. Could you explain it a little bit more? 

For example, 

{code}
val df = sqlContext.createDataFrame(rdd)
val df1 = df;
val df2 = df1;
val df3 = df1.join(df2, df1("name") === df2("name"))
val df4 = df3.join(df2, df3("name") === df2("name"))
df4.show()
{code}

You will hit the ambiguity error when you call df4.show().

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Updated] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

2016-03-08 Thread Poonam Agrawal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Poonam Agrawal updated SPARK-13767:
---
Description: 
I am trying to create spark context object with the following commands on 
pyspark:

from pyspark import SparkContext, SparkConf
conf = 
SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
 'cassandra-machine-ip').set('spark.storage.memoryFraction', 
'0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
500).set('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
'FAIR').set('spark.mesos.coarse', 'true')
sc = SparkContext(conf=conf)


but I am getting the following error:

Traceback (most recent call last):
File "", line 1, in 
File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in __init__
  self._jconf = _jvm.SparkConf(loadDefaults)
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 766, in __getattr__
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 362, in send_command
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 318, in _get_connection
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 325, in _create_connection
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 432, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
the Java server

I am getting the same error executing the command : conf = 
SparkConf().setAppName("App_name").setMaster("spark://127.0.0.1:7077")




  was:
I am trying to create spark context object with the following commands on 
pyspark:

from pyspark import SparkContext, SparkConf
conf = 
SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
 'cassandra-machine-ip').set('spark.storage.memoryFraction', 
'0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
500).set('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
'FAIR').set('spark.mesos.coarse', 'true')
sc = SparkContext(conf=conf)


but I am getting the following error:

Traceback (most recent call last):
File "", line 1, in 
File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in __init__
  self._jconf = _jvm.SparkConf(loadDefaults)
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 766, in __getattr__
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 362, in send_command
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 318, in _get_connection
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 325, in _create_connection
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 432, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
the Java server





> py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
> the Java server
> 
>
> Key: SPARK-13767
> URL: https://issues.apache.org/jira/browse/SPARK-13767
> Project: Spark
>  Issue Type: Bug
>Reporter: Poonam Agrawal
>
> I am trying to create spark context object with the following commands on 
> pyspark:
> from pyspark import SparkContext, SparkConf
> conf = 
> SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
>  'cassandra-machine-ip').set('spark.storage.memoryFraction', 
> '0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
> 500).set('spark.serializer', 
> 'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
> 'FAIR').set('spark.mesos.coarse', 'true')
> sc = SparkContext(conf=conf)
> but I am getting the following error:
> Traceback (most recent call last):
> File "", line 1, in 
> File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in 
> __init__
>   self._jconf = _jvm.SparkConf(loadDefaults)
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 766, in __getattr__
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 362, in send_command
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  line 318, in _get_connection
> File 
> "/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>  

[jira] [Created] (SPARK-13767) py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server

2016-03-08 Thread Poonam Agrawal (JIRA)
Poonam Agrawal created SPARK-13767:
--

 Summary: py4j.protocol.Py4JNetworkError: An error occurred while 
trying to connect to the Java server
 Key: SPARK-13767
 URL: https://issues.apache.org/jira/browse/SPARK-13767
 Project: Spark
  Issue Type: Bug
Reporter: Poonam Agrawal


I am trying to create spark context object with the following commands on 
pyspark:

from pyspark import SparkContext, SparkConf
conf = 
SparkConf().setAppName('App_name').setMaster("spark://local-or-remote-ip:7077").set('spark.cassandra.connection.host',
 'cassandra-machine-ip').set('spark.storage.memoryFraction', 
'0.2').set('spark.rdd.compress', 'true').set('spark.streaming.blockInterval', 
500).set('spark.serializer', 
'org.apache.spark.serializer.KryoSerializer').set('spark.scheduler.mode', 
'FAIR').set('spark.mesos.coarse', 'true')
sc = SparkContext(conf=conf)


but I am getting the following error:

Traceback (most recent call last):
File "", line 1, in 
File "/usr/local/lib/spark-1.4.1/python/pyspark/conf.py", line 106, in __init__
  self._jconf = _jvm.SparkConf(loadDefaults)
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 766, in __getattr__
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 362, in send_command
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 318, in _get_connection
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 325, in _create_connection
File 
"/usr/local/lib/spark-1.4.1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
 line 432, in start
py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to 
the Java server









[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186605#comment-15186605
 ] 

Adrian Wang commented on SPARK-13393:
-

So that's the reason I have to introduce the `JoinedData` layer to keep the left 
and right dataframe instances; then we can trace which dataframe the user wants 
to project from, using the dataframe info carried in the Column instance (if it 
exists).

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186601#comment-15186601
 ] 

Xiao Li commented on SPARK-13393:
-

Hi, [~adrian-wang]

You know, this is a well-known issue. The ambiguity issues still exist even if 
`Column` instances have the dataframe information. 

Thanks, 

Xiao

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|






[jira] [Updated] (SPARK-13766) Inconsistent file extensions and omitted file extensions written by CSV, TEXT and JSON data sources

2016-03-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13766:
-
Summary: Inconsistent file extensions and omitted file extensions written 
by CSV, TEXT and JSON data sources  (was: Inconsistent file extensions and 
omitting file extensions written by CSV, TEXT and JSON data sources)

> Inconsistent file extensions and omitted file extensions written by CSV, TEXT 
> and JSON data sources
> ---
>
> Key: SPARK-13766
> URL: https://issues.apache.org/jira/browse/SPARK-13766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the output (part-files) from CSV, TEXT and JSON data sources do 
> not have file extensions such as .csv, .txt and .json (except for compression 
> extensions such as .gz, .deflate and .bz4).
> In addition, it looks Parquet has the extensions (in part-files) such as 
> .gz.parquet or .snappy.parquet according to compression codecs whereas ORC 
> does not have such extensions but it is just .orc.
> So, in a simple view, currently the extensions are set as below:
> {code}
> TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
> Parquet -  [.COMPRESSION_CODEC_NAME].parquet
> ORC - .orc
> {code}
> It would be great if we have a consistent naming for them






[jira] [Commented] (SPARK-13393) Column mismatch issue in left_outer join using Spark DataFrame

2016-03-08 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186580#comment-15186580
 ] 

Adrian Wang commented on SPARK-13393:
-

Hi [~srinathsmn]

In this `errorDF`, both `df("id")` and `varadha("id")` have the same `exprId` 
(they both come from `df`), so we cannot disambiguate between them under the 
current design.

As a workaround, you should write code like `correctDF`: assign an alias to the 
columns first, or register df as a table and then use a full SQL query to get 
your data.

I think this is a bug in the current design. We should record the dataframe 
information in `Column` instances and use an internal representation, 
`JoinedData`, as the return value of `def join()`, in order to resolve the 
ambiguity caused by self-joins. For now, even if I write something like

val errorDF = df.join(varadha, df("id") === df("id"), 
"left_outer").select(df("id"), varadha("id") as "varadha_id")

the result would still be the same, since we assume that an ambiguous condition 
should always be resolved to both sides.

I can draft a design doc for this if you are interested.
cc [~smilegator][~rxin][~marmbrus]
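
For reference, a minimal sketch of the aliasing workaround mentioned above (the alias names are made up; this is not the proposed `JoinedData` design):

{code}
import org.apache.spark.sql.functions.col

// Give each side of the self-join its own alias so columns resolve unambiguously.
val left  = df.as("l")
val right = df.filter("id = 1").as("r")

val fixedDF = left.join(right, col("l.id") === col("r.id"), "left_outer")
  .select(col("l.id"), col("r.id") as "varadha_id")
{code}

With distinct aliases the analyzer should be able to resolve each side independently, which is the same effect `correctDF` achieves by renaming the column up front.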

> Column mismatch issue in left_outer join using Spark DataFrame
> --
>
> Key: SPARK-13393
> URL: https://issues.apache.org/jira/browse/SPARK-13393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Varadharajan
>
> Consider the below snippet:
> {code:title=test.scala|borderStyle=solid}
> case class Person(id: Int, name: String)
> val df = sc.parallelize(List(
>   Person(1, "varadha"),
>   Person(2, "nagaraj")
> )).toDF
> val varadha = df.filter("id = 1")
> val errorDF = df.join(varadha, df("id") === varadha("id"), 
> "left_outer").select(df("id"), varadha("id") as "varadha_id")
> val nagaraj = df.filter("id = 2").select(df("id") as "n_id")
> val correctDF = df.join(nagaraj, df("id") === nagaraj("n_id"), 
> "left_outer").select(df("id"), nagaraj("n_id") as "nagaraj_id")
> {code}
> The `errorDF` dataframe, after the left join is messed up and shows as below:
> | id|varadha_id|
> |  1| 1|
> |  2| 2 (*This should've been null*)| 
> whereas correctDF has the correct output after the left join:
> | id|nagaraj_id|
> |  1|  null|
> |  2| 2|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13766) Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources

2016-03-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13766:
-
Description: 
Currently, the output (part-files) from CSV, TEXT and JSON data sources do not 
have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz4).

In addition, it looks Parquet has the extensions (in part-files) such as 
.gz.parquet or .snappy.parquet according to compression codecs whereas ORC does 
not have such extensions but it is just .orc.

So, in a simple view,

{code}
TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
Parquet -  [.COMPRESSION_CODEC_NAME].parquet
ORC - .orc
{code}

It would be great if we have a consistent naming for them

  was:
Currently, the output (part-files) from CSV, TEXT and JSON data sources does 
not have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz4).

In addition, it looks Parquet has the extensions (in part-files) such as 
.gz.parquet or .snappy.parquet according to compression codecs whereas ORC does 
not have such extensions but it is just .orc.

It would be great if we have a consistent naming for them


> Inconsistent file extensions and omitting file extensions written by CSV, 
> TEXT and JSON data sources
> 
>
> Key: SPARK-13766
> URL: https://issues.apache.org/jira/browse/SPARK-13766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the output (part-files) from CSV, TEXT and JSON data sources do 
> not have file extensions such as .csv, .txt and .json (except for compression 
> extensions such as .gz, .deflate and .bz4).
> In addition, it looks Parquet has the extensions (in part-files) such as 
> .gz.parquet or .snappy.parquet according to compression codecs whereas ORC 
> does not have such extensions but it is just .orc.
> So, in a simple view,
> {code}
> TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
> Parquet -  [.COMPRESSION_CODEC_NAME].parquet
> ORC - .orc
> {code}
> It would be great if we have a consistent naming for them






[jira] [Updated] (SPARK-13766) Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources

2016-03-08 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-13766:
-
Description: 
Currently, the output (part-files) from CSV, TEXT and JSON data sources do not 
have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz4).

In addition, it looks Parquet has the extensions (in part-files) such as 
.gz.parquet or .snappy.parquet according to compression codecs whereas ORC does 
not have such extensions but it is just .orc.

So, in a simple view, currently the extensions are set as below:

{code}
TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
Parquet -  [.COMPRESSION_CODEC_NAME].parquet
ORC - .orc
{code}

It would be great if we have a consistent naming for them

  was:
Currently, the output (part-files) from CSV, TEXT and JSON data sources do not 
have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz4).

In addition, it looks Parquet has the extensions (in part-files) such as 
.gz.parquet or .snappy.parquet according to compression codecs whereas ORC does 
not have such extensions but it is just .orc.

So, in a simple view,

{code}
TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
Parquet -  [.COMPRESSION_CODEC_NAME].parquet
ORC - .orc
{code}

It would be great if we have a consistent naming for them


> Inconsistent file extensions and omitting file extensions written by CSV, 
> TEXT and JSON data sources
> 
>
> Key: SPARK-13766
> URL: https://issues.apache.org/jira/browse/SPARK-13766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the output (part-files) from CSV, TEXT and JSON data sources do 
> not have file extensions such as .csv, .txt and .json (except for compression 
> extensions such as .gz, .deflate and .bz4).
> In addition, it looks Parquet has the extensions (in part-files) such as 
> .gz.parquet or .snappy.parquet according to compression codecs whereas ORC 
> does not have such extensions but it is just .orc.
> So, in a simple view, currently the extensions are set as below:
> {code}
> TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
> Parquet -  [.COMPRESSION_CODEC_NAME].parquet
> ORC - .orc
> {code}
> It would be great if we have a consistent naming for them






[jira] [Commented] (SPARK-13766) Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources

2016-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186576#comment-15186576
 ] 

Hyukjin Kwon commented on SPARK-13766:
--

I will work on this.

> Inconsistent file extensions and omitting file extensions written by CSV, 
> TEXT and JSON data sources
> 
>
> Key: SPARK-13766
> URL: https://issues.apache.org/jira/browse/SPARK-13766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the output part-files from the CSV, TEXT and JSON data sources do 
> not have file extensions such as .csv, .txt and .json (except for compression 
> extensions such as .gz, .deflate and .bz4).
> In addition, it looks like Parquet part-files have extensions such as 
> .gz.parquet or .snappy.parquet depending on the compression codec, whereas ORC 
> part-files have only .orc.
> It would be great if we had consistent naming for them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13766) Inconsistent file extensions and omitting file extensions written by CSV, TEXT and JSON data sources

2016-03-08 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-13766:


 Summary: Inconsistent file extensions and omitting file extensions 
written by CSV, TEXT and JSON data sources
 Key: SPARK-13766
 URL: https://issues.apache.org/jira/browse/SPARK-13766
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, the output part-files from the CSV, TEXT and JSON data sources do not 
have file extensions such as .csv, .txt and .json (except for compression 
extensions such as .gz, .deflate and .bz4).

In addition, it looks like Parquet part-files have extensions such as 
.gz.parquet or .snappy.parquet depending on the compression codec, whereas ORC 
part-files have only .orc.

It would be great if we had consistent naming for them.
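
To make the inconsistency easy to observe, here is a minimal, hedged sketch (not part of any proposed fix) that writes the same small DataFrame with two of the built-in data sources and prints the resulting part-file names. The output location under /tmp and the column names are made up for illustration, and the same check applies to the other sources mentioned above.

{code}
import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PartFileExtensionCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartFileExtensionCheck").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val base = "/tmp/part-file-extension-check"  // illustrative output location

    // Write the same data with two data sources and compare the part-file names.
    df.write.mode("overwrite").json(s"$base/json")        // part files carry no .json extension
    df.write.mode("overwrite").parquet(s"$base/parquet")  // part files end in .gz.parquet or .snappy.parquet, depending on the codec

    for (fmt <- Seq("json", "parquet")) {
      val names = new File(s"$base/$fmt").listFiles().map(_.getName).filter(_.startsWith("part-"))
      println(s"$fmt -> ${names.mkString(", ")}")
    }

    sc.stop()
  }
}
{code}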



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13765) method specialStateTransition(int, IntStream) is exceeding the 65535 bytes limit

2016-03-08 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13765:

Description: 
The Eclipse Scala IDE complains about a Java problem (*please see attached 
screenshot*), but IntelliJ does not complain about it.

I'm not sure whether this is a bug.

{code}
The code of method specialStateTransition(int, IntStream) is exceeding the 
65535 bytes limit

SparkSqlParser_IdentifiersParser.java   
/spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
   line 40380
{code}



  was:
The IDE complains about a Java problem (attached screenshot)

{code}
The code of method specialStateTransition(int, IntStream) is exceeding the 
65535 bytes limit

SparkSqlParser_IdentifiersParser.java   
/spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
   line 40380
{code}




> method specialStateTransition(int, IntStream) is exceeding the 65535 bytes 
> limit
> 
>
> Key: SPARK-13765
> URL: https://issues.apache.org/jira/browse/SPARK-13765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Eclipse-Scala IDE
> sitting on master branch
>Reporter: Xin Ren
> Attachments: Screen Shot 2016-03-08 at 9.52.48 PM.png
>
>
> The Eclipse Scala IDE complains about a Java problem (*please see attached 
> screenshot*), but IntelliJ does not complain about it.
> I'm not sure whether this is a bug.
> {code}
> The code of method specialStateTransition(int, IntStream) is exceeding the 
> 65535 bytes limit
> SparkSqlParser_IdentifiersParser.java 
> /spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
>line 40380
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13765) method specialStateTransition(int, IntStream) is exceeding the 65535 bytes limit

2016-03-08 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13765:

Description: 
The IDE complains about a Java problem (attached screenshot)

{code}
The code of method specialStateTransition(int, IntStream) is exceeding the 
65535 bytes limit

SparkSqlParser_IdentifiersParser.java   
/spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
   line 40380
{code}



  was:
The IDE complains about a Java problem (attached screenshot)

The code of method specialStateTransition(int, IntStream) is exceeding the 
65535 bytes limit

SparkSqlParser_IdentifiersParser.java   
/spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
   line 40380




> method specialStateTransition(int, IntStream) is exceeding the 65535 bytes 
> limit
> 
>
> Key: SPARK-13765
> URL: https://issues.apache.org/jira/browse/SPARK-13765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Eclipse-Scala IDE
> sitting on master branch
>Reporter: Xin Ren
> Attachments: Screen Shot 2016-03-08 at 9.52.48 PM.png
>
>
> The IDE complains about a Java problem (attached screenshot)
> {code}
> The code of method specialStateTransition(int, IntStream) is exceeding the 
> 65535 bytes limit
> SparkSqlParser_IdentifiersParser.java 
> /spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
>line 40380
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13765) method specialStateTransition(int, IntStream) is exceeding the 65535 bytes limit

2016-03-08 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-13765:

Attachment: Screen Shot 2016-03-08 at 9.52.48 PM.png

> method specialStateTransition(int, IntStream) is exceeding the 65535 bytes 
> limit
> 
>
> Key: SPARK-13765
> URL: https://issues.apache.org/jira/browse/SPARK-13765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Eclipse-Scala IDE
> sitting on master branch
>Reporter: Xin Ren
> Attachments: Screen Shot 2016-03-08 at 9.52.48 PM.png
>
>
> The IDE complains about a Java problem (attached screenshot)
> The code of method specialStateTransition(int, IntStream) is exceeding the 
> 65535 bytes limit
> SparkSqlParser_IdentifiersParser.java 
> /spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
>line 40380



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13765) method specialStateTransition(int, IntStream) is exceeding the 65535 bytes limit

2016-03-08 Thread Xin Ren (JIRA)
Xin Ren created SPARK-13765:
---

 Summary: method specialStateTransition(int, IntStream) is 
exceeding the 65535 bytes limit
 Key: SPARK-13765
 URL: https://issues.apache.org/jira/browse/SPARK-13765
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
 Environment: Eclipse-Scala IDE

sitting on master branch
Reporter: Xin Ren


The IDE complains about a Java problem (attached screenshot)

The code of method specialStateTransition(int, IntStream) is exceeding the 
65535 bytes limit

SparkSqlParser_IdentifiersParser.java   
/spark-catalyst_2.11/target/generated-sources/antlr3/org/apache/spark/sql/catalyst/parser
   line 40380





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13764) Parse modes in JSON data source

2016-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186557#comment-15186557
 ] 

Hyukjin Kwon commented on SPARK-13764:
--

I will try to work on this (after looking a bit deeper).

> Parse modes in JSON data source
> ---
>
> Key: SPARK-13764
> URL: https://issues.apache.org/jira/browse/SPARK-13764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, the JSON data source simply fails to read the data if some JSON 
> documents are malformed.
> Therefore, if there are two JSON documents below:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> This will fail, emitting the exception below:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> So, just like the parse modes in the CSV data source (see 
> https://github.com/databricks/spark-csv), it would be great if there were some 
> parse modes so that users do not have to filter or pre-process the data themselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException

2016-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186554#comment-15186554
 ] 

Hyukjin Kwon commented on SPARK-13719:
--

I opened a JIRA here: SPARK-13764.

Could we maybe mark this JIRA as a duplicate, since this one does not suggest 
any possible approaches?

> Bad JSON record raises java.lang.ClassCastException
> 
>
> Key: SPARK-13719
> URL: https://issues.apache.org/jira/browse/SPARK-13719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: OS X, Linux
>Reporter: dmtran
>Priority: Minor
>
> I have defined a JSON schema, using org.apache.spark.sql.types.StructType, 
> that expects this kind of record:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> There's a bad record in my dataset that defines the field "user" as an array 
> instead of a JSON object:
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> The following exception is raised because of that bad record:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Here's a code snippet that reproduces the exception:
> {noformat}
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{SQLContext, DataFrame}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> object Snippet {
>   def main(args : Array[String]): Unit = {
> val sc = new SparkContext()
> implicit val sqlContext = new HiveContext(sc)
> val rdd: RDD[String] = sc.parallelize(Seq(badRecord))
> val df: DataFrame = sqlContext.read.schema(schema).json(rdd)
> import sqlContext.implicits._
> df.select("request.user.id")
>   .filter($"id".isNotNull)
>   .count()
>   }
>   val badRecord =
> s"""{
> |  "request": {
> |"user": []
> |  }
> |}""".stripMargin.replaceAll("\n", 

[jira] [Created] (SPARK-13764) Parse modes in JSON data source

2016-03-08 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-13764:


 Summary: Parse modes in JSON data source
 Key: SPARK-13764
 URL: https://issues.apache.org/jira/browse/SPARK-13764
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, the JSON data source simply fails to read the data if some JSON 
documents are malformed.

Therefore, if there are two JSON documents below:

{noformat}
{
  "request": {
"user": {
  "id": 123
}
  }
}
{noformat}

{noformat}
{
  "request": {
"user": []
  }
}
{noformat}

This will fail, emitting the exception below:
{noformat}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to 
stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost 
task 7.3 in stage 0.0 (TID 10, 192.168.1.170): java.lang.ClassCastException: 
org.apache.spark.sql.types.GenericArrayData cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
at 
org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
at 
org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

So, just like the parse modes in the CSV data source (see 
https://github.com/databricks/spark-csv), it would be great if there were some 
parse modes so that users do not have to filter or pre-process the data themselves.
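
As a point of comparison, here is a rough, self-contained sketch (not a proposed API) of the kind of manual pre-filtering that a parse mode such as DROPMALFORMED would make unnecessary; the crude string check on the "user" field is purely illustrative:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DropMalformedByHand {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DropMalformedByHand").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    val good = """{"request": {"user": {"id": 123}}}"""
    val bad  = """{"request": {"user": []}}"""
    val lines = sc.parallelize(Seq(good, bad))

    // Drop records whose "user" field is an array before the JSON data source sees them.
    val cleaned = lines.filter(line => !line.replaceAll("\\s", "").contains("\"user\":["))

    val df = sqlContext.read.json(cleaned)
    df.select("request.user.id").show()

    sc.stop()
  }
}
{code}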



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException

2016-03-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186521#comment-15186521
 ] 

Reynold Xin commented on SPARK-13719:
-

Yes - that would be great if possible.


> Bad JSON record raises java.lang.ClassCastException
> 
>
> Key: SPARK-13719
> URL: https://issues.apache.org/jira/browse/SPARK-13719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: OS X, Linux
>Reporter: dmtran
>Priority: Minor
>
> I have defined a JSON schema, using org.apache.spark.sql.types.StructType, 
> that expects this kind of record:
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> There's a bad record in my dataset that defines the field "user" as an array 
> instead of a JSON object:
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> The following exception is raised because of that bad record:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Here's a code snippet that reproduces the exception:
> {noformat}
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{SQLContext, DataFrame}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> object Snippet {
>   def main(args : Array[String]): Unit = {
> val sc = new SparkContext()
> implicit val sqlContext = new HiveContext(sc)
> val rdd: RDD[String] = sc.parallelize(Seq(badRecord))
> val df: DataFrame = sqlContext.read.schema(schema).json(rdd)
> import sqlContext.implicits._
> df.select("request.user.id")
>   .filter($"id".isNotNull)
>   .count()
>   }
>   val badRecord =
> s"""{
> |  "request": {
> |"user": []
> |  }
> |}""".stripMargin.replaceAll("\n", " ") // Convert the multiline 
> string to a signe line string
>   val schema =
> StructType(
> 

[jira] [Commented] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition

2016-03-08 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186488#comment-15186488
 ] 

Sital Kedia commented on SPARK-5581:


[~joshrosen] - The issue is not only opening/closing the file output stream for 
each partition; we are also flushing the data to disk for each partition. So 
when the partitions are small and there are many of them, the disk I/O cost is 
very high. Do you have any idea how we can avoid that?
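
For what it's worth, a purely conceptual sketch of the direction being discussed, using plain java.io rather than Spark's internal disk-writer classes (so it is illustrative only, not the real shuffle writer): stream every partition through a single buffered output and record per-partition lengths, so the file is opened, flushed and closed once per map task rather than once per partition.

{code}
import java.io.{BufferedOutputStream, File, FileOutputStream}

object SinglePassPartitionWriter {
  /** Writes all partitions through one stream and returns the per-partition byte lengths. */
  def writeAllPartitions(outputFile: File,
                         partitions: Iterator[(Int, Iterator[Array[Byte]])],
                         numPartitions: Int): Array[Long] = {
    val lengths = new Array[Long](numPartitions)
    val out = new BufferedOutputStream(new FileOutputStream(outputFile))
    try {
      for ((id, elements) <- partitions) {
        var written = 0L
        for (elem <- elements) {
          out.write(elem)
          written += elem.length
        }
        lengths(id) = written  // record the segment length; no per-partition flush or close
      }
    } finally {
      out.close()              // a single flush and close for the whole map output
    }
    lengths
  }
}
{code}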

> When writing sorted map output file, avoid open / close between each partition
> --
>
> Key: SPARK-5581
> URL: https://issues.apache.org/jira/browse/SPARK-5581
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>
> {code}
>   // Bypassing merge-sort; get an iterator by partition and just write 
> everything directly.
>   for ((id, elements) <- this.partitionedIterator) {
> if (elements.hasNext) {
>   val writer = blockManager.getDiskWriter(
> blockId, outputFile, ser, fileBufferSize, 
> context.taskMetrics.shuffleWriteMetrics.get)
>   for (elem <- elements) {
> writer.write(elem)
>   }
>   writer.commitAndClose()
>   val segment = writer.fileSegment()
>   lengths(id) = segment.length
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13763) Remove Project when its projectList is Empty

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13763:


Assignee: Apache Spark

> Remove Project when its projectList is Empty
> 
>
> Key: SPARK-13763
> URL: https://issues.apache.org/jira/browse/SPARK-13763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> We use 'SELECT 1' as a dummy table when a SQL statement requires a table 
> reference but the contents of the table are not important. For example: 
> {code}
> SELECT pageid, adid FROM (SELECT 1) dummyTable LATERAL VIEW 
> explode(adid_list) adTable AS adid;
> {code}
> In this case, we see a useless Project whose projectList is empty after the 
> ColumnPruning rule runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13763) Remove Project when its projectList is Empty

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186479#comment-15186479
 ] 

Apache Spark commented on SPARK-13763:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11599

> Remove Project when its projectList is Empty
> 
>
> Key: SPARK-13763
> URL: https://issues.apache.org/jira/browse/SPARK-13763
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Xiao Li
>
> We use 'SELECT 1' as a dummy table when a SQL statement requires a table 
> reference but the contents of the table are not important. For example: 
> {code}
> SELECT pageid, adid FROM (SELECT 1) dummyTable LATERAL VIEW 
> explode(adid_list) adTable AS adid;
> {code}
> In this case, we see a useless Project whose projectList is empty after the 
> ColumnPruning rule runs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13763) Remove Project when its projectList is Empty

2016-03-08 Thread Xiao Li (JIRA)
Xiao Li created SPARK-13763:
---

 Summary: Remove Project when its projectList is Empty
 Key: SPARK-13763
 URL: https://issues.apache.org/jira/browse/SPARK-13763
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Xiao Li


We use 'SELECT 1' as a dummy table when a SQL statement requires a table 
reference but the contents of the table are not important. For example:
{code}
SELECT pageid, adid FROM (SELECT 1) dummyTable LATERAL VIEW explode(adid_list) 
adTable AS adid;
{code}

In this case, we see a useless Project whose projectList is empty after the 
ColumnPruning rule runs.
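
For illustration, a rough sketch of one shape such a cleanup could take as a Catalyst rule; the actual change in the pull request mentioned earlier in this thread may well differ (for instance, it may only remove the node when the parent does not rely on the Project's output):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Project}
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule name, for illustration only: drop a Project whose projectList is empty.
object RemoveEmptyProject extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Project(projectList, child) if projectList.isEmpty => child
  }
}
{code}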



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13392) KafkaSink for Metrics

2016-03-08 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186445#comment-15186445
 ] 

Liwei Lin commented on SPARK-13392:
---

Thanks for working on this!

You don't have to be assigned to open a PR on GitHub. Once that PR is merged, 
this JIRA will be assigned to you.

> KafkaSink for Metrics
> -
>
> Key: SPARK-13392
> URL: https://issues.apache.org/jira/browse/SPARK-13392
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: UTKARSH BHATNAGAR
>Priority: Minor
>
> I would like to push metrics from Spark jobs directly into a Kafka topic 
> via a KafkaSink. I will write the KafkaSink asap and submit a PR.
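
In case it helps the discussion, a rough sketch of what such a sink might look like, assuming the reflective (Properties, MetricRegistry, SecurityManager) constructor that Spark's MetricsSystem uses for its other sinks and the standard Kafka producer API; the property names "topic" and "brokers" are made up for illustration, and the real KafkaSink proposed here may look quite different:

{code}
package org.apache.spark.metrics.sink

import java.util.Properties

import scala.collection.JavaConverters._

import com.codahale.metrics.MetricRegistry
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Illustrative sketch only; not the sink proposed in this ticket.
class KafkaSink(val property: Properties, val registry: MetricRegistry,
                securityMgr: org.apache.spark.SecurityManager) extends Sink {
  private val topic = property.getProperty("topic", "spark-metrics")

  private val producerProps = new Properties()
  producerProps.put("bootstrap.servers", property.getProperty("brokers", "localhost:9092"))
  producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](producerProps)

  override def start(): Unit = {}
  override def stop(): Unit = producer.close()

  // Push the current gauge values as one message per metric.
  override def report(): Unit = {
    registry.getGauges.asScala.foreach { case (name, gauge) =>
      producer.send(new ProducerRecord[String, String](topic, name, s"${gauge.getValue}"))
    }
  }
}
{code}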



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12697:


Assignee: (was: Apache Spark)

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>  Labels: features
>
> I am analyzing streaming input from Kafka. For example, I calculate the average 
> use of a resource over a period of time. It works great.
> The problem is that if I need to change the analysis, for example switch from 
> average to max or add another item to the list of items to analyze, then I have 
> to stop the Spark application and restart it (at least I have to stop the 
> streaming context and restart it).
> I think it would be a great addition if Spark could add new streams on the 
> fly or modify an existing stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13743) Adding configurable support for Spark Streaming graceful timeout

2016-03-08 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186431#comment-15186431
 ] 

Liwei Lin commented on SPARK-13743:
---

[~nyuval] thanks for reporting this!
Here 1 hour is just the maximum await time; the actual await time can be 
much shorter if there is not much data left to process.
For a streaming application, I think 1 hour is enough as an upper bound, right?

> Adding configurable support for Spark Streaming graceful timeout
> -
>
> Key: SPARK-13743
> URL: https://issues.apache.org/jira/browse/SPARK-13743
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Yuval Itzchakov
>Priority: Minor
>
> Spark Streaming supports graceful shutdown via the 
> "spark.streaming.stopGracefullyOnShutdown" configuration property. The actual 
> graceful shutdown period for the Spark job is hardcoded into 
> `JobScheduler.stop()`:
> // Wait for the queued jobs to complete if indicated
> val terminated = if (processAllReceivedData) {
>   jobExecutor.awaitTermination(1, TimeUnit.HOURS)  // just a very large 
> period of time
> } else {
>   jobExecutor.awaitTermination(2, TimeUnit.SECONDS)
> }
> I think we could greatly benefit if this setting were not hardcoded to one hour 
> but were configurable via an additional Spark configuration key.
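
A tiny self-contained sketch of the configurable variant being suggested, using SparkConf's existing time-string parsing; the key name "spark.streaming.gracefulStopTimeout" is hypothetical, and `jobExecutor` here is just a stand-in thread pool rather than the real JobScheduler internals:

{code}
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.SparkConf

object GracefulStopTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Hypothetical key; falls back to the current one-hour behaviour when unset.
    val timeoutSec = conf.getTimeAsSeconds("spark.streaming.gracefulStopTimeout", "1h")

    val jobExecutor = Executors.newFixedThreadPool(1)
    jobExecutor.shutdown()
    val terminated = jobExecutor.awaitTermination(timeoutSec, TimeUnit.SECONDS)
    println(s"queued jobs finished within $timeoutSec seconds: $terminated")
  }
}
{code}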



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186432#comment-15186432
 ] 

Apache Spark commented on SPARK-12697:
--

User 'zuowang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11598

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>  Labels: features
>
> I am analyzing streaming input from Kafka. For example, I calculate the average 
> use of a resource over a period of time. It works great.
> The problem is that if I need to change the analysis, for example switch from 
> average to max or add another item to the list of items to analyze, then I have 
> to stop the Spark application and restart it (at least I have to stop the 
> streaming context and restart it).
> I think it would be a great addition if Spark could add new streams on the 
> fly or modify an existing stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12697:


Assignee: Apache Spark

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>Assignee: Apache Spark
>  Labels: features
>
> I am analyzing streaming input from Kafka. For example, I calculate the average 
> use of a resource over a period of time. It works great.
> The problem is that if I need to change the analysis, for example switch from 
> average to max or add another item to the list of items to analyze, then I have 
> to stop the Spark application and restart it (at least I have to stop the 
> streaming context and restart it).
> I think it would be a great addition if Spark could add new streams on the 
> fly or modify an existing stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13762) support column names only in schema string at createDataFrame

2016-03-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13762:
---

 Summary: support column names only in schema string at 
createDataFrame
 Key: SPARK-13762
 URL: https://issues.apache.org/jira/browse/SPARK-13762
 Project: Spark
  Issue Type: Improvement
Reporter: Wenchen Fan


for example, `createDataFrame(rdd, "a b c")`
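
One possible reading of that idea, sketched below with every field defaulting to StringType; this is only an illustration, not the eventual API or behaviour:

{code}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object NamesOnlySchema {
  // Turn a whitespace-separated list of column names into a schema of nullable strings.
  def schemaFromNames(names: String): StructType =
    StructType(names.trim.split("\\s+").map(name => StructField(name, StringType, nullable = true)))

  def main(args: Array[String]): Unit = {
    // schemaFromNames("a b c") gives three StringType fields named a, b and c.
    println(schemaFromNames("a b c").treeString)
  }
}
{code}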



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13762) support only column names in schema string at createDataFrame

2016-03-08 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-13762:

Summary: support only column names in schema string at createDataFrame  
(was: support column names only in schema string at createDataFrame)

> support only column names in schema string at createDataFrame
> -
>
> Key: SPARK-13762
> URL: https://issues.apache.org/jira/browse/SPARK-13762
> Project: Spark
>  Issue Type: Improvement
>Reporter: Wenchen Fan
>
> for example, `createDataFrame(rdd, "a b c")`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13761) Deprecate validateParams

2016-03-08 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186405#comment-15186405
 ] 

yuhao yang commented on SPARK-13761:


Hi [~josephkb], do you mind if I work on this?

> Deprecate validateParams
> 
>
> Key: SPARK-13761
> URL: https://issues.apache.org/jira/browse/SPARK-13761
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Deprecate validateParams() method here: 
> [https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553]
> Move all functionality in overridden methods to transformSchema().
> Check docs to make sure they indicate complex Param interaction checks should 
> be done in transformSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7146) Should ML sharedParams be a public API?

2016-03-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7146:
-
Target Version/s: 2.0.0  (was: )

> Should ML sharedParams be a public API?
> ---
>
> Key: SPARK-7146
> URL: https://issues.apache.org/jira/browse/SPARK-7146
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Discussion: Should the Param traits in sharedParams.scala be public?
> Pros:
> * Sharing the Param traits helps to encourage standardized Param names and 
> documentation.
> Cons:
> * Users have to be careful since parameters can have different meanings for 
> different algorithms.
> * If the shared Params are public, then implementations could test for the 
> traits.  It is unclear if we want users to rely on these traits, which are 
> somewhat experimental.
> Currently, the shared params are private.
> Proposal: Either
> (a) make the shared params private to encourage users to write specialized 
> documentation and value checks for parameters, or
> (b) design a better way to encourage overriding documentation and parameter 
> value checks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13761) Deprecate validateParams

2016-03-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13761:
--
Description: 
Deprecate validateParams() method here: 
[https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553]

Move all functionality in overridden methods to transformSchema().

Check docs to make sure they indicate complex Param interaction checks should 
be done in transformSchema.

> Deprecate validateParams
> 
>
> Key: SPARK-13761
> URL: https://issues.apache.org/jira/browse/SPARK-13761
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Deprecate validateParams() method here: 
> [https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553]
> Move all functionality in overridden methods to transformSchema().
> Check docs to make sure they indicate complex Param interaction checks should 
> be done in transformSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10620) Look into whether accumulator mechanism can replace TaskMetrics

2016-03-08 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186379#comment-15186379
 ] 

Liwei Lin commented on SPARK-10620:
---

hi [~andrewor14], in the "\[3\] A Simpler Accumulator API" section of the 
design doc:
{quote}
Since the design of this is mostly orthogonal to the rest of this document, 
here we only outline 
the desire for a new, simpler API, and does not discuss the solution. The 
actual design will be in 
a separate design doc.
{quote}

Is that separate "Simpler Accumulator API" design doc available anywhere, please? 
Thanks!

> Look into whether accumulator mechanism can replace TaskMetrics
> ---
>
> Key: SPARK-10620
> URL: https://issues.apache.org/jira/browse/SPARK-10620
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
> Fix For: 2.0.0
>
> Attachments: accums-and-task-metrics.pdf
>
>
> This task is simply to explore whether the internal representation used by 
> TaskMetrics could be performed by using accumulators rather than having two 
> separate mechanisms. Note that we need to continue to preserve the existing 
> "Task Metric" data structures that are exposed to users through event logs 
> etc. The question is can we use a single internal codepath and perhaps make 
> this easier to extend in the future.
> I think a full exploration would answer the following questions:
> - How do the semantics of accumulators on stage retries differ from aggregate 
> TaskMetrics for a stage? Could we implement clearer retry semantics for 
> internal accumulators to allow them to be the same - for instance, zeroing 
> accumulator values if a stage is retried (see discussion here: SPARK-10042).
> - Are there metrics that do not fit well into the accumulator model, or would 
> be difficult to update as an accumulator.
> - If we expose metrics through accumulators in the future rather than 
> continuing to add fields to TaskMetrics, what is the best way to coerce 
> compatibility?
> - Are there any other considerations?
> - Is it worth it to do this, or is the consolidation too complicated to 
> justify?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13761) Deprecate validateParams

2016-03-08 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-13761:
-

 Summary: Deprecate validateParams
 Key: SPARK-13761
 URL: https://issues.apache.org/jira/browse/SPARK-13761
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13760) Fix BigDecimal constructor for FloatType

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186375#comment-15186375
 ] 

Apache Spark commented on SPARK-13760:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11597

> Fix BigDecimal constructor for FloatType
> 
>
> Key: SPARK-13760
> URL: https://issues.apache.org/jira/browse/SPARK-13760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Trivial
>
> Use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The 
> latter is deprecated and can result in inconsistencies due to an implicit 
> conversion to `Double`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13760) Fix BigDecimal constructor for FloatType

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13760:


Assignee: (was: Apache Spark)

> Fix BigDecimal constructor for FloatType
> 
>
> Key: SPARK-13760
> URL: https://issues.apache.org/jira/browse/SPARK-13760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Trivial
>
> Use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The 
> latter is deprecated and can result in inconsistencies due to an implicit 
> conversion to `Double`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13760) Fix BigDecimal constructor for FloatType

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13760:


Assignee: Apache Spark

> Fix BigDecimal constructor for FloatType
> 
>
> Key: SPARK-13760
> URL: https://issues.apache.org/jira/browse/SPARK-13760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Trivial
>
> Use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The 
> latter is deprecated and can result in inconsistencies due to an implicit 
> conversion to `Double`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13760) Fix BigDecimal constructor for FloatType

2016-03-08 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13760:
--

 Summary: Fix BigDecimal constructor for FloatType
 Key: SPARK-13760
 URL: https://issues.apache.org/jira/browse/SPARK-13760
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Sameer Agarwal
Priority: Trivial


Use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The 
latter is deprecated and can result in inconsistencies due to an implicit 
conversion to `Double`.
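
A small standalone illustration of the inconsistency described above (plain Scala, no Spark required); the printed values assume the usual IEEE-754 behaviour of Float and Double:

{code}
object BigDecimalFloatDemo {
  def main(args: Array[String]): Unit = {
    val f = 0.1f

    // Widening the Float to Double first keeps the float's binary representation
    // error, so this typically prints 0.10000000149011612.
    println(BigDecimal(f.toDouble))

    // BigDecimal.decimal uses the Float's own shortest decimal rendering,
    // so this prints 0.1.
    println(BigDecimal.decimal(f))
  }
}
{code}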



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186349#comment-15186349
 ] 

Apache Spark commented on SPARK-12719:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11596

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{HiveCompatibilitySuite}} can be useful for bootstrapping test coverage. 
> Please refer to SPARK-11012 for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13758:


Assignee: Apache Spark

> Error message is misleading when RDD refer to null spark context
> 
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Mao, Wei
>Assignee: Apache Spark
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
> correctly the first time, but throws the following exception when it is 
> restarted and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
>   at org.apache.spark.rdd.RDD.union(RDD.scala:565)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, it looks as if I invoked transformations and actions 
> inside other transformations, but I did not. The real reason is that I used an 
> external RDD in a DStream operation. External RDD data is not stored in the 
> checkpoint, so during recovery the initial value of _sc in this RDD is null and 
> we hit the above exception. The error message is misleading: it indicates 
> nothing about the real issue.
> Here is the code to reproduce it.
> {code:java}
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: 
> String):StreamingContext = {
> println("Creating new context")
> val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> ssc.checkpoint(checkpointDirectory)
> var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
> val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
> words.foreachRDD((rdd: RDD[String]) => {
>   val res = rdd.map(word => (word, word.length)).collect()
>   println("words: " + res.mkString(", "))
>   cached = cached.union(rdd)
>   cached.checkpoint()
>   println("cached words: " + cached.collect.mkString(", "))
> })
> ssc
>   }
>   def main(args: Array[String]) {
> val ip = "localhost"
> val port = 
> val dir = "/home/maowei/tmp"
> val ssc = StreamingContext.getOrCreate(dir,
>   () => {
> createContext(ip, port, dir)
>   })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186337#comment-15186337
 ] 

Apache Spark commented on SPARK-13758:
--

User 'mwws' has created a pull request for this issue:
https://github.com/apache/spark/pull/11595

> Error message is misleading when RDD refer to null spark context
> 
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Mao, Wei
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
> correctly the first time, but throws the following exception when it is 
> restarted and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
>   at org.apache.spark.rdd.RDD.union(RDD.scala:565)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, it looks as if I invoked transformations and actions 
> inside other transformations, but I did not. The real reason is that I used an 
> external RDD in a DStream operation. External RDD data is not stored in the 
> checkpoint, so during recovery the initial value of _sc in this RDD is null and 
> we hit the above exception. The error message is misleading: it indicates 
> nothing about the real issue.
> Here is the code to reproduce it.
> {code:java}
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: 
> String):StreamingContext = {
> println("Creating new context")
> val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> ssc.checkpoint(checkpointDirectory)
> var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
> val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
> words.foreachRDD((rdd: RDD[String]) => {
>   val res = rdd.map(word => (word, word.length)).collect()
>   println("words: " + res.mkString(", "))
>   cached = cached.union(rdd)
>   cached.checkpoint()
>   println("cached words: " + cached.collect.mkString(", "))
> })
> ssc
>   }
>   def main(args: Array[String]) {
> val ip = "localhost"
> val port = 
> val dir = "/home/maowei/tmp"
> val ssc = StreamingContext.getOrCreate(dir,
>   () => {
> createContext(ip, port, dir)
>   })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13758:


Assignee: (was: Apache Spark)

> Error message is misleading when RDD refer to null spark context
> 
>
> Key: SPARK-13758
> URL: https://issues.apache.org/jira/browse/SPARK-13758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Streaming
>Reporter: Mao, Wei
>
> We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
> correctly the first time, but throws the following exception when restarted 
> and recovered from the checkpoint.
> {noformat}
> org.apache.spark.SparkException: RDD transformations and actions can only be 
> invoked by the driver, not inside of other transformations; for example, 
> rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
> transformation and count action cannot be performed inside of the rdd1.map 
> transformation. For more information, see SPARK-5063.
>   at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
>   at org.apache.spark.rdd.RDD.union(RDD.scala:565)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
>   at 
> org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
> ...
> {noformat}
> According to the exception, I invoked transformations and actions inside other 
> transformations, but I did not. The real reason is that I used an external RDD 
> in a DStream operation. External RDD data is not stored in the checkpoint, so 
> during recovery the initial value of _sc in this RDD is null, which triggers 
> the exception above. The error message is misleading: it says nothing about 
> the real issue.
> Here is the code to reproduce it.
> {code:java}
> object Repo {
>   def createContext(ip: String, port: Int, checkpointDirectory: 
> String):StreamingContext = {
> println("Creating new context")
> val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
> val ssc = new StreamingContext(sparkConf, Seconds(2))
> ssc.checkpoint(checkpointDirectory)
> var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
> val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
> words.foreachRDD((rdd: RDD[String]) => {
>   val res = rdd.map(word => (word, word.length)).collect()
>   println("words: " + res.mkString(", "))
>   cached = cached.union(rdd)
>   cached.checkpoint()
>   println("cached words: " + cached.collect.mkString(", "))
> })
> ssc
>   }
>   def main(args: Array[String]) {
> val ip = "localhost"
> val port = 
> val dir = "/home/maowei/tmp"
> val ssc = StreamingContext.getOrCreate(dir,
>   () => {
> createContext(ip, port, dir)
>   })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7286) Precedence of operator not behaving properly

2016-03-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7286.

   Resolution: Fixed
 Assignee: Jakob Odersky
Fix Version/s: 2.0.0

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Assignee: Jakob Odersky
>Priority: Critical
> Fix For: 2.0.0
>
>
> The precedence of operators (especially !== and &&) on DataFrame Columns seems 
> to be wrong.
> Example snippet:
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine,
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"
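For reference, a minimal sketch of the behaviour (assuming a local SparkContext and an illustrative two-column DataFrame): because {{!==}} ends in {{=}}, Scala parses it with assignment-operator precedence, i.e. more loosely than {{&&}}, so explicit parentheses are needed.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PrecedenceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PrecedenceSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("val1", "val2"), ("val1", "other"))).toDF("col1", "col2")

    // Parenthesized form: keeps rows where col1 = "val1" and col2 <> "val2".
    df.where($"col1" === "val1" && ($"col2" !== "val2")).show()

    // Unparenthesized form: because !== ends in '=', Scala gives it
    // assignment-operator (lowest) precedence, so the expression groups as
    // (($"col1" === "val1") && $"col2") !== "val2".
    val unparenthesized = $"col1" === "val1" && $"col2" !== "val2"
    println(unparenthesized)   // inspect the parsed Column expression rather than running it

    sc.stop()
  }
}
{code}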



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13759) Add IsNotNull constraints for expressions with an inequality

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13759:


Assignee: Apache Spark

> Add IsNotNull constraints for expressions with an inequality
> 
>
> Key: SPARK-13759
> URL: https://issues.apache.org/jira/browse/SPARK-13759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Minor
>
> Support for adding `IsNotNull` constraints from expressions with an 
> inequality. More specifically, if an operator has a condition on `a !== b`, 
> we know that both `a` and `b` in the operator output can no longer be null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13759) Add IsNotNull constraints for expressions with an inequality

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186307#comment-15186307
 ] 

Apache Spark commented on SPARK-13759:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/11594

> Add IsNotNull constraints for expressions with an inequality
> 
>
> Key: SPARK-13759
> URL: https://issues.apache.org/jira/browse/SPARK-13759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
>
> Support for adding `IsNotNull` constraints from expressions with an 
> inequality. More specifically, if an operator has a condition on `a !== b`, 
> we know that both `a` and `b` in the operator output can no longer be null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13759) Add IsNotNull constraints for expressions with an inequality

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13759:


Assignee: (was: Apache Spark)

> Add IsNotNull constraints for expressions with an inequality
> 
>
> Key: SPARK-13759
> URL: https://issues.apache.org/jira/browse/SPARK-13759
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
>
> Support for adding `IsNotNull` constraints from expressions with an 
> inequality. More specifically, if an operator has a condition on `a !== b`, 
> we know that both `a` and `b` in the operator output can no longer be null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13759) Add IsNotNull constraints for expressions with an inequality

2016-03-08 Thread Sameer Agarwal (JIRA)
Sameer Agarwal created SPARK-13759:
--

 Summary: Add IsNotNull constraints for expressions with an 
inequality
 Key: SPARK-13759
 URL: https://issues.apache.org/jira/browse/SPARK-13759
 Project: Spark
  Issue Type: Sub-task
Reporter: Sameer Agarwal
Priority: Minor


Support for adding `IsNotNull` constraints from expressions with an inequality. 
More specifically, if an operator has a condition on `a !== b`, we know that 
both `a` and `b` in the operator output can no longer be null.
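To make the constraint concrete, a hedged sketch (illustrative DataFrame and column names; the inference itself happens inside the Catalyst optimizer, not in user code):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object IsNotNullSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IsNotNullSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((1, 2), (3, 3))).toDF("a", "b")

    // A row can only satisfy a <> b when both a and b are non-null, so the optimizer
    // may add IsNotNull(a) and IsNotNull(b) to this filter's constraints.
    val filtered = df.filter($"a" !== $"b")
    filtered.explain(true)   // inspect the analyzed/optimized plans for the inferred null checks

    sc.stop()
  }
}
{code}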



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13754) Keep old data source name for backwards compatibility

2016-03-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-13754:
-
Assignee: Hossein Falaki

> Keep old data source name for backwards compatibility
> -
>
> Key: SPARK-13754
> URL: https://issues.apache.org/jira/browse/SPARK-13754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>
> This data source was contributed by Databricks. It is the inlined version of 
> https://github.com/databricks/spark-csv. The data source name was 
> `com.databricks.spark.csv`. As a result there are many tables created on 
> older versions of Spark with that name as the source. For backwards 
> compatibility we should keep the old name.
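A hedged sketch of what keeping the alias means for users (SQLContext-based, with an illustrative path; the short name {{csv}} is assumed to be the new built-in registration of the inlined source):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CsvAliasSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CsvAliasSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Legacy name used by tables created with the external spark-csv package.
    val legacy = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/tmp/people.csv")          // illustrative path

    // Built-in short name; with the alias in place, both should resolve to the same implementation.
    val builtIn = sqlContext.read
      .format("csv")
      .option("header", "true")
      .load("/tmp/people.csv")

    println(legacy.schema == builtIn.schema)
    sc.stop()
  }
}
{code}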



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13754) Keep old data source name for backwards compatibility

2016-03-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13754.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11589
[https://github.com/apache/spark/pull/11589]

> Keep old data source name for backwards compatibility
> -
>
> Key: SPARK-13754
> URL: https://issues.apache.org/jira/browse/SPARK-13754
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Hossein Falaki
> Fix For: 2.0.0
>
>
> This data source was contributed by Databricks. It is the inlined version of 
> https://github.com/databricks/spark-csv. The data source name was 
> `com.databricks.spark.csv`. As a result there are many tables created on 
> older versions of Spark with that name as the source. For backwards 
> compatibility we should keep the old name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-13750.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11590
[https://github.com/apache/spark/pull/11590]

> Fix sizeInBytes for HadoopFSRelation
> 
>
> Key: SPARK-13750
> URL: https://issues.apache.org/jira/browse/SPARK-13750
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 2.0.0
>
>
> [~davies] reports that {{sizeInBytes}} isn't correct anymore.  We should fix 
> that and make sure there is a test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13758) Error message is misleading when RDD refer to null spark context

2016-03-08 Thread Mao, Wei (JIRA)
Mao, Wei created SPARK-13758:


 Summary: Error message is misleading when RDD refer to null spark 
context
 Key: SPARK-13758
 URL: https://issues.apache.org/jira/browse/SPARK-13758
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Reporter: Mao, Wei


We have a recoverable Spark Streaming job with checkpointing enabled. It runs 
correctly the first time, but throws the following exception when restarted and 
recovered from the checkpoint.
{noformat}
org.apache.spark.SparkException: RDD transformations and actions can only be 
invoked by the driver, not inside of other transformations; for example, 
rdd1.map(x => rdd2.values.count() * x) is invalid because the values 
transformation and count action cannot be performed inside of the rdd1.map 
transformation. For more information, see SPARK-5063.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
at org.apache.spark.rdd.RDD.union(RDD.scala:565)
at 
org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
at 
org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
...
{noformat}
According to the exception, I invoked transformations and actions inside other 
transformations, but I did not. The real reason is that I used an external RDD in 
a DStream operation. External RDD data is not stored in the checkpoint, so during 
recovery the initial value of _sc in this RDD is null, which triggers the 
exception above. The error message is misleading: it says nothing about the real 
issue.

Here is the code to reproduce it.
{code:java}
object Repo {

  def createContext(ip: String, port: Int, checkpointDirectory: 
String):StreamingContext = {

println("Creating new context")
val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint(checkpointDirectory)

var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))

val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
words.foreachRDD((rdd: RDD[String]) => {
  val res = rdd.map(word => (word, word.length)).collect()
  println("words: " + res.mkString(", "))

  cached = cached.union(rdd)
  cached.checkpoint()
  println("cached words: " + cached.collect.mkString(", "))
})
ssc
  }

  def main(args: Array[String]) {

val ip = "localhost"
val port = 
val dir = "/home/maowei/tmp"

val ssc = StreamingContext.getOrCreate(dir,
  () => {
createContext(ip, port, dir)
  })
ssc.start()
ssc.awaitTermination()
  }
}
{code}
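As a hedged aside (not part of the report, and it drops the cross-batch accumulation of {{cached}}): one pattern that avoids capturing a pre-recovery RDD is to rebuild driver-side RDDs from the SparkContext attached to each batch's RDD, sketched below with illustrative host, port and checkpoint directory.

{code}
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RepoWorkaround {

  def createContext(ip: String, port: Int, checkpointDirectory: String): StreamingContext = {
    val sparkConf = new SparkConf().setAppName("RepoWorkaround").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint(checkpointDirectory)

    val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
    words.foreachRDD { (rdd: RDD[String]) =>
      // Rebuild the driver-side RDD from the SparkContext attached to this batch's RDD,
      // instead of closing over an RDD created before checkpoint recovery.
      val seed = rdd.sparkContext.parallelize(Seq("apple, banana"))
      val merged = seed.union(rdd)
      println("words: " + rdd.collect().mkString(", "))
      println("seed + words: " + merged.collect().mkString(", "))
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    val dir = "/tmp/repo-workaround-checkpoint"   // illustrative checkpoint directory
    val ssc = StreamingContext.getOrCreate(dir, () => createContext("localhost", 9999, dir))
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}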




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13625) PySpark-ML method to get list of params for an obj should not check property attr

2016-03-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13625.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11476
[https://github.com/apache/spark/pull/11476]

> PySpark-ML method to get list of params for an obj should not check property 
> attr
> -
>
> Key: SPARK-13625
> URL: https://issues.apache.org/jira/browse/SPARK-13625
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.0.0
>
>
> In PySpark params.__init__.py, the method {{Param.params()}} returns a list 
> of Params belonging to that object.  This method should not check an 
> attribute to be an instance of {{Param}} if it is a property (uses the 
> {{@property}} decorator).  This causes the property to be invoked to 'get' 
> the attribute, and that can lead to an error, depending on the property.  If 
> an attribute is a property it will not be an ML {{Param}}, so no need to 
> check it.
> I came across this in working on SPARK-13430 while adding 
> {{LinearRegressionModel.summary}} as a property to give a training summary, 
> similar to the Scala API.  It is possible that a training summary does not 
> exist and will then raise an exception if the {{summary}} property is 
> invoked.  
> Calling {{getattr(self, x)}} will cause the property to be invoked if {{x}} 
> is a property.  To fix this, we just need to check if it is a class property before 
> making the call to {{getattr()}} in {{Param.params()}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException

2016-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186276#comment-15186276
 ] 

Hyukjin Kwon edited comment on SPARK-13719 at 3/9/16 1:33 AM:
--

 [~rxin] Actually, don't we also need parse modes such as {{DROPMALFORMED}} or 
{{PERMISSIVE}} for the JSON data source, so that malformed rows can be handled 
without filtering or pre-processing on the user side?


was (Author: hyukjin.kwon):
 [~rxin] Actually, don't we also need modes such as {{DROPMALFORMED}} or 
{{PERMISSIVE}} for the JSON data source as well?

> Bad JSON record raises java.lang.ClassCastException
> 
>
> Key: SPARK-13719
> URL: https://issues.apache.org/jira/browse/SPARK-13719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: OS X, Linux
>Reporter: dmtran
>Priority: Minor
>
> I have defined a JSON schema, using org.apache.spark.sql.types.StructType, 
> that expects this kind of record :
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> There's a bad record in my dataset that defines field "user" as an array 
> instead of a JSON object :
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> The following exception is raised because of that bad record :
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Here's a code snippet that reproduces the exception :
> {noformat}
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{SQLContext, DataFrame}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> object Snippet {
>   def main(args : Array[String]): Unit = {
> val sc = new SparkContext()
> implicit val sqlContext = new HiveContext(sc)
> val rdd: RDD[String] = sc.parallelize(Seq(badRecord))
> val df: DataFrame = 

[jira] [Comment Edited] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException

2016-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186276#comment-15186276
 ] 

Hyukjin Kwon edited comment on SPARK-13719 at 3/9/16 1:34 AM:
--

 [~rxin] Actually, don't we also need parse modes such as {{DROPMALFORMED}} or 
{{PERMISSIVE}}, as in the CSV data source, for the JSON data source as well, so 
that malformed rows can be handled without filtering or pre-processing on the 
user side?


was (Author: hyukjin.kwon):
 [~rxin] Actually, don't we also need parse modes such as {{DROPMALFORMED}} or 
{{PERMISSIVE}} for the JSON data source, so that malformed rows can be handled 
without filtering or pre-processing on the user side?

> Bad JSON record raises java.lang.ClassCastException
> 
>
> Key: SPARK-13719
> URL: https://issues.apache.org/jira/browse/SPARK-13719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: OS X, Linux
>Reporter: dmtran
>Priority: Minor
>
> I have defined a JSON schema, using org.apache.spark.sql.types.StructType, 
> that expects this kind of record :
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> There's a bad record in my dataset that defines field "user" as an array 
> instead of a JSON object :
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> The following exception is raised because of that bad record :
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Here's a code snippet that reproduces the exception :
> {noformat}
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{SQLContext, DataFrame}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> object Snippet {
>   def main(args : Array[String]): Unit = {
> val sc = new SparkContext()
> implicit val sqlContext = new HiveContext(sc)
> 

[jira] [Commented] (SPARK-13719) Bad JSON record raises java.lang.ClassCastException

2016-03-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186276#comment-15186276
 ] 

Hyukjin Kwon commented on SPARK-13719:
--

 [~rxin] Actually, don't we also need modes such as {{DROPMALFORMED}} or 
{{PERMISSIVE}} for the JSON data source as well?
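For comparison, a hedged sketch of the existing CSV behaviour being referred to (option name and values as exposed by the spark-csv-style reader; the JSON equivalent is only being proposed here, and the path is illustrative):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DropMalformedSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DropMalformedSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // CSV reader: rows that do not match the schema are silently dropped under DROPMALFORMED.
    val csv = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .load("/tmp/input.csv")           // illustrative path

    println(csv.count())
    sc.stop()
  }
}
{code}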

> Bad JSON record raises java.lang.ClassCastException
> 
>
> Key: SPARK-13719
> URL: https://issues.apache.org/jira/browse/SPARK-13719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
> Environment: OS X, Linux
>Reporter: dmtran
>Priority: Minor
>
> I have defined a JSON schema, using org.apache.spark.sql.types.StructType, 
> that expects this kind of record :
> {noformat}
> {
>   "request": {
> "user": {
>   "id": 123
> }
>   }
> }
> {noformat}
> There's a bad record in my dataset that defines field "user" as an array 
> instead of a JSON object :
> {noformat}
> {
>   "request": {
> "user": []
>   }
> }
> {noformat}
> The following exception is raised because of that bad record :
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: 
> Lost task 7.3 in stage 0.0 (TID 10, 192.168.1.170): 
> java.lang.ClassCastException: org.apache.spark.sql.types.GenericArrayData 
> cannot be cast to org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:50)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:247)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:67)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:117)
>   at 
> org.apache.spark.sql.execution.Filter$$anonfun$4$$anonfun$apply$4.apply(basicOperators.scala:115)
>   at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:390)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:97)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Here's a code snippet that reproduces the exception :
> {noformat}
> import org.apache.spark.SparkContext
> import org.apache.spark.rdd.RDD
> import org.apache.spark.sql.{SQLContext, DataFrame}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> object Snippet {
>   def main(args : Array[String]): Unit = {
> val sc = new SparkContext()
> implicit val sqlContext = new HiveContext(sc)
> val rdd: RDD[String] = sc.parallelize(Seq(badRecord))
> val df: DataFrame = sqlContext.read.schema(schema).json(rdd)
> import sqlContext.implicits._
> df.select("request.user.id")
>   .filter($"id".isNotNull)
>   .count()
>   }
>   val badRecord =
> s"""{
> |  "request": {
> |"user": []
> |  }
> |}""".stripMargin.replaceAll("\n", " ") // 

[jira] [Commented] (SPARK-13660) CommitFailureTestRelationSuite floods the logs with garbage

2016-03-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186245#comment-15186245
 ] 

Shixiong Zhu commented on SPARK-13660:
--

Just go ahead!

> CommitFailureTestRelationSuite floods the logs with garbage
> ---
>
> Key: SPARK-13660
> URL: https://issues.apache.org/jira/browse/SPARK-13660
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>  Labels: starter
>
> https://github.com/apache/spark/pull/11439 added a utility method 
> "testQuietly". We can use it for CommitFailureTestRelationSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13593) improve the `createDataFrame` method to accept data type string and verify the data

2016-03-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13593:
-
Assignee: Wenchen Fan

> improve the `createDataFrame` method to accept data type string and verify 
> the data
> ---
>
> Key: SPARK-13593
> URL: https://issues.apache.org/jira/browse/SPARK-13593
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13757) support quoted column names in schema string at types.py#_parse_datatype_string

2016-03-08 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-13757:
---

 Summary: support quoted column names in schema string at 
types.py#_parse_datatype_string
 Key: SPARK-13757
 URL: https://issues.apache.org/jira/browse/SPARK-13757
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13755) Escape quotes in SQL plan visualization node labels

2016-03-08 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-13755.

   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 11587
[https://github.com/apache/spark/pull/11587]

> Escape quotes in SQL plan visualization node labels
> ---
>
> Key: SPARK-13755
> URL: https://issues.apache.org/jira/browse/SPARK-13755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0, 1.6.2
>
>
> When generating Graphviz DOT files in the SQL query visualization we need to 
> escape double-quotes inside node labels.
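A minimal sketch of the escaping itself (plain string handling, not the actual Spark UI code):

{code}
object DotLabelEscapeSketch {
  // Escape embedded double quotes so the label can sit inside a quoted DOT attribute.
  def escapeLabel(label: String): String = label.replaceAllLiterally("\"", "\\\"")

  def main(args: Array[String]): Unit = {
    val label = """Filter (name = "Test")"""
    val node = s"""node0 [label="${escapeLabel(label)}"];"""
    println(node)   // node0 [label="Filter (name = \"Test\")"];
  }
}
{code}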



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13728) Fix ORC PPD

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186166#comment-15186166
 ] 

Apache Spark commented on SPARK-13728:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/11593

> Fix ORC PPD
> ---
>
> Key: SPARK-13728
> URL: https://issues.apache.org/jira/browse/SPARK-13728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hyukjin Kwon
>
> Fix the ignored test "Enable ORC PPD" in OrcQuerySuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13728) Fix ORC PPD

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13728:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Fix ORC PPD
> ---
>
> Key: SPARK-13728
> URL: https://issues.apache.org/jira/browse/SPARK-13728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Hyukjin Kwon
>
> Fix the ignored test "Enable ORC PPD" in OrcQuerySuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13728) Fix ORC PPD

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13728:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Fix ORC PPD
> ---
>
> Key: SPARK-13728
> URL: https://issues.apache.org/jira/browse/SPARK-13728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>
> Fix the ignored test "Enable ORC PPD" in OrcQuerySuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13242) Moderately complex `when` expression causes code generation failure

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186128#comment-15186128
 ] 

Apache Spark commented on SPARK-13242:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11592

> Moderately complex `when` expression causes code generation failure
> ---
>
> Key: SPARK-13242
> URL: https://issues.apache.org/jira/browse/SPARK-13242
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Joe Halliwell
>
> Moderately complex `when` expressions produce generated code that busts the 
> 64KB method limit. This causes code generation to fail.
> Here's a test case exhibiting the problem: 
> https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d
> I'm interested in working on a fix. I'm thinking it may be possible to split 
> the expressions along the lines of SPARK-8443, but any pointers would be 
> welcome!
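To make the failure mode concrete, a hedged sketch of programmatically building a wide when/otherwise chain (column name and branch count are illustrative; the exact point at which the generated code exceeds 64KB depends on the expressions involved):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, lit, when}

object WideWhenSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WideWhenSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(1 to 1000).toDF("x")

    // Build when(x === 1, ...).when(x === 2, ...)...otherwise(...): every branch adds to the
    // single generated eval method, which is what eventually exceeds the JVM's 64KB limit.
    val branches = (1 to 300).map(i => (col("x") === i, lit(s"bucket-$i")))
    val wide = branches.tail
      .foldLeft(when(branches.head._1, branches.head._2)) { case (acc, (cond, value)) =>
        acc.when(cond, value)
      }
      .otherwise(lit("other"))

    df.select(wide.as("bucket")).show(5)
    sc.stop()
  }
}
{code}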



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13660) CommitFailureTestRelationSuite floods the logs with garbage

2016-03-08 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186103#comment-15186103
 ] 

Xin Ren commented on SPARK-13660:
-

Hi Shixiong I'd like to have a try on this one :)

> CommitFailureTestRelationSuite floods the logs with garbage
> ---
>
> Key: SPARK-13660
> URL: https://issues.apache.org/jira/browse/SPARK-13660
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Reporter: Shixiong Zhu
>  Labels: starter
>
> https://github.com/apache/spark/pull/11439 added a utility method 
> "testQuietly". We can use it for CommitFailureTestRelationSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-13668:
-
Assignee: Sameer Agarwal

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Assignee: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> If a filter predicate or a join condition consists of `IsNotNull` checks, we 
> should reorder these checks such that these non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite this as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]
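A hedged sketch of the reordering at the level of user-visible Column expressions (illustrative DataFrame; the actual rewrite described above happens during physical planning, not in user code):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PredicateOrderSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PredicateOrderSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq((6, Option("x")), (7, None: Option[String]))).toDF("a", "b")

    // As written by the user: the comparison appears before the null check.
    val original = df.filter($"a" > 5 && $"b".isNotNull)

    // After the described reordering: the cheap non-nullability check short-circuits first.
    val reordered = df.filter($"b".isNotNull && $"a" > 5)

    original.explain()    // compare the physical plans of the two equivalent filters
    reordered.explain()
    sc.stop()
  }
}
{code}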



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13668) Reorder filter/join predicates to short-circuit isNotNull checks

2016-03-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-13668.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11511
[https://github.com/apache/spark/pull/11511]

> Reorder filter/join predicates to short-circuit isNotNull checks
> 
>
> Key: SPARK-13668
> URL: https://issues.apache.org/jira/browse/SPARK-13668
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Sameer Agarwal
>Priority: Minor
> Fix For: 2.0.0
>
>
> If a filter predicate or a join condition consists of `IsNotNull` checks, we 
> should reorder these checks such that these non-nullability checks are 
> evaluated before the rest of the predicates.
> For example, if a filter predicate is of the form `a > 5 && isNotNull(b)`, we 
> should rewrite this as `isNotNull(b) && a > 5` during physical plan 
> generation.
> cc [~nongli] [~yhuai]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2016-03-08 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186074#comment-15186074
 ] 

Gayathri Murali commented on SPARK-12664:
-

[~yanboliang] Are you working on this? If not, Can I work on this?

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages

2016-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186058#comment-15186058
 ] 

Stavros Kontopoulos edited comment on SPARK-13744 at 3/8/16 11:27 PM:
--

I understand that; my question is the following. I ran the same example on 
master (not 1.6, though I will) along with a few more counts (see the stages 
image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show an input of 216 MB? Shouldn't it already be 359 MB? 
The second count should cache everything, right?


was (Author: skonto):
I understand that; my question is the following. I ran the same example on 
master (not 1.6) along with a few more counts (see the stages image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show an input of 216 MB? Shouldn't it already be 359 MB? 
The second count should cache everything, right?

> Dataframe RDD caching increases the input size for subsequent stages
> 
>
> Key: SPARK-13744
> URL: https://issues.apache.org/jira/browse/SPARK-13744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
> Environment: OSX
>Reporter: Justin Pihony
>Priority: Minor
> Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png, stages.png
>
>
> Given the code below, you will see that the first run of count shows up as 
> ~90KB, and even the next run, with cache set, results in the same input size. 
> However, every subsequent run results in an input size that is MUCH larger 
> (500MB is listed as 38% for a default run). This size discrepancy seems to be 
> a bug in the caching of a DataFrame's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name:String ="Test", number:Double = 1000.2)
> val people = sc.parallelize(1 to 1000,50).map { p => Person()}.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}
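As a hedged aside on the repro (an alternative caching pattern, not an explanation of the reported size discrepancy; paths and names are illustrative): caching the DataFrame itself keeps subsequent counts against the same in-memory columnar cache.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class PersonSketch(name: String = "Test", number: Double = 1000.2)

object DataFrameCacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameCacheSketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val people = sc.parallelize(1 to 1000, 50).map(_ => PersonSketch()).toDF()
    people.write.parquet("/tmp/people-sketch.parquet")   // illustrative path

    val parquetFile = sqlContext.read.parquet("/tmp/people-sketch.parquet")
    parquetFile.cache()            // in-memory columnar cache for the DataFrame itself
    println(parquetFile.count())   // first count materializes the cache
    println(parquetFile.count())   // later counts are served from the cached columnar data
    sc.stop()
  }
}
{code}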



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages

2016-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186058#comment-15186058
 ] 

Stavros Kontopoulos edited comment on SPARK-13744 at 3/8/16 11:27 PM:
--

I understand that; my question is the following. I ran the same example on 
master (not 1.6) along with a few more counts (see the stages image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show an input of 216 MB? Shouldn't it already be 359 MB? 
The second count should cache everything, right?


was (Author: skonto):
I understand that; my question is the following. I ran the same example on 
master along with a few more counts (see the stages image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show an input of 216 MB? Shouldn't it already be 359 MB? 
The second count should cache everything, right?

> Dataframe RDD caching increases the input size for subsequent stages
> 
>
> Key: SPARK-13744
> URL: https://issues.apache.org/jira/browse/SPARK-13744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
> Environment: OSX
>Reporter: Justin Pihony
>Priority: Minor
> Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png, stages.png
>
>
> Given the code below, you will see that the first run of count shows up as 
> ~90KB, and even the next run, with cache set, results in the same input size. 
> However, every subsequent run results in an input size that is MUCH larger 
> (500MB is listed as 38% for a default run). This size discrepancy seems to be 
> a bug in the caching of a DataFrame's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name:String ="Test", number:Double = 1000.2)
> val people = sc.parallelize(1 to 1000,50).map { p => Person()}.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13578) Make launcher lib and user scripts handle jar directories instead of single assembly file

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13578:


Assignee: (was: Apache Spark)

> Make launcher lib and user scripts handle jar directories instead of single 
> assembly file
> -
>
> Key: SPARK-13578
> URL: https://issues.apache.org/jira/browse/SPARK-13578
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> See parent bug for details. This step is necessary before we can remove the 
> assembly from the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13578) Make launcher lib and user scripts handle jar directories instead of single assembly file

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186057#comment-15186057
 ] 

Apache Spark commented on SPARK-13578:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/11591

> Make launcher lib and user scripts handle jar directories instead of single 
> assembly file
> -
>
> Key: SPARK-13578
> URL: https://issues.apache.org/jira/browse/SPARK-13578
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> See parent bug for details. This step is necessary before we can remove the 
> assembly from the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages

2016-03-08 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186058#comment-15186058
 ] 

Stavros Kontopoulos commented on SPARK-13744:
-

I understand that; my question is the following. I ran the same example on 
master along with a few more counts (see the stages image attached):

parquetFile.rdd.count()
parquetFile.rdd.cache()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()
parquetFile.rdd.count()

Why does the third count show an input of 216 MB? Shouldn't it already be 359 MB? 
The second count should cache everything, right?

> Dataframe RDD caching increases the input size for subsequent stages
> 
>
> Key: SPARK-13744
> URL: https://issues.apache.org/jira/browse/SPARK-13744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
> Environment: OSX
>Reporter: Justin Pihony
>Priority: Minor
> Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png, stages.png
>
>
> Given the code below, you will see that the first run of count shows up as 
> ~90KB, and even the next run, with cache set, results in the same input size. 
> However, every subsequent run results in an input size that is MUCH larger 
> (500MB is listed as 38% for a default run). This size discrepancy seems to be 
> a bug in the caching of a DataFrame's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name:String ="Test", number:Double = 1000.2)
> val people = sc.parallelize(1 to 1000,50).map { p => Person()}.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13578) Make launcher lib and user scripts handle jar directories instead of single assembly file

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13578:


Assignee: Apache Spark

> Make launcher lib and user scripts handle jar directories instead of single 
> assembly file
> -
>
> Key: SPARK-13578
> URL: https://issues.apache.org/jira/browse/SPARK-13578
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> See parent bug for details. This step is necessary before we can remove the 
> assembly from the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13750:


Assignee: Apache Spark  (was: Davies Liu)

> Fix sizeInBytes for HadoopFSRelation
> 
>
> Key: SPARK-13750
> URL: https://issues.apache.org/jira/browse/SPARK-13750
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> [~davies] reports that {{sizeInBytes}} isn't correct anymore.  We should fix 
> that and make sure there is a test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13750:


Assignee: Davies Liu  (was: Apache Spark)

> Fix sizeInBytes for HadoopFSRelation
> 
>
> Key: SPARK-13750
> URL: https://issues.apache.org/jira/browse/SPARK-13750
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
>
> [~davies] reports that {{sizeInBytes}} isn't correct anymore.  We should fix 
> that and make sure there is a test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13750) Fix sizeInBytes for HadoopFSRelation

2016-03-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15186055#comment-15186055
 ] 

Apache Spark commented on SPARK-13750:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11590

> Fix sizeInBytes for HadoopFSRelation
> 
>
> Key: SPARK-13750
> URL: https://issues.apache.org/jira/browse/SPARK-13750
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Davies Liu
>Priority: Blocker
>
> [~davies] reports that {{sizeInBytes}} isn't correct anymore.  We should fix 
> that and make sure there is a test case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13744) Dataframe RDD caching increases the input size for subsequent stages

2016-03-08 Thread Stavros Kontopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-13744:

Attachment: stages.png

> Dataframe RDD caching increases the input size for subsequent stages
> 
>
> Key: SPARK-13744
> URL: https://issues.apache.org/jira/browse/SPARK-13744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 1.6.0
> Environment: OSX
>Reporter: Justin Pihony
>Priority: Minor
> Attachments: Screen Shot 2016-03-08 at 10.35.51 AM.png, stages.png
>
>
> Given the code below, you will see that the first run of count shows up as 
> ~90KB, and even the next run, with cache set, results in the same input size. 
> However, every subsequent run results in an input size that is MUCH larger 
> (500MB is listed as 38% for a default run). This size discrepancy seems to be 
> a bug in the caching of a DataFrame's RDD as far as I can see.
> {code}
> import sqlContext.implicits._
> case class Person(name:String ="Test", number:Double = 1000.2)
> val people = sc.parallelize(1 to 1000,50).map { p => Person()}.toDF
> people.write.parquet("people.parquet")
> val parquetFile = sqlContext.read.parquet("people.parquet")
> parquetFile.rdd.count()
> parquetFile.rdd.cache()
> parquetFile.rdd.count()
> parquetFile.rdd.count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


