[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

2021-06-07 Thread Carlos Bribiescas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359064#comment-17359064
 ] 

Carlos Bribiescas commented on SPARK-3159:
--

[~xujiajin] I have filed this request, 
https://issues.apache.org/jira/browse/SPARK-34591, to address it, if you want 
to follow along.

> Check for reducible DecisionTree
> 
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Alessandro Solimando
>Priority: Minor
> Fix For: 2.4.0
>
> Attachments: image-2020-05-24-23-00-38-419.png
>
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same 
> prediction.  This happens since the splitting criterion (e.g., Gini) is not 
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be 
> improved even when both children would still output the same prediction 
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.
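The post-training reduction proposed above can be sketched outside MLlib. A minimal stand-alone illustration (the `Leaf`/`Internal` classes are hypothetical stand-ins, not Spark's actual tree node types):

```python
# Conceptual sketch of the proposed post-training reduction: if both
# children of a node are leaves that output the same prediction, the
# split is redundant at prediction time, so collapse the node into a
# single leaf. Plain-Python stand-ins, not Spark internals.
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    prediction: float


@dataclass
class Internal:
    left: "Node"
    right: "Node"


Node = Union[Leaf, Internal]


def reduce_tree(node: Node) -> Node:
    """Bottom-up pass merging sibling leaves with identical predictions."""
    if isinstance(node, Leaf):
        return node
    left = reduce_tree(node.left)
    right = reduce_tree(node.right)
    # Both children reduced to leaves with the same prediction:
    # replace the whole subtree with one leaf.
    if isinstance(left, Leaf) and isinstance(right, Leaf) \
            and left.prediction == right.prediction:
        return Leaf(left.prediction)
    return Internal(left, right)


# The inner split is redundant (both leaves predict 1.0) and gets merged.
tree = Internal(Internal(Leaf(1.0), Leaf(1.0)), Leaf(0.0))
print(reduce_tree(tree))
# Internal(left=Leaf(prediction=1.0), right=Leaf(prediction=0.0))
```

Note that the merge must run bottom-up: collapsing a pair of leaves can make their parent's sibling pair mergeable in turn, which the recursion above handles naturally.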



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3159) Check for reducible DecisionTree

2021-03-01 Thread Carlos Bribiescas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293294#comment-17293294
 ] 

Carlos Bribiescas edited comment on SPARK-3159 at 3/2/21, 2:20 AM:
---

Would setting `setCanMergeChildren` during construction disable this?  Or am I 
reading the PR wrong?


was (Author: cbribiescas):
Would setting `canMergeChildren` in construction disable this?  Or am i reading 
the PR wrong?







[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree

2021-03-01 Thread Carlos Bribiescas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293294#comment-17293294
 ] 

Carlos Bribiescas commented on SPARK-3159:
--

Would setting `canMergeChildren` during construction disable this?  Or am I 
reading the PR wrong?







[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2018-04-17 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441257#comment-16441257
 ] 

Carlos Bribiescas commented on SPARK-21063:
---

Any update or workarounds for this?

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>Priority: Major
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver from version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> returns empty result too.






[jira] [Comment Edited] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-25 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465
 ] 

Carlos Bribiescas edited comment on SPARK-22335 at 10/25/17 12:26 PM:
--

[~viirya] That is definitely a workaround.  And effectively union will 
sometimes work correctly for typed data, unlike RDD, which is also typed but 
works consistently.  I guess this is just a point where two different concepts 
are at odds.  I think this is what [~dongjoon] was getting at.  (maybe?)  

Suppose you're working with a traditional typed data structure, and it has an 
operation that works with case classes only sometimes, depending on how they 
were constructed.  For example, if you created the case class via reflection, 
then flatMap won't work correctly, for some reason that can be explained and 
documented.

I also understand that SQL does unions by column order, and that's how union 
traditionally works in that space.

So we have two things that aren't Spark, but that Spark is inspired by, that 
perform the same operation in slightly different ways.  Updating the 
documentation is definitely a good step toward making the API more usable, but 
ultimately I guess the decision is to go, in this case, with a more SQL-like 
approach rather than real typing.


was (Author: cbribiescas):
[~viirya] That is definitely a work around.  And effectively Union will 
sometimes work correctly for typed things.  Unlike in RDD which is also type, 
but works consistently.  I guess this is just a point where two different 
concepts are at ends.  I think this was what [~dongjoon] was getting at.  
(maybe?)  

Suppose you're working with a traditional typed data structure.  It has an 
operation that worked with case classes only sometimes, depending on how they 
were constructed.  Like, if you created this case class via reflection, then 
flat map won’t work correctly for some reason that can be explained and 
documented.

I also understand that SQL does unions by column order, and thats why thats how 
union traditionally works.

So we have two things which aren't spark, but which spark is inspired by that 
would do two things in a slightly different way.  Updating documentation is 
definitely a good step at making the api more useable, but ultimately I guess 
the decision is to go, in this case, with a more SQL-like approach rather than 
real typing.

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. To me this is "fine", since DFs 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it actually 
> gives the wrong result: if you try to access the members by name, it will 
> use the column order. Here is a reproducible case (2.2.0).
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
> result, which is logically inconsistent behavior
>   abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives 
> correct result
> {code}
> So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting values of type 
> `AB`.  More importantly, I think the issue is bigger when you consider that 
> it happens even if you read from parquet (as you would expect), and that 
> it's inconsistent when going to/from RDD.
> I imagine it's just lazily converting to the typed DS instead of doing so 
> initially.  So either that typing could be prioritized to happen before the 
> union, or the union of DFs could be done with column order taken into 
> account.  Again, this is speculation.

[jira] [Comment Edited] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-25 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465
 ] 

Carlos Bribiescas edited comment on SPARK-22335 at 10/25/17 12:21 PM:
--

[~viirya] That is definitely a workaround.  And effectively union will 
sometimes work correctly for typed data, unlike RDD, which is also typed but 
works consistently.  I guess this is just a point where two different concepts 
are at odds.  I think this is what [~dongjoon] was getting at.  (maybe?)  

Suppose you're working with a traditional typed data structure, and it has an 
operation that works with case classes only sometimes, depending on how they 
were constructed.  For example, if you created the case class via reflection, 
then flatMap won't work correctly, for some reason that can be explained and 
documented.

I also understand that SQL does unions by column order, and that's how union 
traditionally works.

So we have two things that aren't Spark, but that Spark is inspired by, that 
perform the same operation in slightly different ways.  Updating the 
documentation is definitely a good step toward making the API more usable, but 
ultimately I guess the decision is to go, in this case, with a more SQL-like 
approach rather than real typing.


was (Author: cbribiescas):
[~viirya] That is definitely a work around.  And effectively Union will 
sometimes work correctly for typed things.  Unlike in RDD which is also typed 
and also works consistently.

I guess this is just a point where two different concepts are at ends.  I think 
this was what [~dongjoon] was getting at.  The traditional notion of a DS/DF is 
at ends with the be implied ability of working with a typed data structure. 

 Consider you were working with any collection.  Then the api for that data 
structure had operations that worked with case classes only sometimes, 
depending on how they were constructed.  Like, if you created this case class 
via reflection, then flat map won’t work correctly for some reason that can be 
explained and documented.

Updating documentation is definitely a good step at making the api more 
useable, it’s just that my feeling is that a typed object is a typed no matter 
what.  I also understand that a Datasets implementation is strongly tied to row 
order as well.





[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-25 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465
 ] 

Carlos Bribiescas commented on SPARK-22335:
---

[~viirya] That is definitely a workaround.  And effectively union will 
sometimes work correctly for typed data, unlike RDD, which is also typed and 
works consistently.

I guess this is just a point where two different concepts are at odds.  I 
think this is what [~dongjoon] was getting at.  The traditional notion of a 
DS/DF is at odds with the implied ability to work with a typed data structure.

Consider working with any collection whose API has operations that work with 
case classes only sometimes, depending on how they were constructed.  For 
example, if you created the case class via reflection, then flatMap won't work 
correctly, for some reason that can be explained and documented.

Updating the documentation is definitely a good step toward making the API 
more usable; it's just that my feeling is that a typed object is typed no 
matter what.  I also understand that a Dataset's implementation is strongly 
tied to row order as well.







[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217497#comment-16217497
 ] 

Carlos Bribiescas commented on SPARK-22335:
---

I'm not sure I understand what you're asking.  I do agree that DS should be 
consistent with DF when possible, but in this case the more specific 
functionality (typing) doesn't apply to DF.  Sorry if I didn't answer your 
question; I didn't quite get what you were asking.  Can you clarify?

Here is another example that may help.

{code:java}

  case class AB(a : String, b : Int)

  val abDs = sc.parallelize(List(("aThing",0))).toDF("a", "b").as[AB]
  val baDs = sc.parallelize(List((0,"aThing"))).toDF("b", "a").as[AB]
  
  abDs.show() // works
  baDs.show() // works
  
  abDs.union(baDs).show() // Real error to do with types
  abDs.rdd.union(baDs.rdd).toDF().as[AB].show() // Works which is inconsistent 
with last statement IMO
{code}








[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset

2017-10-24 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216953#comment-16216953
 ] 

Carlos Bribiescas commented on SPARK-21043:
---

I really like this feature.  Is there a motivation not to replace union with 
this functionality, other than backwards compatibility?

> Add unionByName API to Dataset
> --
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> It would be useful to add unionByName which resolves columns by name, in 
> addition to the existing union (which resolves by position).
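The by-name resolution described above can be illustrated without Spark. A toy sketch on rows kept as tuples alongside a column-name list (`union_by_name` here is a hypothetical helper for illustration, not the Dataset API itself):

```python
# By-name union: rows from the right-hand relation are reordered to match
# the left-hand relation's column order before concatenating, so column
# names, not positions, determine alignment. Illustrative only.
def union_by_name(left_rows, left_cols, right_rows, right_cols):
    if set(left_cols) != set(right_cols):
        raise ValueError("union by name requires the same set of column names")
    # Reorder each right-hand row to the left-hand column order.
    remapped = [
        tuple(row[right_cols.index(c)] for c in left_cols)
        for row in right_rows
    ]
    return left_rows + remapped


ab = [("aThing", "bThing")]  # columns ("a", "b")
ba = [("bThing", "aThing")]  # same data, columns declared as ("b", "a")
print(union_by_name(ab, ["a", "b"], ba, ["b", "a"]))
# [('aThing', 'bThing'), ('aThing', 'bThing')]
```

Resolving by name this way makes the swapped-column case from SPARK-22335 come out aligned, which is exactly why the comments there point to unionByName as the fix.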






[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-24 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216950#comment-16216950
 ] 

Carlos Bribiescas commented on SPARK-22335:
---

I think if unionByName replaced union, then it would be a solution.  It's 
definitely a workaround... but as the API stands it feels like a bug, since 
Dataset is supposed to be typed.

Again, I suspect it has to do with the optimizer pushing the typing to a later 
step, after the union by column order happens.  If this is the root cause of 
the bug, I worry how else it manifests, that is, what other bugs it may 
cause.  I'll have to think about it a bit more.
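The positional behaviour under discussion can be mimicked in miniature. A sketch of what union-by-column-order amounts to, under the simplifying assumption that rows are plain tuples (this is an analogy, not Spark's plan-level implementation):

```python
# Positional union, as plain `union` behaves: rows are concatenated as-is
# and the left-hand schema is simply reasserted, so a right-hand relation
# whose columns were declared in a different order yields swapped field
# values.
def union_positional(left_rows, right_rows):
    return left_rows + right_rows  # no alignment by name


ab = [("aThing", "bThing")]  # intended schema: (a, b)
ba = [("bThing", "aThing")]  # same data, columns declared as (b, a)
rows = union_positional(ab, ba)

# Reading field `a` positionally from every row: the second row's "a" is
# really its "b" value, which is the inconsistency reported in this ticket.
print([row[0] for row in rows])  # ['aThing', 'bThing']
```

The second element should have been 'aThing' if alignment were by name, which is the behaviour the typed Dataset API leads users to expect.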









[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. To me this is "fine", since DFs aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result: if you try to access the members by name, it will use 
the column order. Here is a reproducible case (2.2.0).

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
should be correctly mapped by type, not by column order
  
  abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
Dataset[AB] should be correctly mapped by type, not by column order

   abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though 
columns were out of order.
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround 
for linked issue, slightly modified.  However this seems wrong since its 
supposed to be strongly typed
  
  baDs.rdd.toDF().as[AB].union(abDs).show()  // This however gives correct 
result, which is logically inconsistent behavior
  abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives correct 
result
{code}

So it's inconsistent and a bug IMO.  And I'm not sure that the suggested 
workaround is really fair, since I'm supposed to be getting values of type 
`AB`.  More importantly, I think the issue is bigger when you consider that it 
happens even if you read from parquet (as you would expect), and that it's 
inconsistent when going to/from RDD.

I imagine it's just lazily converting to the typed DS instead of doing so 
initially.  So either that typing could be prioritized to happen before the 
union, or the union of DFs could be done with column order taken into 
account.  Again, this is speculation.

  was:
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result. If you try to access the members by name, it will use 
the order. Heres is a reproducible case. 2.2.0

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed

  baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
{code}

So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting a value of type 
`AB`. More importantly, I think the issue is bigger when you consider that it 
happens even if you read from Parquet (as you would expect), and that it's 
inconsistent when going to/from an RDD.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that typing could be prioritized to happen before the 
union, or the DataFrame union could take column order into account. Again, 
this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a dataset which is supposed to be strongly typed it is actually 
> giving the wrong result. If you try to access the members by name, it will 
> use

[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Priority: Major  (was: Minor)

> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
>   abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed
>   
>   baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
> {code}
> So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
> workaround is really fair, since I'm supposed to be getting a value of type 
> `AB`. More importantly, I think the issue is bigger when you consider that 
> it happens even if you read from Parquet (as you would expect), and that 
> it's inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to the typed Dataset instead of doing 
> so up front. So either that typing could be prioritized to happen before the 
> union, or the DataFrame union could take column order into account. Again, 
> this is speculation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed

  baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
{code}

So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting a value of type 
`AB`. More importantly, I think the issue is bigger when you consider that it 
happens even if you read from Parquet (as you would expect), and that it's 
inconsistent when going to/from an RDD.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that typing could be prioritized to happen before the 
union, or the DataFrame union could take column order into account. Again, 
this is speculation.

  was:
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed

  baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
{code}

So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting a value of type 
`AB`. More importantly, I think the issue is bigger when you consider that it 
happens even if you read from Parquet (as you would expect), and that it's 
inconsistent when going to/from an RDD.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Priority: Minor
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b

[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed

  baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
{code}

So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting a value of type 
`AB`. More importantly, I think the issue is bigger when you consider that it 
happens even if you read from Parquet (as you would expect), and that it's 
inconsistent when going to/from an RDD.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.

  was:
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed
{code}

So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting a value of type 
`AB`.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Priority: Minor
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong re

[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed
{code}

So it's inconsistent and a bug, IMO. I'm also not sure that the suggested 
workaround is really fair, since I'm supposed to be getting a value of type 
`AB`.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.

  was:
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed
{code}

So it's inconsistent and a bug, IMO. And I'm not sure of a workaround if you 
get handed a DS witho

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Priority: Minor
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result

[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too

  baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround as in the linked issue, slightly modified; however, it seems wrong since this is supposed to be strongly typed
{code}

So it's inconsistent and a bug, IMO. And I'm not sure of a workaround if you 
get handed a DS witho

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.

  was:
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug, IMO.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Priority: Minor
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()  // This gives wrong result since a Dataset[AB] 
> should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
>   baDs.select("a","b").as[AB].union(abDs).show() // This is the same 
> workaround for linked issue, slightly modified.  However this seems wrong 
> since its supposed to be strongly typed
> {code}
> So

[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug, IMO.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.

  was:
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as this ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show()

  abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug, IMO.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Priority: Minor
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
>   
>   abDs.union(baDs).map(_.a).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order
>   abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
>   abDs.map(_.a).show() // This is correct too
> {code}
> So it's inconsistent and a bug, IMO.
> I imagine it's just lazily converting to the typed Dataset instead of doing 
> so up front. So either that could be prioritized, or the DataFrame union 
> could take column order into account. Again, this is speculation.






[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-22335:
--
Description: 
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as this ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show()

  abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug, IMO.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.

  was:
This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.
I see union uses column order for a DF. This to me is "fine" since they aren't 
typed.
However, for a Dataset, which is supposed to be strongly typed, it is actually 
giving the wrong result: if you try to access the members by name, it will use 
the order. Here is a reproducible case on Spark 2.2.0.

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

  abDf.union(baDf).show() // as this ticket states, it's "Not a problem"

  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]

  abDs.union(baDs).show()

  abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order

  abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
  abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent and a bug, IMO.

I imagine it's just lazily converting to the typed Dataset instead of doing so 
up front. So either that could be prioritized, or the DataFrame union could 
take column order into account. Again, this is speculation.


> Union for DataSet uses column order instead of types for union
> --
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Carlos Bribiescas
>Priority: Minor
>
> I see union uses column order for a DF. This to me is "fine" since they 
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it is 
> actually giving the wrong result: if you try to access the members by name, 
> it will use the order. Here is a reproducible case on Spark 2.2.0.
> {code:java}
>   case class AB(a : String, b : String)
>   val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
>   val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>   
>   abDf.union(baDf).show() // as this ticket states, its "Not a problem"
>   
>   val abDs = abDf.as[AB]
>   val baDs = baDf.as[AB]
>   
>   abDs.union(baDs).show()
>   
>   abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
> Dataset[AB] should be correctly mapped by type, not by column order
>abDs.union(baDs).rdd.take(2) // This also gives wrong result
>   baDs.map(_.a).show() // However, this gives the correct result, even though 
> columns were out of order.
>   abDs.map(_.a).show() // This is correct too
> {code}
> So its inconsistent and a bug IMO.  
> I imagine its just lazily converting to typed DS instead of initially.  So 
> either that could be prioritized or unioning of DF could be done with column 
> order taken into account.  Again, this is speculation..



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22335) Union for DataSet uses column order instead of types for union

2017-10-23 Thread Carlos Bribiescas (JIRA)
Carlos Bribiescas created SPARK-22335:
-

 Summary: Union for DataSet uses column order instead of types for 
union
 Key: SPARK-22335
 URL: https://issues.apache.org/jira/browse/SPARK-22335
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Carlos Bribiescas
Priority: Minor


This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.
I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.
However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped by type, not by column order

   abDs.union(baDs).rdd.take(2) // This also gives wrong result

  baDs.map(_.a).show() // However, this gives the correct result, even though 
columns were out of order.
  abDs.map(_.a).show() // This is correct too
{code}

So it's inconsistent, and a bug IMO.

I imagine it's just converting to the typed DS lazily instead of up front. So 
either that conversion could happen earlier, or unioning of DFs could take 
column order into account. Again, this is speculation.






[jira] [Issue Comment Deleted] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-20761:
--
Comment: was deleted

(was: This isn't quite the issue I'm facing, but solving this issue will fix my 
issue.  (probably)

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped by type, not by column order

  baDs.map(_.a).show() // However, this gives the correct result, even though 
columns were out of order.
  abDs.map(_.a).show() // So does this
{code}


So its inconsistent IMO.)

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
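A common workaround for the quoted example is to re-select the right-hand side's columns in the left schema's order before the union. The sketch below models that in plain Python (not Spark code; column names taken from the example above):

```python
# Toy model (plain Python, not Spark) of the workaround: realign the
# right-hand rows to the left schema by NAME before a positional union.

schema = ["id", "flag_one", "flag_two", "flag_three"]
rows = [("1", True, False, False)]

# "select id, flag_two, flag_three, flag_one" yields this column order:
sel_schema = ["id", "flag_two", "flag_three", "flag_one"]
sel_rows = [("1", False, False, True)]

# A plain positional union silently rotates the three boolean flags:
bad = rows + sel_rows
assert bad[1] == ("1", False, False, True)    # flags misaligned

# Workaround: reorder sel_rows to match `schema` by column name first.
idx = [sel_schema.index(c) for c in schema]
fixed = rows + [tuple(r[i] for i in idx) for r in sel_rows]
assert fixed[1] == ("1", True, False, False)  # flags back in place
```

In Spark itself this corresponds to something like `emptyData.union(other.select(emptyData.columns.map(col): _*))`, or, from Spark 2.3.0 on, `unionByName`.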






[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:18 PM:
-

This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.

I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}

  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped by type, not by column order

  baDs.map(_.a).show() // However, this gives the correct result, even though 
columns were out of order.
  abDs.map(_.a).show() // So does this
{code}


So it's inconsistent, IMO.


was (Author: cbribiescas):
This isn't quite the issue I'm facing, but solving this issue will fix my 
issue.  (probably)

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

  case class AB(a : String, b : String)
{code:java}

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped by type, not by column order

  baDs.map(_.a).show() // However, this gives the correct result, even though 
columns were out of order.
  abDs.map(_.a).show() // So does this
{code}


So its inconsistent IMO.

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++---

[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:18 PM:
-

This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.

I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}
  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped by type, not by column order

  baDs.map(_.a).show() // However, this gives the correct result, even though 
columns were out of order.
  abDs.map(_.a).show() // So does this
{code}


So it's inconsistent, IMO.


was (Author: cbribiescas):
This isn't quite the issue I'm facing, but solving this issue will fix my 
issue.  (probably)

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

  case class AB(a : String, b : String)
{code:java}

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped

  baDs.map(_.a).show() // However, this gives the correct result
  abDs.map(_.a).show() // So does this
{code}


So its inconsistent IMO.

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true|

[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:17 PM:
-

This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.

I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}
  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped

  baDs.map(_.a).show() // However, this gives the correct result
  abDs.map(_.a).show() // So does this
{code}


So it's inconsistent, IMO.


was (Author: cbribiescas):
This isn't quite the issue I'm facing, but solving this issue will fix my 
issue.  (probably)

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

  case class AB(a : String, b : String)
{code:java}

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped

baDs.map(_.a).show() // However, this gives the correct result
{code}


So its inconsistent IMO.

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> |  1|   false|   false|  true|
> |  2|true|   false|

[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:17 PM:
-

This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.

I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}
  case class AB(a : String, b : String)

  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped

baDs.map(_.a).show() // However, this gives the correct result
{code}


So it's inconsistent, IMO.


was (Author: cbribiescas):
This isn't quite the issue I'm facing, but solving this issue will fix my 
issue.  (probably)

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

  case class AB(a : String, b : String)
  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped
baDs.map(_.a).show() // However, this gives the correct result

So its inconsistent IMO.

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+---

[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:16 PM:
-

This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.

I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}
  case class AB(a : String, b : String)
  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped
  baDs.map(_.a).show() // However, this gives the correct result
{code}

So it's inconsistent, IMO.


was (Author: cbribiescas):
This isn't quite the issue I'm facing, but solving this issue will fix my 
issue.  (probably)

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

  case class AB(a : String, b : String)
  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}





[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:15 PM:
-

This isn't quite the issue I'm facing, but solving this issue will (probably) 
fix mine.

I see that union uses column order for a DF. That is "fine" to me, since DFs 
aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result. If you try to access the members by name, it will 
still use the column order. Here is a reproducible case on Spark 2.2.0:

{code:java}
  case class AB(a : String, b : String)
  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped
{code}


was (Author: cbribiescas):
This isn't quite the issue I'm facing, but solving this issue will fix my issue.

I see union uses column order for a DF.  This to me is "fine" since they aren't 
typed.  

However, for a dataset which is supposed to be strongly typed it is actually 
giving the wrong result.  If you try to access the members by name, it will use 
the order.   Heres is a reproducible case.  2.2.0

  case class AB(a : String, b : String)
  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, its "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // this gives wrong result since a 
Dataset[AB] should be correctly mapped

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}




[jira] [Commented] (SPARK-20761) Union uses column order rather than schema

2017-10-23 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282
 ] 

Carlos Bribiescas commented on SPARK-20761:
---

This isn't quite the issue I'm facing, but solving this issue will fix mine 
as well.

I see union uses column order for a DataFrame.  This to me is "fine" since 
DataFrames aren't typed.

However, for a Dataset, which is supposed to be strongly typed, it actually 
gives the wrong result: if you try to access the members by name, it still 
uses the positional order.  Here's a reproducible case (Spark 2.2.0):

  case class AB(a : String, b : String)
  val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
  val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
  
  abDf.union(baDf).show() // as this ticket states, it's "Not a problem"
  
  val abDs = abDf.as[AB]
  val baDs = baDf.as[AB]
  
  abDs.union(baDs).show()
  
  abDs.union(baDs).map(_.a).show() // wrong result: a Dataset[AB] should map 
_.a by field name, not by column position
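The typed symptom above can be mimicked without Spark using namedtuples (an 
analogy only, not how Datasets are implemented): decoding positionally-unioned 
rows into a typed record keeps the swapped values under the wrong field names.

```python
from collections import namedtuple

# AB mirrors the Scala case class AB(a: String, b: String).
AB = namedtuple("AB", ["a", "b"])

ab_rows = [("aThing", "bThing")]   # written in (a, b) order
ba_rows = [("bThing", "aThing")]   # written in (b, a) order

# Positional union followed by a positional decode into AB -- analogous to
# abDf.union(baDf).as[AB]: the second record's fields come out swapped.
unioned = [AB(*row) for row in ab_rows + ba_rows]

print([r.a for r in unioned])
# ['aThing', 'bThing'] -- the second .a is really the b value
```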

> Union uses column order rather than schema
> --
>
> Key: SPARK-20761
> URL: https://issues.apache.org/jira/browse/SPARK-20761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Nakul Jeirath
>Priority: Minor
>
> I believe there is an issue when using union to combine two dataframes when 
> the order of columns differ between the left and right side of the union:
> {code}
> import org.apache.spark.sql.{Row, SparkSession}
> import org.apache.spark.sql.types.{BooleanType, StringType, StructField, 
> StructType}
> val schema = StructType(Seq(
>   StructField("id", StringType, false),
>   StructField("flag_one", BooleanType, false),
>   StructField("flag_two", BooleanType, false),
>   StructField("flag_three", BooleanType, false)
> ))
> val rowRdd = spark.sparkContext.parallelize(Seq(
>   Row("1", true, false, false),
>   Row("2", false, true, false),
>   Row("3", false, false, true)
> ))
> spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags")
> val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], 
> schema)
> //Select columns out of order with respect to the emptyData schema
> val data = emptyData.union(spark.sql("select id, flag_two, flag_three, 
> flag_one from temp_flags"))
> {code}
> Selecting the data from the "temp_flags" table results in:
> {noformat}
> spark.sql("select * from temp_flags").show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> +---+++--+
> {noformat}
> Which is the data we'd expect but when inspecting "data" we get:
> {noformat}
> data.show()
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}
> Having a non-empty dataframe on the left side of the union doesn't seem to 
> make a difference either:
> {noformat}
> spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, 
> flag_three, flag_one from temp_flags")).show
> +---+++--+
> | id|flag_one|flag_two|flag_three|
> +---+++--+
> |  1|true|   false| false|
> |  2|   false|true| false|
> |  3|   false|   false|  true|
> |  1|   false|   false|  true|
> |  2|true|   false| false|
> |  3|   false|true| false|
> +---+++--+
> {noformat}




-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15703) Make ListenerBus event queue size configurable

2017-07-17 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090991#comment-16090991
 ] 

Carlos Bribiescas edited comment on SPARK-15703 at 7/18/17 2:15 AM:


Does this only affect the UI or will jobs actually not process correctly when 
this happens?


was (Author: cbribiescas):
Does this only affect the UI or will jobs actually not work correctly when this 
happens?

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, 
> SparkListenerBus .png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> it's completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool
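The mechanism behind these missing UI counts can be sketched with a bounded 
queue (plain Python; the class and event names are illustrative, not Spark's 
actual LiveListenerBus API): when the queue is full, new events are dropped 
rather than blocking the poster, so UI listeners silently never see them.

```python
from queue import Full, Queue

class BoundedListenerBus:
    """Toy model of a listener bus with a fixed-capacity event queue."""

    def __init__(self, capacity):
        self.events = Queue(maxsize=capacity)
        self.dropped = 0

    def post(self, event):
        try:
            self.events.put_nowait(event)  # never blocks the poster
        except Full:
            self.dropped += 1              # listeners silently miss this event

bus = BoundedListenerBus(capacity=3)
for task_id in range(10):
    bus.post(("SparkListenerTaskEnd", task_id))

print(bus.events.qsize(), bus.dropped)  # 3 7
```

Making the capacity configurable, as this ticket does, reduces how often the 
drop branch fires; it does not change the tasks themselves, which is why the 
logs show every task finishing even when the UI counts are wrong.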






[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable

2017-07-17 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090991#comment-16090991
 ] 

Carlos Bribiescas commented on SPARK-15703:
---

Does this only affect the UI or will jobs actually not work correctly when this 
happens?

> Make ListenerBus event queue size configurable
> --
>
> Key: SPARK-15703
> URL: https://issues.apache.org/jira/browse/SPARK-15703
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Web UI
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>Assignee: Dhruve Ashar
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot 
> 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, 
> SparkListenerBus .png
>
>
> The Spark UI doesn't seem to be showing all the tasks and metrics.
> I ran a job with 10 tasks but Detail stage page says it completed 93029:
> Summary Metrics for 93029 Completed Tasks
> The Stages for all jobs pages list that only 89519/10 tasks finished but 
> it's completed.  The metrics for shuffled write and input are also incorrect.
> I will attach screen shots.
> I checked the logs and it does show that all the tasks actually finished.
> 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 
> (TID 54038) in 265309 ms on 10.213.45.51 (10/10)
> 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool






[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-03-01 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683
 ] 

Carlos Bribiescas edited comment on SPARK-10795 at 3/1/16 2:08 PM:
---

Using this command to submit the job: {code}spark-submit --master yarn-cluster 
--num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 
MyPythonFile.py{code}

If MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not create a SparkContext 
(as one would skip doing in the interactive shell, where sc is pre-defined), 
then it gives me the error you describe.  Using the following file instead, 
I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a SparkContext properly for cluster mode.  
Hope this helps.

{code}Cluster Configuration
Release label:emr-4.2.0
Hadoop distribution:Amazon 2.6.0
Applications:SPARK 1.5.2, HIVE 1.0.0, HUE 3.7.1{code}



was (Author: cbribiescas):
Using this command to submit job {code}spark-submit --master yarn-cluster 
--num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 
MyPythonFile.py{code}

If MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not specify a spark context 
(As one would in the interactive shell) then it gives me the error you say in 
your bug.  Using the following file instead I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps


> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-03-01 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173723#comment-15173723
 ] 

Carlos Bribiescas commented on SPARK-10795:
---

Have you tried just creating the SparkContext and nothing else?  For example, 
if you set a master via the SparkContext but also passed one on the command 
line, I don't know what behavior to expect.  I suggest checking that before 
cutting up your code too much.

I do realize that there may be many other causes of this issue, so I don't 
mean to suggest that a missing SparkContext is the only one.  Just trying to 
rule this one cause out.
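One way to rule that cause out mechanically (a hypothetical pre-submit check, 
not part of Spark or spark-submit) is to scan the script for a SparkContext 
construction before submitting it:

```python
import ast

def creates_spark_context(source):
    """Return True if the script appears to construct a SparkContext."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            # Handle both SparkContext(...) and pyspark.SparkContext(...).
            name = fn.id if isinstance(fn, ast.Name) else getattr(fn, "attr", "")
            if name == "SparkContext":
                return True
    return False

good = 'from pyspark import SparkContext\nsc = SparkContext(appName="My Name")\n'
bad = 'from pyspark import SparkContext\n# sc = SparkContext(appName="My Name")\n'
print(creates_spark_context(good), creates_spark_context(bad))  # True False
```

The `bad` source reproduces the file shown earlier in this thread, where the 
SparkContext line is commented out and the cluster deployment fails.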

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip






[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-02-22 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683
 ] 

Carlos Bribiescas edited comment on SPARK-10795 at 2/22/16 8:58 PM:


Using this command to submit job {code}spark-submit --master yarn-cluster 
--num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 
MyPythonFile.py{code}

if MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not specify a spark context 
(As one would in the interactive shell) then it gives me the error you say in 
your bug.  Using the following file instead I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps



was (Author: cbribiescas):
Using this command spark-submit --master yarn-cluster --num-executors 1 
--driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py

if MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not specify a spark context 
(As one would in the interactive shell) then it gives me the error you say in 
your bug.  Using the following file instead I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps


> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip






[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-02-22 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683
 ] 

Carlos Bribiescas edited comment on SPARK-10795 at 2/22/16 8:58 PM:


Using this command to submit job {code}spark-submit --master yarn-cluster 
--num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 
MyPythonFile.py{code}

If MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not specify a spark context 
(As one would in the interactive shell) then it gives me the error you say in 
your bug.  Using the following file instead I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps



was (Author: cbribiescas):
Using this command to submit job {code}spark-submit --master yarn-cluster 
--num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 
MyPythonFile.py{code}

if MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not specify a spark context 
(As one would in the interactive shell) then it gives me the error you say in 
your bug.  Using the following file instead I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps


> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip






[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-02-22 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683
 ] 

Carlos Bribiescas edited comment on SPARK-10795 at 2/22/16 8:57 PM:


Using this command spark-submit --master yarn-cluster --num-executors 1 
--driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py

if MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  If MyPythonFile.py does not specify a spark context 
(As one would in the interactive shell) then it gives me the error you say in 
your bug.  Using the following file instead I'm able to reproduce the bug.

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps



was (Author: cbribiescas):
Using this command spark-submit --master yarn-cluster --num-executors 1 
--driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py

if MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.  

If MyPythonFile.py does not specify a spark context (As one would in the 
interactive shell) then it gives me the error you say in your bug.

However if I use this file instead

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps


> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip






[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-02-22 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683
 ] 

Carlos Bribiescas commented on SPARK-10795:
---

Using this command to submit the job: spark-submit --master yarn-cluster 
--num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 
MyPythonFile.py

If MyPythonFile.py looks like this
{code}
from pyspark import SparkContext

jobName="My Name"
sc = SparkContext(appName=jobName)

{code}
Then everything is fine.

If MyPythonFile.py does not create a SparkContext (as one would skip doing in 
the interactive shell, where sc is pre-defined), then it gives me the error 
you describe.

However, if I use this file instead, I'm able to reproduce the bug:

{code}
from pyspark import SparkContext

jobName="My Name"
# sc = SparkContext(appName=jobName)

{code}

So I suspect you just didn't define a spark context properly for a cluster.  
Hope this helps


> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip






[jira] [Issue Comment Deleted] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-02-22 Thread Carlos Bribiescas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Carlos Bribiescas updated SPARK-10795:
--
Comment: was deleted

(was: What is the command you use when this happens?  I had this issue 
previously, but only when using --py-files in my spark-submit, not 
otherwise.)

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip






[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster

2016-02-22 Thread Carlos Bribiescas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157612#comment-15157612
 ] 

Carlos Bribiescas commented on SPARK-10795:
---

What is the command you use when this happens?  I had this issue previously, 
but only when using --py-files in my spark-submit, not otherwise.

> FileNotFoundException while deploying pyspark job on cluster
> 
>
> Key: SPARK-10795
> URL: https://issues.apache.org/jira/browse/SPARK-10795
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: EMR 
>Reporter: Harshit
>
> I am trying to run simple spark job using pyspark, it works as standalone , 
> but while I deploy over cluster it fails.
> Events :
> 2015-09-24 10:38:49,602 INFO  [main] yarn.Client (Logging.scala:logInfo(59)) 
> - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> 
> hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> Above uploading resource file is successfull , I manually checked file is 
> present in above specified path , but after a while I face following error :
> Diagnostics: File does not exist: 
> hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip
> java.io.FileNotFoundException: File does not exist: 
> hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip


