[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17359064#comment-17359064 ] Carlos Bribiescas commented on SPARK-3159:
--
[~xujiajin] I have added this request as https://issues.apache.org/jira/browse/SPARK-34591 to address it, if you want to follow.

> Check for reducible DecisionTree
>
> Key: SPARK-3159
> URL: https://issues.apache.org/jira/browse/SPARK-3159
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Alessandro Solimando
> Priority: Minor
> Fix For: 2.4.0
>
> Attachments: image-2020-05-24-23-00-38-419.png
>
> Improvement: test-time computation
> Currently, pairs of leaf nodes with the same parent can both output the same
> prediction. This happens because the splitting criterion (e.g., Gini) is not
> the same as prediction accuracy/MSE; the splitting criterion can sometimes be
> improved even when both children would still output the same prediction
> (e.g., based on the majority label for classification).
> We could check the tree and reduce it if possible after training.
> Note: This happens with scikit-learn as well.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293294#comment-17293294 ] Carlos Bribiescas edited comment on SPARK-3159 at 3/2/21, 2:20 AM:
--
Would setting `setCanMergeChildren` in construction disable this? Or am I reading the PR wrong?

was (Author: cbribiescas): Would setting `canMergeChildren` in construction disable this? Or am I reading the PR wrong?
[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293294#comment-17293294 ] Carlos Bribiescas commented on SPARK-3159:
--
Would setting `canMergeChildren` in construction disable this? Or am I reading the PR wrong?
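The reduction the issue proposes can be sketched independently of MLlib. The `Tree`/`Leaf`/`Split` types below are hypothetical stand-ins, not MLlib's actual node classes: the pass recursively collapses any split whose children both reduce to leaves with the same prediction, which is exactly the test-time-only simplification the ticket describes.

```scala
// Hypothetical sketch of a post-training tree reduction pass.
// These case classes are illustrative only, not MLlib's Node API.
sealed trait Tree
case class Leaf(prediction: Double) extends Tree
case class Split(feature: Int, threshold: Double, left: Tree, right: Tree) extends Tree

def reduce(t: Tree): Tree = t match {
  case Split(f, th, l, r) =>
    (reduce(l), reduce(r)) match {
      // Both children reduce to leaves with the same prediction:
      // the split cannot change any prediction, so collapse it.
      case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1)
      case (l2, r2)                         => Split(f, th, l2, r2)
    }
  case leaf => leaf
}
```

For example, `reduce(Split(0, 0.5, Leaf(1.0), Leaf(1.0)))` collapses to `Leaf(1.0)`, while a split whose children predict different labels is left unchanged. Such a split can still improve the Gini criterion during training, which is why it gets created in the first place.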
[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster
[ https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441257#comment-16441257 ] Carlos Bribiescas commented on SPARK-21063:
--
Any update or workarounds for this?

> Spark return an empty result from remote hadoop cluster
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.1.0, 2.1.1
> Reporter: Peter Bykov
> Priority: Major
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying over JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet:
> {code:java}
> val spark = SparkSession.builder
>   .appName("RemoteSparkTest")
>   .master("local")
>   .getOrCreate()
>
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>
> df.show()
> {code}
> Result:
> {noformat}
> +-------------------+
> |test_table.test_col|
> +-------------------+
> +-------------------+
> {noformat}
> All manipulations like:
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.
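One commonly suggested alternative — offered here as an assumption, not a confirmed fix for this ticket — is to skip the JDBC source entirely and read the table through the Hive metastore, assuming the remote cluster's `hive-site.xml` is on the driver's classpath:

```scala
import org.apache.spark.sql.SparkSession

// Sketch under stated assumptions: hive-site.xml for the remote cluster
// is on the classpath, so Spark can resolve test_table via the metastore
// and read the underlying files directly, rather than through hive-jdbc.
val spark = SparkSession.builder
  .appName("RemoteSparkTest")
  .master("local")
  .enableHiveSupport()
  .getOrCreate()

spark.table("default.test_table").show()
```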
[jira] [Comment Edited] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465 ] Carlos Bribiescas edited comment on SPARK-22335 at 10/25/17 12:26 PM:
--
[~viirya] That is definitely a workaround, and effectively union will sometimes work correctly for typed things, unlike in RDD, which is also typed but works consistently. I guess this is just a point where two different concepts are at odds. I think this was what [~dongjoon] was getting at. (Maybe?) Suppose you're working with a traditional typed data structure, and it has an operation that works with case classes only sometimes, depending on how they were constructed. Like, if you created the case class via reflection, then flatMap won't work correctly for some reason that can be explained and documented. I also understand that SQL does unions by column order, and that's how union traditionally works in that space. So we have two things which aren't Spark, but which Spark is inspired by, that do the same operation in slightly different ways. Updating documentation is definitely a good step toward making the API more usable, but ultimately I guess the decision is to go, in this case, with a more SQL-like approach rather than real typing.

was (Author: cbribiescas): [~viirya] That is definitely a workaround, and effectively union will sometimes work correctly for typed things, unlike in RDD, which is also typed but works consistently. I guess this is just a point where two different concepts are at odds. I think this was what [~dongjoon] was getting at. (Maybe?) Suppose you're working with a traditional typed data structure, and it has an operation that works with case classes only sometimes, depending on how they were constructed. Like, if you created the case class via reflection, then flatMap won't work correctly for some reason that can be explained and documented. I also understand that SQL does unions by column order, and that's how union traditionally works. So we have two things which aren't Spark, but which Spark is inspired by, that do the same operation in slightly different ways. Updating documentation is definitely a good step toward making the API more usable, but ultimately I guess the decision is to go, in this case, with a more SQL-like approach rather than real typing.

> Union for DataSet uses column order instead of types for union
>
> Key: SPARK-22335
> URL: https://issues.apache.org/jira/browse/SPARK-22335
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Carlos Bribiescas
>
> I see union uses column order for a DF. This to me is "fine" since they
> aren't typed.
> However, for a Dataset, which is supposed to be strongly typed, it actually
> gives the wrong result. If you try to access the members by name, it will
> use the order. Here is a reproducible case on 2.2.0:
> {code:java}
> case class AB(a: String, b: String)
> val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
> val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")
>
> abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"
>
> val abDs = abDf.as[AB]
> val baDs = baDf.as[AB]
>
> abDs.union(baDs).show() // Wrong result: a Dataset[AB] should be mapped by type, not by column order
> abDs.union(baDs).map(_.a).show() // Wrong result for the same reason
> abDs.union(baDs).rdd.take(2) // This also gives the wrong result
>
> baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
> abDs.map(_.a).show() // This is correct too
>
> baDs.select("a","b").as[AB].union(abDs).show() // Workaround from the linked issue, slightly modified; however this seems wrong since it's supposed to be strongly typed
>
> baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
> abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives the correct result
> {code}
> So it's inconsistent and a bug IMO. And I'm not sure that the suggested
> workaround is really fair, since I'm supposed to be getting a Dataset of
> type `AB`. More importantly, I think the issue is bigger when you consider
> that it happens even if you read from parquet (as you would expect), and
> that it's inconsistent when going to/from an RDD.
> I imagine it's just lazily converting to a typed DS instead of doing so
> initially. So either that typing could be prioritized to happen before the
> union, or the union of DFs could be done with column order taken into
> account. Again, this is speculation..
[jira] [Comment Edited] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465 ] Carlos Bribiescas edited comment on SPARK-22335 at 10/25/17 12:21 PM:
--
[~viirya] That is definitely a workaround, and effectively union will sometimes work correctly for typed things, unlike in RDD, which is also typed but works consistently. I guess this is just a point where two different concepts are at odds. I think this was what [~dongjoon] was getting at. (Maybe?) Suppose you're working with a traditional typed data structure, and it has an operation that works with case classes only sometimes, depending on how they were constructed. Like, if you created the case class via reflection, then flatMap won't work correctly for some reason that can be explained and documented. I also understand that SQL does unions by column order, and that's how union traditionally works. So we have two things which aren't Spark, but which Spark is inspired by, that do the same operation in slightly different ways. Updating documentation is definitely a good step toward making the API more usable, but ultimately I guess the decision is to go, in this case, with a more SQL-like approach rather than real typing.

was (Author: cbribiescas): [~viirya] That is definitely a workaround, and effectively union will sometimes work correctly for typed things, unlike in RDD, which is also typed and works consistently. I guess this is just a point where two different concepts are at odds. I think this was what [~dongjoon] was getting at. The traditional notion of a DS/DF is at odds with the implied ability to work with a typed data structure. Consider working with any collection whose API has operations that work with case classes only sometimes, depending on how they were constructed. Like, if you created the case class via reflection, then flatMap won't work correctly for some reason that can be explained and documented. Updating documentation is definitely a good step toward making the API more usable; it's just that my feeling is that a typed object is typed no matter what. I also understand that a Dataset's implementation is strongly tied to row order as well.
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16218465#comment-16218465 ] Carlos Bribiescas commented on SPARK-22335:
--
[~viirya] That is definitely a workaround, and effectively union will sometimes work correctly for typed things, unlike in RDD, which is also typed and works consistently. I guess this is just a point where two different concepts are at odds. I think this was what [~dongjoon] was getting at. The traditional notion of a DS/DF is at odds with the implied ability to work with a typed data structure. Consider working with any collection whose API has operations that work with case classes only sometimes, depending on how they were constructed. Like, if you created the case class via reflection, then flatMap won't work correctly for some reason that can be explained and documented. Updating documentation is definitely a good step toward making the API more usable; it's just that my feeling is that a typed object is typed no matter what. I also understand that a Dataset's implementation is strongly tied to row order as well.
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217497#comment-16217497 ] Carlos Bribiescas commented on SPARK-22335:
--
I'm not sure I understand what you're asking. I do agree that DS should be consistent with DF when possible, but in this case the more specific functionality (typing) doesn't apply to DF. Sorry if I didn't answer your question; I didn't quite get what you were asking. Can you clarify? Here is another example that may help:

{code:java}
case class AB(a: String, b: Int)

val abDs = sc.parallelize(List(("aThing", 0))).toDF("a", "b").as[AB]
val baDs = sc.parallelize(List((0, "aThing"))).toDF("b", "a").as[AB]

abDs.show() // works
baDs.show() // works

abDs.union(baDs).show() // Real error to do with types
abDs.rdd.union(baDs.rdd).toDF().as[AB].show() // Works, which is inconsistent with the last statement IMO
{code}
[jira] [Commented] (SPARK-21043) Add unionByName API to Dataset
[ https://issues.apache.org/jira/browse/SPARK-21043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216953#comment-16216953 ] Carlos Bribiescas commented on SPARK-21043:
--
I really like this feature. Is there a motivation not to replace union with this functionality, other than backwards compatibility?

> Add unionByName API to Dataset
>
> Key: SPARK-21043
> URL: https://issues.apache.org/jira/browse/SPARK-21043
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Reynold Xin
> Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
> It would be useful to add unionByName which resolves columns by name, in
> addition to the existing union (which resolves by position).
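As a sketch of how the new API addresses the by-position behavior discussed in SPARK-22335 (assuming Spark 2.3+, where `unionByName` landed, and an active SparkSession named `spark`):

```scala
import spark.implicits._ // assumes an active SparkSession named spark

val abDf = Seq(("aThing", "bThing")).toDF("a", "b")
val baDf = Seq(("bThing", "aThing")).toDF("b", "a")

// union matches columns by position, so "b" values land in column "a";
// unionByName matches columns by name and keeps each column aligned.
abDf.union(baDf).show()
abDf.unionByName(baDf).show()
```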
[jira] [Commented] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16216950#comment-16216950 ] Carlos Bribiescas commented on SPARK-22335:
--
I think if unionByName replaced union then it would be a solution. It's definitely a workaround... but as the API stands it feels like a bug, since Dataset is supposed to be typed. Again, I suspect it has to do with the optimizer pushing the typing to a later step, after the union by column order happens. If this is the root cause of the bug, I worry how else it manifests, that is, what other bugs it may cause. I'll have to think about it a bit more.
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335:
--
Description:

I see union uses column order for a DF. This to me is "fine" since they aren't typed. However, for a Dataset, which is supposed to be strongly typed, it actually gives the wrong result. If you try to access the members by name, it will use the order. Here is a reproducible case on 2.2.0:

{code:java}
case class AB(a: String, b: String)

val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b")
val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a")

abDf.union(baDf).show() // as linked ticket states, it's "Not a problem"

val abDs = abDf.as[AB]
val baDs = baDf.as[AB]

abDs.union(baDs).show() // Wrong result: a Dataset[AB] should be mapped by type, not by column order
abDs.union(baDs).map(_.a).show() // Wrong result for the same reason
abDs.union(baDs).rdd.take(2) // This also gives the wrong result

baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order
abDs.map(_.a).show() // This is correct too

baDs.select("a","b").as[AB].union(abDs).show() // Workaround from the linked issue, slightly modified; however this seems wrong since it's supposed to be strongly typed

baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives the correct result, which is logically inconsistent behavior
abDs.rdd.union(baDs.rdd).toDF().show() // Simpler example that gives the correct result
{code}

So it's inconsistent and a bug IMO. And I'm not sure that the suggested workaround is really fair, since I'm supposed to be getting a Dataset of type `AB`. More importantly, I think the issue is bigger when you consider that it happens even if you read from parquet (as you would expect), and that it's inconsistent when going to/from an RDD.

I imagine it's just lazily converting to a typed DS instead of doing so initially. So either that typing could be prioritized to happen before the union, or the union of DFs could be done with column order taken into account. Again, this is speculation..

was:

The same description, except the code block ended at `baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct result, which is logically inconsistent behavior` and did not yet include the simpler `abDs.rdd.union(baDs.rdd).toDF().show()` example.
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Priority: Major (was: Minor)

> Union for DataSet uses column order instead of types for union

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Description: … So either that typing could be prioritized to happen before the union or unioning of DF could be done with column order taken into account. Again, this is speculation. was: … So either that could be prioritized or unioning of DF could be done with column order taken into account. Again, this is speculation.

> Union for DataSet uses column order instead of types for union
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Description: … baDs.rdd.toDF().as[AB].union(abDs).show() // This however gives correct result, which is logically inconsistent behavior … More importantly I think the issue is bigger when you consider that it happens even if you read from parquet (as you would expect). And that it's inconsistent when going to/from rdd. … was: … And I'm not sure that the suggested workaround is really fair, since I'm supposed to be getting values of type `AB` …

> Union for DataSet uses column order instead of types for union
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Description: … And I'm not sure that the suggested workaround is really fair, since I'm supposed to be getting values of type `AB` … was: … And I'm not sure of a workaround if you get handed a DS witho …

> Union for DataSet uses column order instead of types for union
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Description: … baDs.select("a","b").as[AB].union(abDs).show() // This is the same workaround for the linked issue, slightly modified. However this seems wrong since it's supposed to be strongly typed … And I'm not sure of a workaround if you get handed a DS witho … was: … abDs.map(_.a).show() // This is correct too {code} So it's inconsistent and a bug IMO. …

> Union for DataSet uses column order instead of types for union
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Description: … abDs.union(baDs).show() // This gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order … was: … abDf.union(baDf).show() // as this ticket states, its "Not a problem" … abDs.union(baDs).show() …

> Union for DataSet uses column order instead of types for union
[jira] [Updated] (SPARK-22335) Union for DataSet uses column order instead of types for union
[ https://issues.apache.org/jira/browse/SPARK-22335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-22335: -- Description: I see union uses column order for a DF. … was: This isn't quite the issue I'm facing, but solving this issue will fix my issue. (probably) I see union uses column order for a DF. …

> Union for DataSet uses column order instead of types for union
[jira] [Created] (SPARK-22335) Union for DataSet uses column order instead of types for union
Carlos Bribiescas created SPARK-22335: - Summary: Union for DataSet uses column order instead of types for union Key: SPARK-22335 URL: https://issues.apache.org/jira/browse/SPARK-22335 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Carlos Bribiescas Priority: Minor

This isn't quite the issue I'm facing, but solving this issue will fix my issue. (probably) I see union uses column order for a DF. …
[jira] [Issue Comment Deleted] (SPARK-20761) Union uses column order rather than schema
[ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-20761: -- Comment: was deleted (was: This isn't quite the issue I'm facing, but solving this issue will fix my issue. (probably) I see union uses column order for a DF. This to me is "fine" since they aren't typed. However, for a dataset which is supposed to be strongly typed it is actually giving the wrong result. If you try to access the members by name, it will use the order. Heres is a reproducible case. 2.2.0 {code:java} case class AB(a : String, b : String) val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") abDf.union(baDf).show() // as this ticket states, its "Not a problem" val abDs = abDf.as[AB] val baDs = baDf.as[AB] abDs.union(baDs).show() abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order. abDs.map(_.a).show() // So does this {code} So its inconsistent IMO.) 
> Union uses column order rather than schema > -- > > Key: SPARK-20761 > URL: https://issues.apache.org/jira/browse/SPARK-20761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Nakul Jeirath >Priority: Minor > > I believe there is an issue when using union to combine two dataframes when > the order of columns differ between the left and right side of the union: > {code} > import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.types.{BooleanType, StringType, StructField, > StructType} > val schema = StructType(Seq( > StructField("id", StringType, false), > StructField("flag_one", BooleanType, false), > StructField("flag_two", BooleanType, false), > StructField("flag_three", BooleanType, false) > )) > val rowRdd = spark.sparkContext.parallelize(Seq( > Row("1", true, false, false), > Row("2", false, true, false), > Row("3", false, false, true) > )) > spark.createDataFrame(rowRdd, schema).createOrReplaceTempView("temp_flags") > val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], > schema) > //Select columns out of order with respect to the emptyData schema > val data = emptyData.union(spark.sql("select id, flag_two, flag_three, > flag_one from temp_flags")) > {code} > Selecting the data from the "temp_flags" table results in: > {noformat} > spark.sql("select * from temp_flags").show > +---+++--+ > | id|flag_one|flag_two|flag_three| > +---+++--+ > | 1|true| false| false| > | 2| false|true| false| > | 3| false| false| true| > +---+++--+ > {noformat} > Which is the data we'd expect but when inspecting "data" we get: > {noformat} > data.show() > +---+++--+ > | id|flag_one|flag_two|flag_three| > +---+++--+ > | 1| false| false| true| > | 2|true| false| false| > | 3| false|true| false| > +---+++--+ > {noformat} > Having a non-empty dataframe on the left side of the union doesn't seem to > make a difference either: > {noformat} > spark.sql("select * from temp_flags").union(spark.sql("select id, 
flag_two, > flag_three, flag_one from temp_flags")).show > +---+++--+ > | id|flag_one|flag_two|flag_three| > +---+++--+ > | 1|true| false| false| > | 2| false|true| false| > | 3| false| false| true| > | 1| false| false| true| > | 2|true| false| false| > | 3| false|true| false| > +---+++--+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
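On versions where `unionByName` is unavailable (it arrived in Spark 2.3), the usual workaround for the behavior described above is to reorder the right-hand side with `select` so both inputs share one column order before the positional `union`. A sketch (not from the original report) against the `temp_flags` view registered in the repro, assuming a `SparkSession` in scope as `spark`:

```scala
import org.apache.spark.sql.functions.col

// Assumes the temp_flags view created in the repro above.
val left  = spark.sql("select * from temp_flags")
val right = spark.sql("select id, flag_two, flag_three, flag_one from temp_flags")

// Project the right side into the left side's column order, then union:
val aligned = right.select(left.columns.map(col): _*)
left.union(aligned).show()  // rows now line up under the correct headers
```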
[jira] [Comment Edited] (SPARK-20761) Union uses column order rather than schema
[ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282 ] Carlos Bribiescas edited comment on SPARK-20761 at 10/23/17 3:18 PM: - This isn't quite the issue I'm facing, but solving it will (probably) fix my issue. I see that union uses column order for a DF. This is "fine" to me, since DFs aren't typed. However, for a Dataset, which is supposed to be strongly typed, it actually gives the wrong result. If you try to access the members by name, it will use the order. Here is a reproducible case (2.2.0): {code:java} case class AB(a : String, b : String) val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") abDf.union(baDf).show() // as this ticket states, it's "Not a problem" val abDs = abDf.as[AB] val baDs = baDf.as[AB] abDs.union(baDs).show() abDs.union(baDs).map(_.a).show() // this gives the wrong result, since a Dataset[AB] should be mapped by type, not by column order baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order. abDs.map(_.a).show() // So does this {code} So it's inconsistent, IMO. was (Author: cbribiescas): This isn't quite the issue I'm facing, but solving this issue will fix my issue. (probably) I see union uses column order for a DF. This to me is "fine" since they aren't typed. However, for a dataset which is supposed to be strongly typed it is actually giving the wrong result. If you try to access the members by name, it will use the order. Heres is a reproducible case. 
2.2.0 case class AB(a : String, b : String) {code:java} val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") abDf.union(baDf).show() // as this ticket states, its "Not a problem" val abDs = abDf.as[AB] val baDs = baDf.as[AB] abDs.union(baDs).show() abDs.union(baDs).map(_.a).show() // this gives wrong result since a Dataset[AB] should be correctly mapped by type, not by column order baDs.map(_.a).show() // However, this gives the correct result, even though columns were out of order. abDs.map(_.a).show() // So does this {code} So its inconsistent IMO.
[jira] [Commented] (SPARK-20761) Union uses column order rather than schema
[ https://issues.apache.org/jira/browse/SPARK-20761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215282#comment-16215282 ] Carlos Bribiescas commented on SPARK-20761: --- This isn't quite the issue I'm facing, but solving it will fix my issue. I see that union uses column order for a DF. This is "fine" to me, since DFs aren't typed. However, for a Dataset, which is supposed to be strongly typed, it actually gives the wrong result. If you try to access the members by name, it will use the order. Here is a reproducible case (2.2.0): case class AB(a : String, b : String) val abDf = sc.parallelize(List(("aThing","bThing"))).toDF("a", "b") val baDf = sc.parallelize(List(("bThing","aThing"))).toDF("b", "a") abDf.union(baDf).show() // as this ticket states, it's "Not a problem" val abDs = abDf.as[AB] val baDs = baDf.as[AB] abDs.union(baDs).show() abDs.union(baDs).map(_.a).show() // this gives the wrong result, since a Dataset[AB] should be correctly mapped > Union uses column order rather than schema > -- > > Key: SPARK-20761 > URL: https://issues.apache.org/jira/browse/SPARK-20761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1 >Reporter: Nakul Jeirath >Priority: Minor > > I believe there is an issue when using union to combine two dataframes when > the order of columns differ between the left and right side of the union: > {code} > import org.apache.spark.sql.{Row, SparkSession} > import org.apache.spark.sql.types.{BooleanType, StringType, StructField, > StructType} > val schema = StructType(Seq( > StructField("id", StringType, false), > StructField("flag_one", BooleanType, false), > StructField("flag_two", BooleanType, false), > StructField("flag_three", BooleanType, false) > )) > val rowRdd = spark.sparkContext.parallelize(Seq( > Row("1", true, false, false), > Row("2", false, true, false), > Row("3", false, false, true) > )) > spark.createDataFrame(rowRdd, 
schema).createOrReplaceTempView("temp_flags") > val emptyData = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], > schema) > //Select columns out of order with respect to the emptyData schema > val data = emptyData.union(spark.sql("select id, flag_two, flag_three, > flag_one from temp_flags")) > {code} > Selecting the data from the "temp_flags" table results in: > {noformat} > spark.sql("select * from temp_flags").show > +---+--------+--------+----------+ > | id|flag_one|flag_two|flag_three| > +---+--------+--------+----------+ > | 1| true| false| false| > | 2| false| true| false| > | 3| false| false| true| > +---+--------+--------+----------+ > {noformat} > Which is the data we'd expect, but when inspecting "data" we get: > {noformat} > data.show() > +---+--------+--------+----------+ > | id|flag_one|flag_two|flag_three| > +---+--------+--------+----------+ > | 1| false| false| true| > | 2| true| false| false| > | 3| false| true| false| > +---+--------+--------+----------+ > {noformat} > Having a non-empty dataframe on the left side of the union doesn't seem to > make a difference either: > {noformat} > spark.sql("select * from temp_flags").union(spark.sql("select id, flag_two, > flag_three, flag_one from temp_flags")).show > +---+--------+--------+----------+ > | id|flag_one|flag_two|flag_three| > +---+--------+--------+----------+ > | 1| true| false| false| > | 2| false| true| false| > | 3| false| false| true| > | 1| false| false| true| > | 2| true| false| false| > | 3| false| true| false| > +---+--------+--------+----------+ > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
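For readers puzzled by the swapped values above, the positional-vs-by-name semantics can be illustrated without a Spark cluster. The following is a plain-Python sketch (the helper names are mine, not a Spark API); in Spark itself, Dataset.unionByName, added in 2.3.0, performs the by-name alignment, while plain union is positional.

```python
def union_positional(left_schema, left_rows, right_rows):
    """Spark union() semantics: rows from both sides are concatenated
    positionally and relabeled with the LEFT schema, regardless of the
    column order the right side was written in."""
    return [dict(zip(left_schema, row)) for row in left_rows + right_rows]

def union_by_name(left_schema, left_named, right_named):
    """unionByName-style semantics: each row keeps its values per column
    name, so a differing column order no longer matters."""
    return [{col: row[col] for col in left_schema}
            for row in left_named + right_named]

ab_schema = ["a", "b"]
ab_rows = [("aThing", "bThing")]   # written in (a, b) order
ba_rows = [("bThing", "aThing")]   # written in (b, a) order

# Positional union silently swaps the values of the second row:
print(union_positional(ab_schema, ab_rows, ba_rows))
# [{'a': 'aThing', 'b': 'bThing'}, {'a': 'bThing', 'b': 'aThing'}]

# By-name union preserves what each value meant:
ab_named = [dict(zip(ab_schema, row)) for row in ab_rows]
ba_named = [dict(zip(["b", "a"], row)) for row in ba_rows]
print(union_by_name(ab_schema, ab_named, ba_named))
# [{'a': 'aThing', 'b': 'bThing'}, {'a': 'aThing', 'b': 'bThing'}]
```

In Spark 2.3+, `abDs.unionByName(baDs)` gives the expected result; on 2.2.x a common workaround is to select the right-hand side's columns into the left-hand order before the union, e.g. `abDs.union(baDs.select("a", "b").as[AB])`.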
[jira] [Comment Edited] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090991#comment-16090991 ] Carlos Bribiescas edited comment on SPARK-15703 at 7/18/17 2:15 AM: Does this only affect the UI or will jobs actually not process correctly when this happens? was (Author: cbribiescas): Does this only affect the UI or will jobs actually not work correctly when this happens? > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, > SparkListenerBus .png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > it's completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15703) Make ListenerBus event queue size configurable
[ https://issues.apache.org/jira/browse/SPARK-15703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090991#comment-16090991 ] Carlos Bribiescas commented on SPARK-15703: --- Does this only affect the UI or will jobs actually not work correctly when this happens? > Make ListenerBus event queue size configurable > -- > > Key: SPARK-15703 > URL: https://issues.apache.org/jira/browse/SPARK-15703 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Web UI >Affects Versions: 2.0.0 >Reporter: Thomas Graves >Assignee: Dhruve Ashar >Priority: Minor > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-06-01 at 11.21.32 AM.png, Screen Shot > 2016-06-01 at 11.23.48 AM.png, spark-dynamic-executor-allocation.png, > SparkListenerBus .png > > > The Spark UI doesn't seem to be showing all the tasks and metrics. > I ran a job with 10 tasks but Detail stage page says it completed 93029: > Summary Metrics for 93029 Completed Tasks > The Stages for all jobs pages list that only 89519/10 tasks finished but > it's completed. The metrics for shuffled write and input are also incorrect. > I will attach screen shots. > I checked the logs and it does show that all the tasks actually finished. > 16/06/01 16:15:42 INFO TaskSetManager: Finished task 59880.0 in stage 2.0 > (TID 54038) in 265309 ms on 10.213.45.51 (10/10) > 16/06/01 16:15:42 INFO YarnClusterScheduler: Removed TaskSet 2.0, whose tasks > have all completed, from pool -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
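The fix for SPARK-15703 made the previously hard-coded event queue size a configuration key. A hedged sketch of raising it at submit time: the key name below is the one this ticket introduced for the 2.0.1/2.1 line (Spark 2.3+ renamed it to spark.scheduler.listenerbus.eventqueue.capacity), and the application class and jar name are placeholders.

```shell
# Raise the ListenerBus event queue above the 2.x default of 10000 events,
# so bursty stages are less likely to drop listener events (which is what
# makes the UI's task counts and metrics go wrong).
# The value 20000 is illustrative, not a recommendation.
spark-submit \
  --conf spark.scheduler.listenerbus.eventqueue.size=20000 \
  --class com.example.MyApp \
  my-app.jar
```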
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683 ] Carlos Bribiescas edited comment on SPARK-10795 at 3/1/16 2:08 PM: --- Using this command to submit job {code}spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py{code} If MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. Hope this helps. {code}Cluster Configuration Release label:emr-4.2.0 Hadoop distribution:Amazon 2.6.0 Applications:SPARK 1.5.2, HIVE 1.0.0, HUE 3.7.1{code} was (Author: cbribiescas): Using this command to submit job {code}spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py{code} If MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. 
Hope this helps > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173723#comment-15173723 ] Carlos Bribiescas commented on SPARK-10795: --- Have you tried just specifying the SparkContext and nothing else? For example, if you specified a master via the SparkContext but also did so on the command line, I don't know what the expected behavior would be. I suggest ruling that out before trying to cut up your code too much. I do realize that there may be many other causes of this issue, so I don't mean to suggest that not initializing your SparkContext is the only one. Just trying to rule this one cause out. > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683 ] Carlos Bribiescas edited comment on SPARK-10795 at 2/22/16 8:58 PM: Using this command to submit job {code}spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py{code} if MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. Hope this helps was (Author: cbribiescas): Using this command spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py if MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. 
Hope this helps > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683 ] Carlos Bribiescas edited comment on SPARK-10795 at 2/22/16 8:58 PM: Using this command to submit job {code}spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py{code} If MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. Hope this helps was (Author: cbribiescas): Using this command to submit job {code}spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py{code} if MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. 
Hope this helps > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683 ] Carlos Bribiescas edited comment on SPARK-10795 at 2/22/16 8:57 PM: Using this command spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py if MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. Using the following file instead I'm able to reproduce the bug. {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. Hope this helps was (Author: cbribiescas): Using this command spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py if MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} Then everything is fine. If MyPythonFile.py does not specify a spark context (As one would in the interactive shell) then it gives me the error you say in your bug. However if I use this file instead {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. 
Hope this helps > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157683#comment-15157683 ] Carlos Bribiescas commented on SPARK-10795: --- Using this command to submit the job: {code}spark-submit --master yarn-cluster --num-executors 1 --driver-memory 1g --executor-memory 1g --executor-cores 1 MyPythonFile.py{code} If MyPythonFile.py looks like this {code} from pyspark import SparkContext jobName="My Name" sc = SparkContext(appName=jobName) {code} then everything is fine. If MyPythonFile.py does not specify a spark context (as one would in the interactive shell) then it gives me the error you describe in your bug. However, if I use this file instead, I can reproduce the error: {code} from pyspark import SparkContext jobName="My Name" # sc = SparkContext(appName=jobName) {code} So I suspect you just didn't define a spark context properly for a cluster. Hope this helps. > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. 
> Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carlos Bribiescas updated SPARK-10795: -- Comment: was deleted (was: What is the command you use when this happens? I had this issue previously but only when using --py-files in my spark-submit. Not other than that though.) > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10795) FileNotFoundException while deploying pyspark job on cluster
[ https://issues.apache.org/jira/browse/SPARK-10795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157612#comment-15157612 ] Carlos Bribiescas commented on SPARK-10795: --- What is the command you use when this happens? I had this issue previously, but only when using --py-files in my spark-submit; not otherwise. > FileNotFoundException while deploying pyspark job on cluster > > > Key: SPARK-10795 > URL: https://issues.apache.org/jira/browse/SPARK-10795 > Project: Spark > Issue Type: Bug > Components: PySpark > Environment: EMR >Reporter: Harshit > > I am trying to run simple spark job using pyspark, it works as standalone , > but while I deploy over cluster it fails. > Events : > 2015-09-24 10:38:49,602 INFO [main] yarn.Client (Logging.scala:logInfo(59)) > - Uploading resource file:/usr/lib/spark/python/lib/pyspark.zip -> > hdfs://ip-.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > Above uploading resource file is successfull , I manually checked file is > present in above specified path , but after a while I face following error : > Diagnostics: File does not exist: > hdfs://ip-xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip > java.io.FileNotFoundException: File does not exist: > hdfs://ip-1xxx.ap-southeast-1.compute.internal:8020/user/hadoop/.sparkStaging/application_1439967440341_0461/pyspark.zip -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org