[jira] [Closed] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order
[ https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrique dos Santos Goulart closed SPARK-23273.
------------------------------------------------

> Spark Dataset withColumn - schema column order isn't the same as case class parameter order
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23273
>                 URL: https://issues.apache.org/jira/browse/SPARK-23273
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Henrique dos Santos Goulart
>            Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
>
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
>
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>
> This raises the following error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"
> {noformat}
>
> It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
> If I change the case class parameter order, it works, e.g.:
> {code:java}
> case class NameAge(age: Int, name: String)
> {code}
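A possible user-side workaround, sketched below with hedging: the failure above comes from Dataset.union matching columns by position, while ds2's columns end up ordered (age, name) after withColumn. Dataset.unionByName, added in Spark 2.3.0 (newer than the 2.2.1 reported here), matches columns by name instead. The snippet assumes a spark-shell style session where `spark` and its implicits are in scope, as in the reproduction above; it is a sketch of one option, not the resolution of this ticket.

{code:java}
// Hedged workaround sketch: use a name-based union instead of a positional one.
// Assumes a spark-shell session (spark / spark.implicits._ in scope) and Spark 2.3+.
import org.apache.spark.sql.functions.lit

case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89")) // columns are now ordered (age, name)
  .as[NameAge]

// union matches by position and mixes up name/age; unionByName matches by name
ds1.unionByName(ds2).show()
{code}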
[jira] [Issue Comment Deleted] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrique dos Santos Goulart updated SPARK-10063:
------------------------------------------------
    Comment: was deleted

(was: Is there any alternative right now that works with Parquet and partitionBy? It works very well if I set version=2 and do not use partitionBy, but if I use dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it creates temporary folders =(

Reference question: [https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder]

[~rxin] [~yhuai] [~ste...@apache.org] [~chiragvaya])

> Remove DirectParquetOutputCommitter
> -----------------------------------
>
>                 Key: SPARK-10063
>                 URL: https://issues.apache.org/jira/browse/SPARK-10063
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Reynold Xin
>            Priority: Critical
>             Fix For: 2.0.0
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, there is a chance that we can lose data.
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
>
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
>     if (attemptNumber > 0) {
>       Thread.sleep(15000)
>       throw new Exception("new exception")
>     } else {
>       Thread.sleep(1)
>     }
>   }
>   i
> })
>
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
>
> df
>   .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
>
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output file to S3, then the speculative task somehow fails. Because we have to call the output stream's close method, which uploads data to S3, we actually upload the partial result generated by the failed speculative task to S3, and this file overwrites the correct file generated by the original task.
[jira] [Comment Edited] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349550#comment-16349550 ]

Henrique dos Santos Goulart edited comment on SPARK-10063 at 2/2/18 12:11 AM:
------------------------------------------------------------------------------

Is there any alternative right now that works with Parquet and partitionBy? It works very well if I set version=2 and do not use partitionBy, but if I use dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it creates temporary folders =(

Reference question: [https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder]

[~rxin] [~yhuai] [~ste...@apache.org] [~chiragvaya]

was (Author: henriquedsg89):
Is there any alternative right now that works with Parquet and partitionBy? If I use dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it creates temporary folders =(

Reference question: https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder

> Remove DirectParquetOutputCommitter
> -----------------------------------
>
>                 Key: SPARK-10063
>                 URL: https://issues.apache.org/jira/browse/SPARK-10063
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Reynold Xin
>            Priority: Critical
>             Fix For: 2.0.0
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, there is a chance that we can lose data.
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
>
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
>     if (attemptNumber > 0) {
>       Thread.sleep(15000)
>       throw new Exception("new exception")
>     } else {
>       Thread.sleep(1)
>     }
>   }
>   i
> })
>
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
>
> df
>   .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
>
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output file to S3, then the speculative task somehow fails. Because we have to call the output stream's close method, which uploads data to S3, we actually upload the partial result generated by the failed speculative task to S3, and this file overwrites the correct file generated by the original task.
[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter
[ https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349550#comment-16349550 ]

Henrique dos Santos Goulart commented on SPARK-10063:
------------------------------------------------------

Is there any alternative right now that works with Parquet and partitionBy? If I use dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it creates temporary folders =(

Reference question: https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder

> Remove DirectParquetOutputCommitter
> -----------------------------------
>
>                 Key: SPARK-10063
>                 URL: https://issues.apache.org/jira/browse/SPARK-10063
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Reynold Xin
>            Priority: Critical
>             Fix For: 2.0.0
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, there is a chance that we can lose data.
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
>
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
>     if (attemptNumber > 0) {
>       Thread.sleep(15000)
>       throw new Exception("new exception")
>     } else {
>       Thread.sleep(1)
>     }
>   }
>   i
> })
>
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
>
> df
>   .select(failSpeculativeTask($"i", $"partitionId", $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
>
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output file to S3, then the speculative task somehow fails. Because we have to call the output stream's close method, which uploads data to S3, we actually upload the partial result generated by the failed speculative task to S3, and this file overwrites the correct file generated by the original task.
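For reference, the commit-algorithm setting discussed in this comment is normally applied as the Hadoop configuration key mapreduce.fileoutputcommitter.algorithm.version rather than as a per-write option. Below is a minimal, hedged sketch of setting it session-wide through Spark's spark.hadoop.* configuration forwarding; the dataset, partition column, and S3 path are placeholders, and whether this avoids the temporary directories for partitionBy writes on S3 is exactly the open question raised above.

{code:java}
// Minimal sketch, not an official recommendation. Assumptions: the "version=2"
// above refers to mapreduce.fileoutputcommitter.algorithm.version; the data,
// partition column, and s3a:// path below are placeholders for illustration.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("committer-algorithm-sketch")
  // spark.hadoop.* entries are forwarded into the Hadoop Configuration,
  // so the v2 commit algorithm applies to all file writes in this session
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

spark.range(100)
  .withColumn("part", col("id") % 10)        // hypothetical partition column
  .write
  .partitionBy("part")
  .mode("append")
  .parquet("s3a://some-bucket/some-prefix")  // placeholder S3 location
{code}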
[jira] [Updated] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order
[ https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrique dos Santos Goulart updated SPARK-23273:
------------------------------------------------
    Description:
{code:java}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()
ds1.union(ds2)
{code}

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
{noformat}

It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
If I change the case class parameter order, it works, e.g.:
`case class NameAge(age: Int, name: String)`

  was:
{code}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()
ds1.union(ds2)
{code}

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
{noformat}

It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.

> Spark Dataset withColumn - schema column order isn't the same as case class parameter order
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23273
>                 URL: https://issues.apache.org/jira/browse/SPARK-23273
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Henrique dos Santos Goulart
>            Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
>
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
>
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>
> This raises the following error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"
> {noformat}
>
> It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
> If I change the case class parameter order, it works, e.g.:
> `case class NameAge(age: Int, name: String)`
[jira] [Updated] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order
[ https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Henrique dos Santos Goulart updated SPARK-23273:
------------------------------------------------
    Description:
{code:java}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()
ds1.union(ds2)
{code}

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
{noformat}

It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
If I change the case class parameter order, it works, e.g.:
{code:java}
case class NameAge(age: Int, name: String)
{code}

  was:
{code:java}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()
ds1.union(ds2)
{code}

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
{noformat}

It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
If I change the case class parameter order, it works, e.g.:
`case class NameAge(age: Int, name: String)`

> Spark Dataset withColumn - schema column order isn't the same as case class parameter order
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-23273
>                 URL: https://issues.apache.org/jira/browse/SPARK-23273
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: Henrique dos Santos Goulart
>            Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
>
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
>
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>
> This raises the following error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"
> {noformat}
>
> It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
> If I change the case class parameter order, it works, e.g.:
> {code:java}
> case class NameAge(age: Int, name: String)
> {code}
[jira] [Created] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order
Henrique dos Santos Goulart created SPARK-23273:
------------------------------------------------

             Summary: Spark Dataset withColumn - schema column order isn't the same as case class parameter order
                 Key: SPARK-23273
                 URL: https://issues.apache.org/jira/browse/SPARK-23273
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: Henrique dos Santos Goulart

{code}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()
ds1.union(ds2)
{code}

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"
{noformat}

It seems that .as[CaseClass] does not keep the order of parameters as declared on the case class.
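Another hedged workaround sketch for the report above: explicitly re-select the columns in the case class's declared order before .as[NameAge], so that both sides of the positional union line up. This again assumes a spark-shell style session with `spark` and its implicits in scope; it illustrates a user-side fix, not the behaviour change requested by the ticket.

{code:java}
// Hedged sketch: force ds2's columns back into the NameAge(name, age) order
// before the positional union. Assumes spark / spark.implicits._ in scope.
import org.apache.spark.sql.functions.{col, lit}

case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .select(col("name"), col("age"))  // reorder to match the case class declaration
  .as[NameAge]

ds1.union(ds2).show()               // column positions now agree, so union succeeds
{code}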