[jira] [Closed] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order

2020-01-15 Thread Henrique dos Santos Goulart (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrique dos Santos Goulart closed SPARK-23273.
---

> Spark Dataset withColumn - schema column order isn't the same as case class 
> parameter order
> 
>
> Key: SPARK-23273
> URL: https://issues.apache.org/jira/browse/SPARK-23273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Henrique dos Santos Goulart
>Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>  
> This raises the following error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"{noformat}
> It seems that .as[CaseClass] doesn't keep the parameter order declared on 
> the case class.
> If I change the case class parameter order, it works, e.g.: 
> {code:java}
> case class NameAge(age: Int, name: String){code}
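
Editorial note: a possible workaround, not part of the original report, is to
align the columns explicitly before the union, either by selecting them in
case-class declaration order or, on Spark 2.3+, with Dataset.unionByName, which
resolves columns by name rather than position. A minimal sketch, assuming a
spark-shell session with the same case classes as above:

{code:java}
import org.apache.spark.sql.functions.lit
import spark.implicits._

val ds1 = spark.emptyDataset[NameAge]

// withColumn appends "name" after "age", giving schema (age, name);
// selecting the columns in declaration order restores (name, age).
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .select($"name", $"age")
  .as[NameAge]

ds1.union(ds2).show()

// Alternatively, on Spark 2.3+, resolve columns by name instead:
// ds1.unionByName(ds2)
{code}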






[jira] [Issue Comment Deleted] (SPARK-10063) Remove DirectParquetOutputCommitter

2018-02-06 Thread Henrique dos Santos Goulart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrique dos Santos Goulart updated SPARK-10063:

Comment: was deleted

(was: Is there any alternative right now that works with Parquet and 
partitionBy? It works very well if I set version=2 and do not use 
partitionBy, but if I use 
dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it 
will create temporary folders =(

Reference question: 
[https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder]

 

[~rxin] [~yhuai] [~ste...@apache.org] [~chiragvaya])
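
Editorial sketch of what the commenter appears to be doing: the truncated
option above presumably refers to Hadoop's
mapreduce.fileoutputcommitter.algorithm.version key; dataset, the partition
column, and the output path below are placeholders.

{code:java}
// Sketch only: FileOutputCommitter algorithm version 2 renames task output
// directly into the destination, skipping version 1's extra job-level rename.
// The key is the standard Hadoop one; the option name quoted above is
// truncated.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

// Partitioned Parquet write as described; as the commenter observes, with
// partitionBy the output is still staged under temporary directories before
// being committed.
dataset.write
  .partitionBy("someColumn")
  .parquet("s3a://some-bucket/some/path")
{code}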

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> the output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.
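
Editorial note: given that race, one blunt mitigation (a sketch, not from the
issue itself) is to keep speculative execution disabled for jobs that write
through a direct committer:

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: spark.speculation defaults to false; leaving it off avoids the
// duplicate-attempt race described above, at the cost of losing speculation's
// protection against straggler tasks. Only relevant if speculation was
// enabled elsewhere in the cluster configuration.
val spark = SparkSession.builder()
  .appName("no-speculation-write")
  .config("spark.speculation", "false")
  .getOrCreate()
{code}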






[jira] [Comment Edited] (SPARK-10063) Remove DirectParquetOutputCommitter

2018-02-01 Thread Henrique dos Santos Goulart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349550#comment-16349550
 ] 

Henrique dos Santos Goulart edited comment on SPARK-10063 at 2/2/18 12:11 AM:
--

Is there any alternative right now that works with Parquet and partitionBy? 
It works very well if I set version=2 and do not use partitionBy, but if I 
use dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) 
it will create temporary folders =(

Reference question: 
[https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder]

 

[~rxin] [~yhuai] [~ste...@apache.org] [~chiragvaya]


was (Author: henriquedsg89):
Is there any alternative right now that works with Parquet and partitionBy? 
Because if I use 
dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it 
will create temporary folders =(

Reference question: 
https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> the output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.






[jira] [Commented] (SPARK-10063) Remove DirectParquetOutputCommitter

2018-02-01 Thread Henrique dos Santos Goulart (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349550#comment-16349550
 ] 

Henrique dos Santos Goulart commented on SPARK-10063:
-

Is there any alternative right now that works with Parquet and partitionBy? 
Because if I use 
dataset.partitionBy(..).option("..algorithm.version", "2").parquet(...) it 
will create temporary folders =(

Reference question: 
https://stackoverflow.com/questions/48573643/spark-dataset-parquet-partition-on-s3-creating-temporary-folder

> Remove DirectParquetOutputCommitter
> ---
>
> Key: SPARK-10063
> URL: https://issues.apache.org/jira/browse/SPARK-10063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 2.0.0
>
>
> When we use DirectParquetOutputCommitter on S3 and speculation is enabled, 
> there is a chance that we can lose data. 
> Here is the code to reproduce the problem.
> {code}
> import org.apache.spark.sql.functions._
> val failSpeculativeTask = sqlContext.udf.register("failSpeculativeTask", (i: 
> Int, partitionId: Int, attemptNumber: Int) => {
>   if (partitionId == 0 && i == 5) {
> if (attemptNumber > 0) {
>   Thread.sleep(15000)
>   throw new Exception("new exception")
> } else {
>   Thread.sleep(1)
> }
>   }
>   
>   i
> })
> val df = sc.parallelize((1 to 100), 20).mapPartitions { iter =>
>   val context = org.apache.spark.TaskContext.get()
>   val partitionId = context.partitionId
>   val attemptNumber = context.attemptNumber
>   iter.map(i => (i, partitionId, attemptNumber))
> }.toDF("i", "partitionId", "attemptNumber")
> df
>   .select(failSpeculativeTask($"i", $"partitionId", 
> $"attemptNumber").as("i"), $"partitionId", $"attemptNumber")
>   .write.mode("overwrite").format("parquet").save("/home/yin/outputCommitter")
> sqlContext.read.load("/home/yin/outputCommitter").count
> // The result is 99 and 5 is missing from the output.
> {code}
> What happened is that the original task finishes first and uploads its output 
> file to S3, then the speculative task somehow fails. Because we have to call 
> the output stream's close method, which uploads data to S3, we actually upload 
> the partial result generated by the failed speculative task to S3, and this 
> file overwrites the correct file generated by the original task.






[jira] [Updated] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order

2018-01-30 Thread Henrique dos Santos Goulart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrique dos Santos Goulart updated SPARK-23273:

Description: 
{code:java}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()

ds1.union(ds2)
{code}
 

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"{noformat}
It seems that .as[CaseClass] doesn't keep the parameter order declared on 
the case class.

If I change the case class parameter order, it works, e.g.: `case class 
NameAge(age: Int, name: String)`

  was:
{code}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()

ds1.union(ds2)
{code}
 

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"{noformat}
It seems that .as[CaseClass] doesn't keep the parameter order declared on 
the case class.


> Spark Dataset withColumn - schema column order isn't the same as case class 
> parameter order
> 
>
> Key: SPARK-23273
> URL: https://issues.apache.org/jira/browse/SPARK-23273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Henrique dos Santos Goulart
>Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>  
> This raises the following error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"{noformat}
> It seems that .as[CaseClass] doesn't keep the parameter order declared on 
> the case class.
> If I change the case class parameter order, it works, e.g.: 
> `case class NameAge(age: Int, name: String)`






[jira] [Updated] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order

2018-01-30 Thread Henrique dos Santos Goulart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Henrique dos Santos Goulart updated SPARK-23273:

Description: 
{code:java}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()

ds1.union(ds2)
{code}
 

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"{noformat}
It seems that .as[CaseClass] doesn't keep the parameter order declared on 
the case class.

If I change the case class parameter order, it works, e.g.: 
{code:java}
case class NameAge(age: Int, name: String){code}

  was:
{code:java}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()

ds1.union(ds2)
{code}
 

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"{noformat}
It seems that .as[CaseClass] doesn't keep the parameter order declared on 
the case class.

If I change the case class parameter order, it works, e.g.: `case class 
NameAge(age: Int, name: String)`


> Spark Dataset withColumn - schema column order isn't the same as case class 
> parameter order
> 
>
> Key: SPARK-23273
> URL: https://issues.apache.org/jira/browse/SPARK-23273
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Henrique dos Santos Goulart
>Priority: Major
>
> {code:java}
> case class OnlyAge(age: Int)
> case class NameAge(name: String, age: Int)
> val ds1 = spark.emptyDataset[NameAge]
> val ds2 = spark
>   .createDataset(Seq(OnlyAge(1)))
>   .withColumn("name", lit("henriquedsg89"))
>   .as[NameAge]
> ds1.show()
> ds2.show()
> ds1.union(ds2)
> {code}
>  
> This raises the following error:
> {noformat}
> Cannot up cast `age` from string to int as it may truncate
> The type path of the target object is:
> - field (class: "scala.Int", name: "age")
> - root class: "dw.NameAge"{noformat}
> It seems that .as[CaseClass] doesn't keep the parameter order declared on 
> the case class.
> If I change the case class parameter order, it works, e.g.: 
> {code:java}
> case class NameAge(age: Int, name: String){code}






[jira] [Created] (SPARK-23273) Spark Dataset withColumn - schema column order isn't the same as case class parameter order

2018-01-30 Thread Henrique dos Santos Goulart (JIRA)
Henrique dos Santos Goulart created SPARK-23273:
---

 Summary: Spark Dataset withColumn - schema column order isn't the 
same as case class parameter order
 Key: SPARK-23273
 URL: https://issues.apache.org/jira/browse/SPARK-23273
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Henrique dos Santos Goulart


{code}
case class OnlyAge(age: Int)
case class NameAge(name: String, age: Int)

val ds1 = spark.emptyDataset[NameAge]
val ds2 = spark
  .createDataset(Seq(OnlyAge(1)))
  .withColumn("name", lit("henriquedsg89"))
  .as[NameAge]

ds1.show()
ds2.show()

ds1.union(ds2)
{code}
 

This raises the following error:
{noformat}
Cannot up cast `age` from string to int as it may truncate
The type path of the target object is:
- field (class: "scala.Int", name: "age")
- root class: "dw.NameAge"{noformat}
It seems that .as[CaseClass] doesn't keep the parameter order declared on 
the case class.


