[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2019-06-16 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865245#comment-16865245
 ] 

Tony Zhang commented on SPARK-10892:


This issue still exists in the latest Spark code as of today (Spark 3.0.0). 
I took some time debugging and here are my observations:

prcp("value") and tmin("value") refer to the same Column instance, so `select` on 
the joined DataFrame returns the same value `3` for both. That Column instance 
comes from the single LogicalPlan held by the DataFrame created in the `load()` 
method; the LogicalPlan is reused in each `filter()` call and therefore hands out 
the same Column instance for a given column name, such as "value".

This can be confirmed with the statements below:

 
{code:java}
scala> prcp("value").expr == tmin("value").expr
res5: Boolean = true
{code}
 

 
{code:java}
scala> prcp.join(tmin, "date_str").select(prcp("value"), 
tmin("value")).explain(true)
== Parsed Logical Plan ==
Project [value#9L, value#9L]
{code}
 

 
{code:java}
 
scala> prcp.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = PRCP)
+- RelationV2[date_str#7, metric#8, value#9L]

scala> tmin.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = TMIN)
+- RelationV2[date_str#7, metric#8, value#9L]{code}
 

This also explains why there is no such problem when you join two Datasets 
created separately (using two toDF() calls, two different load() calls, etc.): in 
that case two different LogicalPlan instances are created, so prcp("value") and 
tmin("value") refer to different Column instances.
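
A minimal workaround sketch along those lines (my own example, assuming the same data.json and a Spark 3.0 spark-shell session named `spark`): reading the file twice gives each branch its own LogicalPlan, so the join selects two distinct "value" columns as expected.

{code:java}
// Workaround sketch, not the proposed fix: load the data twice so that
// the two branches are backed by independent LogicalPlan instances.
val weatherA = spark.read.format("json").load("data.json")
val weatherB = spark.read.format("json").load("data.json")

val prcp2 = weatherA.filter("metric = 'PRCP'").as("prcp")
val tmin2 = weatherB.filter("metric = 'TMIN'").as("tmin")

// prcp2("value") and tmin2("value") now carry different attribute ids,
// so the projection no longer duplicates a single column.
prcp2.join(tmin2, "date_str").select(prcp2("value"), tmin2("value")).show()
{code}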

I made a local hack that creates a new LogicalPlan when the user calls `filter()` 
with a different condition expression, and the problem went away. However, I think 
there should be a more thorough fix that properly manages LogicalPlan instances in 
Dataset instead of reusing the same instance across all functions. I will leave 
that to the Spark SQL experts. Hope my debugging above helps a little.

I'm new to Spark, so please let me know if I missed something here :)

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column with the same 
> name, "value", so they are renamed after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  P

[jira] [Issue Comment Deleted] (SPARK-10892) Join with Data Frame returns wrong results

2019-06-16 Thread Tony Zhang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Zhang updated SPARK-10892:
---
Comment: was deleted

(was: This issue still exists in the latest Spark code as of today (Spark 
3.0.0). I took some time debugging and here are my observations:

prcp("value") and tmin("value") refer to the same Column instance, so `select` on 
the joined DataFrame returns the same value `3` for both. That Column instance 
comes from the single LogicalPlan held by the DataFrame created in the `load()` 
method; the LogicalPlan is reused in each `filter()` call and therefore hands out 
the same Column instance for a given column name, such as "value".

This can be confirmed with the statements below:

 
{code:java}
scala> prcp("value").expr == tmin("value").expr
res5: Boolean = true
{code}
 

 
{code:java}
scala> prcp.join(tmin, "date_str").select(prcp("value"), 
tmin("value")).explain(true)
== Parsed Logical Plan ==
Project [value#9L, value#9L]
{code}
 

 
{code:java}
 
scala> prcp.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = PRCP)
+- RelationV2[date_str#7, metric#8, value#9L]

scala> tmin.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = TMIN)
+- RelationV2[date_str#7, metric#8, value#9L]{code}
 

This also explains why there is no such problem when you join two Datasets 
created separately (using two toDF() calls, two different load() calls, etc.): in 
that case two different LogicalPlan instances are created, so prcp("value") and 
tmin("value") refer to different Column instances.

I made a local hack that creates a new LogicalPlan when the user calls `filter()` 
with a different condition expression, and the problem went away. However, I think 
there should be a more thorough fix that properly manages LogicalPlan instances in 
Dataset instead of reusing the same instance across all functions. I will leave 
that to the Spark SQL experts. Hope my debugging above helps a little.

I'm new to Spark, so please let me know if I missed something here :))

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column with the same 
> name, "value", so they are renamed after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW00023272|   

[jira] [Commented] (SPARK-10892) Join with Data Frame returns wrong results

2019-06-16 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865246#comment-16865246
 ] 

Tony Zhang commented on SPARK-10892:


This issue still exists in the latest Spark code as of today (Spark 3.0.0). 
I took some time debugging and here are my observations:

prcp("value") and tmin("value") refer to the same Column instance, so `select` on 
the joined DataFrame returns the same value `3` for both. That Column instance 
comes from the single LogicalPlan held by the DataFrame created in the `load()` 
method; the LogicalPlan is reused in each `filter()` call and therefore hands out 
the same Column instance for a given column name, such as "value".

This can be confirmed with the statements below:
{code:java}
scala> prcp("value").expr == tmin("value").expr
res5: Boolean = true
{code}
{code:java}
scala> prcp.join(tmin, "date_str").select(prcp("value"), 
tmin("value")).explain(true)
== Parsed Logical Plan ==
Project [value#9L, value#9L]
{code}
{code:java}
 
scala> prcp.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = PRCP)
+- RelationV2[date_str#7, metric#8, value#9L]

scala> tmin.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = TMIN)
+- RelationV2[date_str#7, metric#8, value#9L]{code}
This also explains why there is no such problem when you join two Datasets 
created separately (using two toDF() calls, two different load() calls, etc.): in 
that case two different LogicalPlan instances are created, so prcp("value") and 
tmin("value") refer to different Column instances.

I made a local hack that creates a new LogicalPlan when the user calls `filter()` 
with a different condition expression, and the problem went away. However, I think 
there should be a more thorough fix that properly manages LogicalPlan instances in 
Dataset instead of reusing the same instance across all functions. I will leave 
that to the Spark SQL experts. Hope my debugging above helps a little.

I'm new to Spark, so please let me know if I missed something here :)

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column with the same 
> name, "value", so they are renamed after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}
> import org.apache.spark.sql._
> val sqlc = new SQLContext(sc)
> val weather = sqlc.read.format("json").load("data.json")
> val prcp = weather.filter("metric = 'PRCP'").as("prcp").cache()
> val tmin = weather.filter("metric = 'TMIN'").as("tmin").cache()
> val tmax = weather.filter("metric = 'TMAX'").as("tmax").cache()
> prcp.filter("year=2012 and month=10").show()
> tmin.filter("year=2012 and month=10").show()
> tmax.filter("year=2012 and month=10").show()
> val out = (prcp.join(tmin, "date_str").join(tmax, "date_str")
>   .select(prcp("year"), prcp("month"), prcp("day"), prcp("date_str"),
> prcp("value").alias("PRCP"), tmin("value").alias("TMIN"),
> tmax("value").alias("TMAX")) )
> out.filter("year=2012 and month=10").show()
> {code}
> The output is:
> {code}
> ++---+--+-+---+-++
> |date_str|day|metric|month|station|value|year|
> ++---+--+-+---+-++
> |20121001|  1|  PRCP|   10|USW00023272|0|2012|
> |20121002|  2|  PRCP|   10|USW00023272|0|2012|
> |20121003|  3|  PRCP|   10|USW00023272|0|2012|
> |20121004|  4|  PRCP|   10|USW00023272|0|2012|
> |20121005|  5|  PRCP|   10|USW00023272|0|2012|
> |20121006|  6|  PRCP|   10|USW00023272|0|2012|
> |20121007|  7|  PRCP|   10|USW00023272|0|2012|
> |20121008|  8|  PRCP|   10|USW00023272|0|2012|
> |20121009|  9|  PRCP|   10|USW00023272|0|2012|
> |20121010| 10|  PRCP|   10|USW00023272|0|2012|
> |20121011| 11|  PRCP|   10|USW00023272|3|2012|
> |20121012| 12|  PRCP|   10|USW00023272|0|2012|
> |20121013| 13|  PRCP|   10|USW00023272|0|2012|
> |20121014| 14|  PRCP|   10|USW00023272|0|2012|
> |20121015| 15|  PRCP|   10|USW00023272|0|2012|
> |20121016| 16|  PRCP|   10|USW00023272|0|2012|
> |20121017| 17|  PRCP|   10|USW00023272|0|2012|
> |20121018| 18|  PRCP|   10|USW00023272|0|2012|
> |20121019| 19|  PRCP|   10|USW000

[jira] [Comment Edited] (SPARK-10892) Join with Data Frame returns wrong results

2019-06-16 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865246#comment-16865246
 ] 

Tony Zhang edited comment on SPARK-10892 at 6/17/19 2:37 AM:
-

This issue still exists in the latest Spark code as of today (Spark 3.0.0). 
I took some time debugging and here are my observations:

prcp("value") and tmin("value") refer to the same Column instance, so `select` on 
the joined DataFrame returns the same value `3` for both. That Column instance 
comes from the single LogicalPlan held by the DataFrame created in the `load()` 
method; the LogicalPlan is reused in each `filter()` call and therefore hands out 
the same Column instance for a given column name, such as "value".

This can be confirmed with the statements below:
{code:java}
scala> prcp("value").expr == tmin("value").expr
res5: Boolean = true
{code}
{code:java}
scala> prcp.join(tmin, "date_str").select(prcp("value"), 
tmin("value")).explain(true)
== Parsed Logical Plan ==
Project [value#9L, value#9L]
{code}
{code:java}
 
scala> prcp.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = PRCP)
+- RelationV2[date_str#7, metric#8, value#9L]

scala> tmin.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = TMIN)
+- RelationV2[date_str#7, metric#8, value#9L]{code}
This also explains why there is no such problem when you join two Datasets 
created separately (using two toDF() calls, two different load() calls, etc.): in 
that case two different LogicalPlan instances are created, so prcp("value") and 
tmin("value") refer to different Column instances.

I made a local hack that creates a new LogicalPlan when the user calls `filter()` 
with a different condition expression, and the problem went away. However, I think 
there should be a more thorough fix that properly manages LogicalPlan instances in 
Dataset instead of reusing the same instance across all functions. I will leave 
that to the Spark SQL experts. Hope my debugging above helps a little.

I'm new to Spark, so please let me know if I missed something here :)


was (Author: tonix517):
This issue still exists in the latest Spark code as of today (Spark 
3.0.0). I took some time debugging and here are my observations:

prcp("value") and tmin("value") refer to the same Column instance, so `select` on 
the joined DataFrame returns the same value `3` for both. That Column instance 
comes from the single LogicalPlan held by the DataFrame created in the `load()` 
method; the LogicalPlan is reused in each `filter()` call and therefore hands out 
the same Column instance for a given column name, such as "value".

This can be confirmed with the statements below:
{code:java}
scala> prcp("value").expr == tmin("value").expr
res5: Boolean = true
{code}
{code:java}
scala> prcp.join(tmin, "date_str").select(prcp("value"), 
tmin("value")).explain(true)
== Parsed Logical Plan ==
Project [value#9L, value#9L]
{code}
{code:java}
 
scala> prcp.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = PRCP)
+- RelationV2[date_str#7, metric#8, value#9L]

scala> tmin.explain(true)
== Parsed Logical Plan ==
'Filter ('metric = TMIN)
+- RelationV2[date_str#7, metric#8, value#9L]{code}
This also explains why there is no such problem when you join two Datasets 
created separately (using two toDF() calls, two different load() calls, etc.): in 
that case two different LogicalPlan instances are created, so prcp("value") and 
tmin("value") refer to different Column instances.

I made a local hack that creates a new LogicalPlan when the user calls `filter()` 
with a different condition expression, and the problem went away. However, I think 
there should be a more thorough fix that properly manages LogicalPlan instances in 
Dataset instead of reusing the same instance across all functions. I will leave 
that to the Spark SQL experts. Hope my debugging above helps a little.

I'm new to Spark, so please let me know if I missed something here :)

> Join with Data Frame returns wrong results
> --
>
> Key: SPARK-10892
> URL: https://issues.apache.org/jira/browse/SPARK-10892
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Ofer Mendelevitch
>Priority: Critical
> Attachments: data.json
>
>
> I'm attaching a simplified reproducible example of the problem:
> 1. Loading a JSON file from HDFS as a Data Frame
> 2. Creating 3 data frames: PRCP, TMIN, TMAX
> 3. Joining the data frames together. Each of those has a column with the same 
> name, "value", so they are renamed after the join.
> 4. The output seems incorrect; the first column has the correct values, but 
> the two other columns seem to have a copy of the values from the first column.
> Here's the sample code:
> {code}

[jira] [Commented] (SPARK-27815) do not leak SaveMode to file source v2

2019-06-18 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867136#comment-16867136
 ] 

Tony Zhang commented on SPARK-27815:


[~cloud_fan] Hi Wenchen, do you think the solution mentioned below is viable?

Create a new V2WriteCommand case class (and its Exec), named maybe 
_OverwriteByQueryId_, that accepts a QueryId, to replace WriteToDataSourceV2 so 
that tests can pass.

Or should we keep WriteToDataSourceV2?

> do not leak SaveMode to file source v2
> --
>
> Key: SPARK-27815
> URL: https://issues.apache.org/jira/browse/SPARK-27815
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> Currently there is a hack in `DataFrameWriter`, which passes `SaveMode` to 
> file source v2. This should be removed and file source v2 should not accept 
> SaveMode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-21 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870057#comment-16870057
 ] 

Tony Zhang commented on SPARK-28135:


*For _ceil_ and _floor_:* in sql/catalyst/expressions/mathExpressions.scala, 
the Ceil code looks like this:
{code:java}
override def dataType: DataType = child.dataType match {
  case dt @ DecimalType.Fixed(_, 0) => dt
  case DecimalType.Fixed(precision, scale) =>
DecimalType.bounded(precision - scale + 1, 0)
  case _ => LongType
}

protected override def nullSafeEval(input: Any): Any = child.dataType match {
  case LongType => input.asInstanceOf[Long]
  case DoubleType => f(input.asInstanceOf[Double]).toLong
  case DecimalType.Fixed(_, _) => input.asInstanceOf[Decimal].ceil
}
{code}
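
For context, a quick spark-shell check (my own illustration, not from the ticket) shows how the Double-to-Long conversion saturates at Long.MaxValue, which is exactly the 9223372036854775807 seen in the report:
{code:java}
scala> Math.ceil(1.2345678901234e+200).toLong
res0: Long = 9223372036854775807

scala> Long.MaxValue
res1: Long = 9223372036854775807
{code}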
I don't know why Long is preferred here, but after changing 'LongType' to 
'DoubleType' and removing the 'toLong' conversion, I got the expected output:
{code:java}
spark.sql("select ceil(double(1.2345678901234e+200)), 
ceiling(double(1.2345678901234e+200)), 
floor(double(1.2345678901234e+200))").show

+--+--+---+-+

|CEIL(CAST(1.2345678901234E+200 AS DOUBLE))|CEIL(CAST(1.2345678901234E+200 AS 
DOUBLE))|FLOOR(CAST(1.2345678901234E+200 AS DOUBLE))|POWER(CAST(1 AS DOUBLE), 
CAST(NaN AS DOUBLE))|

+--+--+---+-+

|                       1.2345678901234E200|                       
1.2345678901234E200|                        1.2345678901234E200|

+--+--+---+-+{code}
*For pow*: it returns "NaN" because that is what Java returns for "1.0^NaN":
{code:java}
double num = Double.valueOf("NaN");
System.out.println(Math.pow(1.0, num)); // prints out `NaN`
{code}
 

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-21 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870057#comment-16870057
 ] 

Tony Zhang edited comment on SPARK-28135 at 6/22/19 5:00 AM:
-

*For _ceil_ and _floor_:* in sql/catalyst/expressions/mathExpressions.scala, 
the Ceil code looks like this:
{code:java}
override def dataType: DataType = child.dataType match {
  case dt @ DecimalType.Fixed(_, 0) => dt
  case DecimalType.Fixed(precision, scale) =>
DecimalType.bounded(precision - scale + 1, 0)
  case _ => LongType
}

protected override def nullSafeEval(input: Any): Any = child.dataType match {
  case LongType => input.asInstanceOf[Long]
  case DoubleType => f(input.asInstanceOf[Double]).toLong
  case DecimalType.Fixed(_, _) => input.asInstanceOf[Decimal].ceil
}
{code}
I don't know why Long is preferred here, but after changing 'LongType' to 
'DoubleType' and removing the 'toLong' conversion, I got the expected output:
{code:java}
spark.sql("select ceil(double(1.2345678901234e+200)), 
ceiling(double(1.2345678901234e+200)), 
floor(double(1.2345678901234e+200))").show

+--+--+---+-+

|CEIL(CAST(1.2345678901234E+200 AS DOUBLE))|CEIL(CAST(1.2345678901234E+200 AS 
DOUBLE))|FLOOR(CAST(1.2345678901234E+200 AS DOUBLE))|POWER(CAST(1 AS DOUBLE), 
CAST(NaN AS DOUBLE))|

+--+--+---+-+

|                       1.2345678901234E200|                       
1.2345678901234E200|                        1.2345678901234E200|

+--+--+---+-+{code}
*For pow*: it returns "NaN" because that is what Java returns for "1.0^NaN":
{code:java}
double num = Double.valueOf("NaN");
System.out.println(Math.pow(1.0, num)); // prints out `NaN`
{code}
 


was (Author: tonix517):
*For _ceil_ and _floor_:* in sql/catalyst/expressions/mathExpressions.scala, 
the Ceil code is like below:
{code:java}
override def dataType: DataType = child.dataType match {
  case dt @ DecimalType.Fixed(_, 0) => dt
  case DecimalType.Fixed(precision, scale) =>
DecimalType.bounded(precision - scale + 1, 0)
  case _ => LongType
}

protected override def nullSafeEval(input: Any): Any = child.dataType match {
  case LongType => input.asInstanceOf[Long]
  case DoubleType => f(input.asInstanceOf[Double]).toLong
  case DecimalType.Fixed(_, _) => input.asInstanceOf[Decimal].ceil
}
{code}
I don't know ** why Long is prefered here, but after changing 'LongType' into 
'DoubleType', and then removed the 'toLong' conversion, I got expected output:
{code:java}
spark.sql("select ceil(double(1.2345678901234e+200)), 
ceiling(double(1.2345678901234e+200)), 
floor(double(1.2345678901234e+200))").show

+--+--+---+-+

|CEIL(CAST(1.2345678901234E+200 AS DOUBLE))|CEIL(CAST(1.2345678901234E+200 AS 
DOUBLE))|FLOOR(CAST(1.2345678901234E+200 AS DOUBLE))|POWER(CAST(1 AS DOUBLE), 
CAST(NaN AS DOUBLE))|

+--+--+---+-+

|                       1.2345678901234E200|                       
1.2345678901234E200|                        1.2345678901234E200|

+--+--+---+-+{code}
*Fo**r pow*: it returns "NaN" because that's what Java returns with "1.0^NaN", 
{code:java}
double num = Double.valueOf("NaN");
System.out.println(Math.pow(1.0, num)); // prints out `NaN`
{code}
 

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8

[jira] [Comment Edited] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-21 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870057#comment-16870057
 ] 

Tony Zhang edited comment on SPARK-28135 at 6/22/19 5:02 AM:
-

*For _ceil_ and _floor_:* in sql/catalyst/expressions/mathExpressions.scala, 
the Ceil code looks like this:
{code:java}
override def dataType: DataType = child.dataType match {
  case dt @ DecimalType.Fixed(_, 0) => dt
  case DecimalType.Fixed(precision, scale) =>
DecimalType.bounded(precision - scale + 1, 0)
  case _ => LongType
}

protected override def nullSafeEval(input: Any): Any = child.dataType match {
  case LongType => input.asInstanceOf[Long]
  case DoubleType => f(input.asInstanceOf[Double]).toLong
  case DecimalType.Fixed(_, _) => input.asInstanceOf[Decimal].ceil
}
{code}
9223372036854775807 is the maximum value of LongType (2^63 - 1), and 
1.2345678901234e+200 is far larger than that, so the LongType maximum value ends 
up in the result here.

I don't know why Long is preferred here, but after changing 'LongType' to 
'DoubleType' and removing the 'toLong' conversion, I got the expected output:
{code:java}
spark.sql("select ceil(double(1.2345678901234e+200)), 
ceiling(double(1.2345678901234e+200)), 
floor(double(1.2345678901234e+200))").show

+--+--+---+-+

|CEIL(CAST(1.2345678901234E+200 AS DOUBLE))|CEIL(CAST(1.2345678901234E+200 AS 
DOUBLE))|FLOOR(CAST(1.2345678901234E+200 AS DOUBLE))|POWER(CAST(1 AS DOUBLE), 
CAST(NaN AS DOUBLE))|

+--+--+---+-+

|                       1.2345678901234E200|                       
1.2345678901234E200|                        1.2345678901234E200|

+--+--+---+-+{code}
*For pow*: it returns "NaN" because that is what Java returns for "1.0^NaN":
{code:java}
double num = Double.valueOf("NaN");
System.out.println(Math.pow(1.0, num)); // prints out `NaN`
{code}
 


was (Author: tonix517):
*For _ceil_ and _floor_:* in sql/catalyst/expressions/mathExpressions.scala, 
the Ceil code is like below:
{code:java}
override def dataType: DataType = child.dataType match {
  case dt @ DecimalType.Fixed(_, 0) => dt
  case DecimalType.Fixed(precision, scale) =>
DecimalType.bounded(precision - scale + 1, 0)
  case _ => LongType
}

protected override def nullSafeEval(input: Any): Any = child.dataType match {
  case LongType => input.asInstanceOf[Long]
  case DoubleType => f(input.asInstanceOf[Double]).toLong
  case DecimalType.Fixed(_, _) => input.asInstanceOf[Decimal].ceil
}
{code}
I don't know why Long is preferred here, but after changing 'LongType' into 
'DoubleType', and then removed the 'toLong' conversion, I got expected output:
{code:java}
spark.sql("select ceil(double(1.2345678901234e+200)), 
ceiling(double(1.2345678901234e+200)), 
floor(double(1.2345678901234e+200))").show

+--+--+---+-+

|CEIL(CAST(1.2345678901234E+200 AS DOUBLE))|CEIL(CAST(1.2345678901234E+200 AS 
DOUBLE))|FLOOR(CAST(1.2345678901234E+200 AS DOUBLE))|POWER(CAST(1 AS DOUBLE), 
CAST(NaN AS DOUBLE))|

+--+--+---+-+

|                       1.2345678901234E200|                       
1.2345678901234E200|                        1.2345678901234E200|

+--+--+---+-+{code}
*Fo**r pow*: it returns "NaN" because that's what Java returns with "1.0^NaN", 
{code:java}
double num = Double.valueOf("NaN");
System.out.println(Math.pow(1.0, num)); // prints out `NaN`
{code}
 

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   922337203

[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-22 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870319#comment-16870319
 ] 

Tony Zhang commented on SPARK-28135:


[~yumwang] OK, on my way. BTW, how can I assign the ticket to myself?

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28134) Trigonometric Functions

2019-06-23 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870700#comment-16870700
 ] 

Tony Zhang commented on SPARK-28134:


Hi [~yumwang], I can see these trigonometric functions are defined and 
implemented in mathExpressions.scala, and they already work in queries. Or did 
you mean that these functions are not supported in certain use cases?
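
In case the gap is the degree-based variants from the table (cosd, asind, atan2d, ...), a hedged workaround sketch (my own example) is to go through the radians()/degrees() helpers that Spark SQL already provides:

{code:java}
// Sketch: emulate the degree-based variants with radians()/degrees().
// Expected: cosd_60 is about 0.5 and asind_1 is 90.0.
spark.sql("SELECT cos(radians(60.0)) AS cosd_60, degrees(asin(1.0)) AS asind_1").show()
{code}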

> Trigonometric Functions
> ---
>
> Key: SPARK-28134
> URL: https://issues.apache.org/jira/browse/SPARK-28134
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function (radians)||Function (degrees)||Description||
> |{{acos(_x_}})|{{acosd(_x_}})|inverse cosine|
> |{{asin(_x_}})|{{asind(_x_}})|inverse sine|
> |{{atan(_x_}})|{{atand(_x_}})|inverse tangent|
> |{{atan2(_y_}}, _{{x}}_)|{{atan2d(_y_}}, _{{x}}_)|inverse tangent of 
> {{_y_}}/_{{x}}_|
> |{{cos(_x_}})|{{cosd(_x_}})|cosine|
> |{{cot(_x_}})|{{cotd(_x_}})|cotangent|
> |{{sin(_x_}})|{{sind(_x_}})|sine|
> |{{tan(_x_}})|{{tand(_x_}})|tangent|
>  
> [https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-TRIG-TABLE]
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28133) Hyperbolic Functions

2019-06-23 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870711#comment-16870711
 ] 

Tony Zhang commented on SPARK-28133:


sinh/cosh/tanh exist in the code and work in SQL.

asinh/acosh/atanh are missing.
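
Until they are added natively, a hedged workaround sketch (my own example) expresses the missing inverse functions through the standard log/sqrt identities, e.g. asinh(x) = ln(x + sqrt(x^2 + 1)):

{code:java}
// Sketch using the standard identities, not a Spark API:
//   asinh(x) = ln(x + sqrt(x^2 + 1))
//   acosh(x) = ln(x + sqrt(x^2 - 1)),       x >= 1
//   atanh(x) = 0.5 * ln((1 + x) / (1 - x)), |x| < 1
spark.sql("""
  SELECT log(0.5 + sqrt(0.5 * 0.5 + 1))    AS asinh_0_5,
         log(2.0 + sqrt(2.0 * 2.0 - 1))    AS acosh_2,
         0.5 * log((1 + 0.5) / (1 - 0.5))  AS atanh_0_5
""").show()
{code}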

> Hyperbolic Functions
> 
>
> Key: SPARK-28133
> URL: https://issues.apache.org/jira/browse/SPARK-28133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Description||Example||Result||
> |{{sinh(_x_)}}|hyperbolic sine|{{sinh(0)}}|{{0}}|
> |{{cosh(_x_)}}|hyperbolic cosine|{{cosh(0)}}|{{1}}|
> |{{tanh(_x_)}}|hyperbolic tangent|{{tanh(0)}}|{{0}}|
> |{{asinh(_x_)}}|inverse hyperbolic sine|{{asinh(0)}}|{{0}}|
> |{{acosh(_x_)}}|inverse hyperbolic cosine|{{acosh(1)}}|{{0}}|
> |{{atanh(_x_)}}|inverse hyperbolic tangent|{{atanh(0)}}|{{0}}|
>  
>  
> [https://www.postgresql.org/docs/12/functions-math.html#FUNCTIONS-MATH-HYP-TABLE]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-24 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872015#comment-16872015
 ] 

Tony Zhang commented on SPARK-28135:


HIVE-21916

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-25 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872657#comment-16872657
 ] 

Tony Zhang commented on SPARK-28135:


As commented in the PR, to fix this overflow we should not change the Ceil return 
type to double; instead we would have to add support for a 128-bit integer type in 
the code base, which is a different story.

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-07-04 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16878830#comment-16878830
 ] 

Tony Zhang commented on SPARK-28189:


Working on a fix now.

> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Luke
>Priority: Minor
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop('caps') # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28189) Pyspark - df.drop() is Case Sensitive when Referring to Upstream Tables

2019-07-06 Thread Tony Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879796#comment-16879796
 ] 

Tony Zhang commented on SPARK-28189:


[~dongjoon] Thrilled, thanks for your help!

> Pyspark  - df.drop() is Case Sensitive when Referring to Upstream Tables
> 
>
> Key: SPARK-28189
> URL: https://issues.apache.org/jira/browse/SPARK-28189
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Luke
>Assignee: Tony Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>
> Column names in general are case insensitive in Pyspark, and df.drop() in 
> general is also case insensitive.
> However, when referring to an upstream table, such as from a join, e.g.
> {code:java}
> vals1 = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
> df1 = spark.createDataFrame(vals1, ['KEY','field'])
> vals2 = [('Rutabaga',1),('Pirate',2),('Ninja',3),('Darth Vader',4)]
> df2 = spark.createDataFrame(vals2, ['KEY','CAPS'])
> df_joined = df1.join(df2, df1['key'] == df2['key'], "left")
> {code}
>  
> drop will become case sensitive. e.g.
> {code:java}
> # from above, df1 consists of columns ['KEY', 'field']
> # from above, df2 consists of columns ['KEY', 'CAPS']
> df_joined.select(df2['key']) # will give a result
> df_joined.drop('caps') # will also give a result
> {code}
> however, note the following
> {code:java}
> df_joined.drop(df2['key']) # no-op
> df_joined.drop(df2['caps']) # no-op
> df_joined.drop(df2['KEY']) # will drop column as expected
> df_joined.drop(df2['CAPS']) # will drop column as expected
> {code}
>  
>  
> so in summary, using df.drop(df2['col']) doesn't align with expected case 
> insensitivity for column names, even though functions like select, join, and 
> dropping a column generally are case insensitive.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

2021-07-16 Thread Tony Zhang (Jira)
Tony Zhang created SPARK-36187:
--

 Summary: Commit collision avoidance in dynamicPartitionOverwrite 
for non-Parquet formats
 Key: SPARK-36187
 URL: https://issues.apache.org/jira/browse/SPARK-36187
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 3.1.2
Reporter: Tony Zhang


Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

To my understanding, the PR introduces a different staging directory at job 
commit to avoid commit collisions. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 however in the current Spark repo OUTPUT_COMMITTER_CLASS seems to be set only for 
the Parquet format: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However, I didn't find similar behavior in the ORC-related code. Does that mean 
the new staging directory will not take effect for non-Parquet formats? Could 
that be a potential problem, or am I missing something here?

Thanks!
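
For reference, a minimal sketch of the write path in question (my own example; the path, column name, and data are made up, and spark.sql.sources.outputCommitterClass is assumed to be left unset, as it would be for ORC):

{code:java}
import org.apache.spark.sql.functions.lit

// Dynamic partition overwrite to a non-Parquet (ORC) target.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val df = spark.range(10).toDF("id").withColumn("dt", lit("2021-07-16"))

df.write
  .mode("overwrite")
  .format("orc")              // non-Parquet, so ParquetFileFormat's committer setup does not apply
  .partitionBy("dt")
  .save("/tmp/spark36187_orc_demo")
{code}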



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

2021-07-16 Thread Tony Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Zhang updated SPARK-36187:
---
   Shepherd: Wenchen Fan
Description: 
Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

To my understanding, the PR is to introduce a different staging directory at 
job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for 
parquet formats: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However I didn't find similar behavior in Orc related code. Does it mean that 
this new staging directory will not take effect for non-Parquet formats? Could 
that be a potential problem? or am I missing something here?

Thanks!

[~duripeng] [~dagrawal3409]

  was:
Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

To my understanding, the PR is to introduce a different staging directory at 
job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for 
parquet formats: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However I didn't find similar behavior in Orc related code. Does it mean that 
this new staging directory will not take effect for non-Parquet formats? Could 
that be a potential problem? or am I missing something here?

Thanks!


> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet 
> formats
> ---
>
> Key: SPARK-36187
> URL: https://issues.apache.org/jira/browse/SPARK-36187
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tony Zhang
>Priority: Minor
>
> Hi, my question here is specifically about [PR 
> #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
> SPARK-29302.
> To my understanding, the PR is to introduce a different staging directory at 
> job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, 
> the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is 
> not null: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
>  however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for 
> parquet formats: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].
> However I didn't find similar behavior in Orc related code. Does it mean that 
> this new staging directory will not take effect for non-Parquet formats? 
> Could that be a potential problem? or am I missing something here?
> Thanks!
> [~duripeng] [~dagrawal3409]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

2021-07-16 Thread Tony Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Zhang updated SPARK-36187:
---
Description: 
Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

To my understanding, the PR is to introduce a different staging directory at 
job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet 
formats: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However, I didn't find similar behavior in the ORC-related code to set that 
config. If I understand it correctly, without SQLConf.OUTPUT_COMMITTER_CLASS being 
set (as is the case for the ORC format), SQLHadoopMapReduceCommitProtocol will 
still use the original staging directory, which may void the fix from the PR and 
the commit collision may still happen; the fix would then be effective only for 
Parquet, not for non-Parquet formats.

Could someone confirm whether this is a potential problem? Thanks!

[~duripeng] [~dagrawal3409]

  was:
Hi, my question here is specifically about [PR 
#29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
SPARK-29302.

To my understanding, the PR is to introduce a different staging directory at 
job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, the 
new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is not 
null: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
 however in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for 
parquet formats: 
[code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].

However I didn't find similar behavior in Orc related code. Does it mean that 
this new staging directory will not take effect for non-Parquet formats? Could 
that be a potential problem? or am I missing something here?

Thanks!

[~duripeng] [~dagrawal3409]


> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet 
> formats
> ---
>
> Key: SPARK-36187
> URL: https://issues.apache.org/jira/browse/SPARK-36187
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tony Zhang
>Priority: Minor
>
> Hi, my question here is specifically about [PR 
> #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
> SPARK-29302.
> To my understanding, the PR is to introduce a different staging directory at 
> job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, 
> the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is 
> not null: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
>  and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet 
> formats: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].
> However, I didn't find similar behavior in the ORC-related code to set that 
> config. If I understand it correctly, without SQLConf.OUTPUT_COMMITTER_CLASS 
> being set (as is the case for the ORC format), SQLHadoopMapReduceCommitProtocol 
> will still use the original staging directory, which may void the fix from the 
> PR and the commit collision may still happen; the fix would then be effective 
> only for Parquet, not for non-Parquet formats.
> Could someone confirm whether this is a potential problem? Thanks!
> [~duripeng] [~dagrawal3409]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36187) Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet formats

2021-07-20 Thread Tony Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17384344#comment-17384344
 ] 

Tony Zhang commented on SPARK-36187:


Thanks Hyukjin, will do.

> Commit collision avoidance in dynamicPartitionOverwrite for non-Parquet 
> formats
> ---
>
> Key: SPARK-36187
> URL: https://issues.apache.org/jira/browse/SPARK-36187
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tony Zhang
>Priority: Minor
>
> Hi, my question here is specifically about [PR 
> #29000|https://github.com/apache/spark/pull/29000/files#r649580767] for 
> SPARK-29302.
> To my understanding, the PR is to introduce a different staging directory at 
> job commit to avoid commit collision. In SQLHadoopMapReduceCommitProtocol, 
> the new staging directory is only set when SQLConf.OUTPUT_COMMITTER_CLASS is 
> not null: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/SQLHadoopMapReduceCommitProtocol.scala#L58],
>  and in current Spark repo, OUTPUT_COMMITTER_CLASS seems set only for parquet 
> formats: 
> [code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L96].
> However, I didn't find similar behavior in the ORC-related code to set that 
> config. If I understand it correctly, without SQLConf.OUTPUT_COMMITTER_CLASS 
> being set (as is the case for the ORC format), SQLHadoopMapReduceCommitProtocol 
> will still use the original staging directory, which may void the fix from the 
> PR and the commit collision may still happen; the fix would then be effective 
> only for Parquet, not for non-Parquet formats.
> Could someone confirm whether this is a potential problem? Thanks!
> [~duripeng] [~dagrawal3409]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org