[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899790#comment-16899790
 ] 

Dongjoon Hyun commented on SPARK-28605:
---

Oh, thank you, [~zsxwing]!

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regression
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.
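
A minimal PySpark sketch of the behavior being described, assuming a writer whose 
open() always returns false so every partition should be skipped (illustrative 
only, not the reporter's workload):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreach-open-false").getOrCreate()

class DropAllWriter:
    """A ForeachWriter-style object whose open() always returns False."""
    def open(self, partition_id, epoch_id):
        return False   # v1 skips the partition entirely; v2 still reads all of its data
    def process(self, row):
        pass           # never called when open() returns False
    def close(self, error):
        pass

query = (spark.readStream.format("rate").load()
         .writeStream.foreach(DropAllWriter())
         .start())
{code}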






[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899789#comment-16899789
 ] 

Shixiong Zhu commented on SPARK-28605:
--

By the way, this is not a critical regression. It's not easy to hit this issue.

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regression
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.






[jira] [Commented] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Liang-Chi Hsieh (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899788#comment-16899788
 ] 

Liang-Chi Hsieh commented on SPARK-28422:
-

Thanks [~dongjoon]!


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
>  
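
As the issue title suggests, the failure is specific to the SQL path without a 
GROUP BY. A hedged, unverified sketch of the grouped form that the title implies 
does work (the constant bucket expression is only for illustration):

{code:python}
# Hypothetical workaround sketch: an explicit GROUP BY should make the planner
# produce an Aggregate (AggregateInPandas) instead of a plain Project.
spark.sql("SELECT max_udf(id) FROM table GROUP BY id % 1").show()
{code}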






[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Shixiong Zhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899787#comment-16899787
 ] 

Shixiong Zhu commented on SPARK-28605:
--

This is a regression in all 2.4.x releases. It's caused by SPARK-23099.

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regression
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.






[jira] [Updated] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-28605:
-
Affects Version/s: 2.4.0
   2.4.1
   2.4.2

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regression
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.






[jira] [Commented] (SPARK-28605) Performance regression in SS's foreach

2019-08-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899784#comment-16899784
 ] 

Dongjoon Hyun commented on SPARK-28605:
---

Hi, [~zsxwing]. Is this a regression in 2.4.3?

> Performance regression in SS's foreach
> --
>
> Key: SPARK-28605
> URL: https://issues.apache.org/jira/browse/SPARK-28605
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: regression
>
> When "ForeachWriter.open" return "false", ForeachSink v1 will skip the whole 
> partition without reading data. But in ForeachSink v2, due to the API 
> limitation, it needs to read the whole partition even if all data just gets 
> dropped.






[jira] [Commented] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899781#comment-16899781
 ] 

Dongjoon Hyun commented on SPARK-28422:
---

Thank you for reporting, [~icexelloss], and thank you for making a PR, 
[~viirya]. Since this has not worked since 2.4.0, I updated the affected 
versions, too.

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
>  






[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28422:
--
Affects Version/s: 2.4.0

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
>  






[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28422:
--
Description: 
 
{code:python}
from pyspark.sql.functions import pandas_udf, PandasUDFType
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
 

  was:
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
 


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.1, 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:python}
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == 

[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28422:
--
Affects Version/s: 2.4.1

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.1, 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
>  






[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28422:
--
Affects Version/s: 2.4.2

> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.2, 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) Project [max_udf(id#0L) AS max_udf(id)#136]
> +- *(1) Range (0, 100, step=1, splits=4)
> {code}
>  






[jira] [Updated] (SPARK-28422) GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28422:
--
Description: 
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
spark.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
 

  was:
 
{code:java}
@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def max_udf(v):
return v.max()

df = spark.range(0, 100)
df.udf.register('max_udf', max_udf)
df.createTempView('table')

# A. This works
df.agg(max_udf(df['id'])).show()

# B. This doesn't work
spark.sql("select max_udf(id) from table").show(){code}
 

 

Query plan:

A:
{code:java}
== Parsed Logical Plan ==

'Aggregate [max_udf('id) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Analyzed Logical Plan ==

max_udf(id): double

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Optimized Logical Plan ==

Aggregate [max_udf(id#64L) AS max_udf(id)#140]

+- Range (0, 1000, step=1, splits=Some(4))




== Physical Plan ==

!AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]

+- Exchange SinglePartition

   +- *(1) Range (0, 1000, step=1, splits=4)
{code}
B:
{code:java}
== Parsed Logical Plan ==

'Project [unresolvedalias('max_udf('id), None)]

+- 'UnresolvedRelation [table]




== Analyzed Logical Plan ==

max_udf(id): double

Project [max_udf(id#0L) AS max_udf(id)#136]

+- SubqueryAlias `table`

   +- Range (0, 100, step=1, splits=Some(4))




== Optimized Logical Plan ==

Project [max_udf(id#0L) AS max_udf(id)#136]

+- Range (0, 100, step=1, splits=Some(4))




== Physical Plan ==

*(1) Project [max_udf(id#0L) AS max_udf(id)#136]

+- *(1) Range (0, 100, step=1, splits=4)
{code}
 


> GROUPED_AGG pandas_udf doesn't work with spark.sql() without group by clause
> ---
>
> Key: SPARK-28422
> URL: https://issues.apache.org/jira/browse/SPARK-28422
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.3
>Reporter: Li Jin
>Priority: Major
>
>  
> {code:java}
> @pandas_udf('double', PandasUDFType.GROUPED_AGG)
> def max_udf(v):
> return v.max()
> df = spark.range(0, 100)
> spark.udf.register('max_udf', max_udf)
> df.createTempView('table')
> # A. This works
> df.agg(max_udf(df['id'])).show()
> # B. This doesn't work
> spark.sql("select max_udf(id) from table").show(){code}
>  
>  
> Query plan:
> A:
> {code:java}
> == Parsed Logical Plan ==
> 'Aggregate [max_udf('id) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Analyzed Logical Plan ==
> max_udf(id): double
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Aggregate [max_udf(id#64L) AS max_udf(id)#140]
> +- Range (0, 1000, step=1, splits=Some(4))
> == Physical Plan ==
> !AggregateInPandas [max_udf(id#64L)], [max_udf(id)#138 AS max_udf(id)#140]
> +- Exchange SinglePartition
>    +- *(1) Range (0, 1000, step=1, splits=4)
> {code}
> B:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [unresolvedalias('max_udf('id), None)]
> +- 'UnresolvedRelation [table]
> == Analyzed Logical Plan ==
> max_udf(id): double
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- SubqueryAlias `table`
>    +- Range (0, 100, step=1, splits=Some(4))
> == Optimized Logical Plan ==
> Project [max_udf(id#0L) AS max_udf(id)#136]
> +- Range (0, 100, step=1, splits=Some(4))
> == Physical Plan ==
> *(1) 

[jira] [Comment Edited] (SPARK-28580) ANSI SQL: unique predicate

2019-08-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899773#comment-16899773
 ] 

Dongjoon Hyun edited comment on SPARK-28580 at 8/5/19 5:08 AM:
---

Hi, [~beliefer]. Please make a PR with a working implementation. Thanks!


was (Author: dongjoon):
Please make a PR with a working implementation.

> ANSI SQL: unique predicate
> --
>
> Key: SPARK-28580
> URL: https://issues.apache.org/jira/browse/SPARK-28580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Format
> {code:java}
> <unique predicate> ::=
>  UNIQUE <table subquery>{code}






[jira] [Commented] (SPARK-28580) ANSI SQL: unique predicate

2019-08-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899773#comment-16899773
 ] 

Dongjoon Hyun commented on SPARK-28580:
---

Please make a PR with a working implementation.

> ANSI SQL: unique predicate
> --
>
> Key: SPARK-28580
> URL: https://issues.apache.org/jira/browse/SPARK-28580
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Format
> {code:java}
> <unique predicate> ::=
>  UNIQUE <table subquery>{code}






[jira] [Resolved] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-08-04 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz resolved SPARK-27661.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Resolved by [https://github.com/apache/spark/pull/24560]

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.






[jira] [Assigned] (SPARK-27661) Add SupportsNamespaces interface for v2 catalogs

2019-08-04 Thread Burak Yavuz (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz reassigned SPARK-27661:
---

Assignee: Ryan Blue

> Add SupportsNamespaces interface for v2 catalogs
> 
>
> Key: SPARK-27661
> URL: https://issues.apache.org/jira/browse/SPARK-27661
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
>
> Some catalogs support namespace operations, like creating or dropping 
> namespaces. The v2 API should have a way to expose these operations to Spark.






[jira] [Assigned] (SPARK-28616) Improve merge-spark-pr script to warn WIP PRs and strip trailing dots

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28616:
-

Assignee: Dongjoon Hyun

> Improve merge-spark-pr script to warn WIP PRs and strip trailing dots
> -
>
> Key: SPARK-28616
> URL: https://issues.apache.org/jira/browse/SPARK-28616
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to improve the `merge-spark-pr` script.
> - `[WIP]` is useful to show that a PR is not ready for merge. Apache 
> Spark allows merging `WIP` PRs. However, sometimes we accidentally forget to 
> clean up the title of a completed PR. We had better warn once more during the 
> merge stage and get a confirmation from the committers.
> - We have two kinds of PR titles in terms of the ending period. This issue 
> aims to remove the trailing `dot`, since shorter is better in a PR 
> title. Also, PR titles without a trailing `dot` are dominant in the 
> commit logs.
> {code}
> $ git log --oneline | grep '[.]$' | wc -l
> 4090
> $ git log --oneline | grep '[^.]$' | wc -l
>20747
> {code}






[jira] [Resolved] (SPARK-28616) Improve merge-spark-pr script to warn WIP PRs and strip trailing dots

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28616.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25356
[https://github.com/apache/spark/pull/25356]

> Improve merge-spark-pr script to warn WIP PRs and strip trailing dots
> -
>
> Key: SPARK-28616
> URL: https://issues.apache.org/jira/browse/SPARK-28616
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0
>
>
> This issue aims to improve the `merge-spark-pr` script.
> - `[WIP]` is useful to show that a PR is not ready for merge. Apache 
> Spark allows merging `WIP` PRs. However, sometimes we accidentally forget to 
> clean up the title of a completed PR. We had better warn once more during the 
> merge stage and get a confirmation from the committers.
> - We have two kinds of PR titles in terms of the ending period. This issue 
> aims to remove the trailing `dot`, since shorter is better in a PR 
> title. Also, PR titles without a trailing `dot` are dominant in the 
> commit logs.
> {code}
> $ git log --oneline | grep '[.]$' | wc -l
> 4090
> $ git log --oneline | grep '[^.]$' | wc -l
>20747
> {code}






[jira] [Created] (SPARK-28617) Completely remove comments from the gold result file

2019-08-04 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28617:
---

 Summary: Completely remove comments from the gold result file
 Key: SPARK-28617
 URL: https://issues.apache.org/jira/browse/SPARK-28617
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang









[jira] [Created] (SPARK-28616) Improve merge-spark-pr script to warn WIP PRs and strip trailing dots

2019-08-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28616:
-

 Summary: Improve merge-spark-pr script to warn WIP PRs and strip 
trailing dots
 Key: SPARK-28616
 URL: https://issues.apache.org/jira/browse/SPARK-28616
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This issue aims to improve the `merge-spark-pr` script.
- `[WIP]` is useful to show that a PR is not ready for merge. Apache Spark 
allows merging `WIP` PRs. However, sometimes we accidentally forget to clean up 
the title of a completed PR. We had better warn once more during the merge 
stage and get a confirmation from the committers.
- We have two kinds of PR titles in terms of the ending period. This issue aims 
to remove the trailing `dot`, since shorter is better in a PR title. 
Also, PR titles without a trailing `dot` are dominant in the commit logs.
{code}
$ git log --oneline | grep '[.]$' | wc -l
4090
$ git log --oneline | grep '[^.]$' | wc -l
   20747
{code}
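
A minimal sketch of the two title checks described above (illustrative only, not 
the actual merge-spark-pr code; the function name and prompt text are made up):

{code:python}
import re

def clean_pr_title(title):
    """Warn about [WIP] titles and strip a trailing dot, as described above."""
    if re.search(r"\[WIP\]", title, re.IGNORECASE):
        answer = input("The title still contains [WIP]. Continue merging? (y/n): ")
        if answer.lower() != "y":
            raise SystemExit("Merge aborted: remove [WIP] from the title first.")
    return title.rstrip(".")  # drop any trailing period from the commit title

print(clean_pr_title("[SPARK-28616][INFRA] Improve merge script."))
# -> [SPARK-28616][INFRA] Improve merge script
{code}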






[jira] [Created] (SPARK-28615) Add a guideline for DataFrame functions saying the Column signature function is the default

2019-08-04 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-28615:
--

 Summary: Add a guideline for DataFrame functions saying the Column 
signature function is the default
 Key: SPARK-28615
 URL: https://issues.apache.org/jira/browse/SPARK-28615
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 2.4.3
Reporter: Weichen Xu


Add a guideline for DataFrame functions saying that the Column signature function 
is the default, like:
{code}
These function APIs usually have methods with the Column signature only, because they 
can support not only Column but also other types such as a native string. The 
other variants currently exist for historical reasons.
{code}
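
An illustrative PySpark analogue of the point being documented; it assumes the 
usual pyspark.sql.functions behaviour where most functions accept either a Column 
or a column-name string:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("spark",), ("jira",)], ["name"])

# Column signature: the primary, always-available variant
df.select(F.upper(F.col("name"))).show()

# The same function also accepts a plain column-name string
df.select(F.upper("name")).show()
{code}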







[jira] [Resolved] (SPARK-28614) Do not remove leading white space in the golden result file

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28614.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25351
[https://github.com/apache/spark/pull/25351]

> Do not remove leading white space in the golden result file
> ---
>
> Key: SPARK-28614
> URL: https://issues.apache.org/jira/browse/SPARK-28614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Do not remove leading white space in the golden result file.






[jira] [Assigned] (SPARK-28614) Do not remove leading white space in the golden result file

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28614:
-

Assignee: Yuming Wang

> Do not remove leading white space in the golden result file
> ---
>
> Key: SPARK-28614
> URL: https://issues.apache.org/jira/browse/SPARK-28614
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Do not remove leading white space in the golden result file.






[jira] [Resolved] (SPARK-28604) Use log1p(x) instead of log(1+x) and expm1(x) instead of exp(x)-1

2019-08-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-28604.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25337
[https://github.com/apache/spark/pull/25337]

> Use log1p(x) instead of log(1+x) and expm1(x) instead of exp(x)-1
> -
>
> Key: SPARK-28604
> URL: https://issues.apache.org/jira/browse/SPARK-28604
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.3
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Following https://issues.apache.org/jira/browse/SPARK-28519, I noticed that 
> there are a number of places in the MLlib code that may be, for example, 
> computing log(1+x) where x is very small. All math libraries specialize for 
> this case with log1p(x) for better accuracy. Same for expm1. It shouldn't 
> hurt to use log1p/expm1 where possible and in a few cases I think it will 
> improve accuracy, as sometimes x is for example a very small probability.
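
A small numeric illustration of the accuracy argument (plain Python math, not 
MLlib code):

{code:python}
import math

x = 1e-12
print(math.log(1 + x))  # ~1.000089e-12: 1 + x already lost precision before the log
print(math.log1p(x))    # ~1.000000e-12: accurate to about machine precision
print(math.exp(x) - 1)  # ~1.000089e-12: same cancellation problem
print(math.expm1(x))    # ~1.000000e-12: accurate
{code}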






[jira] [Created] (SPARK-28614) Do not remove leading white space in the golden result file

2019-08-04 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-28614:
---

 Summary: Do not remove leading white space in the golden result 
file
 Key: SPARK-28614
 URL: https://issues.apache.org/jira/browse/SPARK-28614
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Yuming Wang


Do not remove leading white space in the golden result file.






[jira] [Comment Edited] (SPARK-28527) Build a Test Framework for Thriftserver

2019-08-04 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899368#comment-16899368
 ] 

Yuming Wang edited comment on SPARK-28527 at 8/4/19 5:23 PM:
-

While building this test framework, I found one inconsistent behavior:

[SPARK-28461] Thrift Server pad decimal numbers with trailing zeros to the 
scale of the column:


was (Author: q79969786):
When building this test framework, found one inconsistent behavior:

[SPARK-28461] Thrift Server pad decimal numbers with trailing zeros to the 
scale of the column:
!image-2019-08-03-13-58-06-106.png!

> Build a Test Framework for Thriftserver
> ---
>
> Key: SPARK-28527
> URL: https://issues.apache.org/jira/browse/SPARK-28527
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28463 shows we need to improve 
> the overall test coverage for Thrift-server. 
> We can build a test framework that can directly re-run all the tests in 
> SQLQueryTestSuite via Thrift Server. This can help our test coverage of BI 
> use cases. 






[jira] [Updated] (SPARK-28527) Build a Test Framework for Thriftserver

2019-08-04 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-28527:

Attachment: (was: image-2019-08-03-13-58-06-106.png)

> Build a Test Framework for Thriftserver
> ---
>
> Key: SPARK-28527
> URL: https://issues.apache.org/jira/browse/SPARK-28527
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-28463 shows we need to improve 
> the overall test coverage for Thrift-server. 
> We can build a test framework that can directly re-run all the tests in 
> SQLQueryTestSuite via Thrift Server. This can help our test coverage of BI 
> use cases. 






[jira] [Commented] (SPARK-28471) Formatting dates with negative years

2019-08-04 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899657#comment-16899657
 ] 

Yuming Wang commented on SPARK-28471:
-

Thank you, [~maxgekk]. I see.

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users, since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are out of the supported range of the DATE type, it 
> would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>make_date   
> ---
>  0044-03-15 BC
> (1 row)
> {code}






[jira] [Resolved] (SPARK-28609) Fix broken styles/links and make up-to-date

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28609.
---
   Resolution: Fixed
Fix Version/s: 2.4.4
   3.0.0

Issue resolved by pull request 25345
[https://github.com/apache/spark/pull/25345]

> Fix broken styles/links and make up-to-date
> ---
>
> Key: SPARK-28609
> URL: https://issues.apache.org/jira/browse/SPARK-28609
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.0.0, 2.4.4
>
>







[jira] [Assigned] (SPARK-28609) Fix broken styles/links and make up-to-date

2019-08-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-28609:
-

Assignee: Dongjoon Hyun

> Fix broken styles/links and make up-to-date
> ---
>
> Key: SPARK-28609
> URL: https://issues.apache.org/jira/browse/SPARK-28609
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>







[jira] [Comment Edited] (SPARK-28471) Formatting dates with negative years

2019-08-04 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899633#comment-16899633
 ] 

Maxim Gekk edited comment on SPARK-28471 at 8/4/19 1:53 PM:


[~yumwang] There are two equivalent ways to output negative years (for the BC era):
 # with the sign '-'
 # or with the suffix `BC`

The proposed PR outputs `-` for negative years. I have fixed this only in Spark's 
formatters. In your example, you use the standard formatter of *java.sql.Date*. For 
example, if you switch on the Java 8 API:
{code:java}
scala> spark.conf.set("spark.sql.datetime.java8API.enabled", true)
scala> val df = spark.sql("select make_date(-44, 3, 15)")
df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date]
scala> val date = df.collect.head.getAs[java.time.LocalDate](0)
date: java.time.LocalDate = -0044-03-15
{code}


was (Author: maxgekk):
[~yumwang] There are 2 equal methods to output negative years (for BC era):
 # With sign '-'
 # or add suffix `BC`

The proposed PR outputs `-` for negative years. I have fixed this only Spark's 
formatters. In your example, you uses standard formatter of java.sql.Date. For 
example, if you switch on Java 8 API:
{code:java}
scala> spark.conf.set("spark.sql.datetime.java8API.enabled", true)
scala> val df = spark.sql("select make_date(-44, 3, 15)")
df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date]
scala> val date = df.collect.head.getAs[java.time.LocalDate](0)
date: java.time.LocalDate = -0044-03-15
{code}

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users, since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are out of the supported range of the DATE type, it 
> would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>make_date   
> ---
>  0044-03-15 BC
> (1 row)
> {code}






[jira] [Commented] (SPARK-28471) Formatting dates with negative years

2019-08-04 Thread Maxim Gekk (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899633#comment-16899633
 ] 

Maxim Gekk commented on SPARK-28471:


[~yumwang] There are two equivalent ways to output negative years (for the BC era):
 # with the sign '-'
 # or with the suffix `BC`

The proposed PR outputs `-` for negative years. I have fixed this only in Spark's 
formatters. In your example, you use the standard formatter of java.sql.Date. For 
example, if you switch on the Java 8 API:
{code:java}
scala> spark.conf.set("spark.sql.datetime.java8API.enabled", true)
scala> val df = spark.sql("select make_date(-44, 3, 15)")
df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date]
scala> val date = df.collect.head.getAs[java.time.LocalDate](0)
date: java.time.LocalDate = -0044-03-15
{code}

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users, since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are out of the supported range of the DATE type, it 
> would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>make_date   
> ---
>  0044-03-15 BC
> (1 row)
> {code}






[jira] [Commented] (SPARK-28471) Formatting dates with negative years

2019-08-04 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899630#comment-16899630
 ] 

Yuming Wang commented on SPARK-28471:
-

Hi [~maxgekk], it seems the issue still exists:
{code:scala}
scala> val df = spark.sql("select make_date(-44, 3, 15)")
df: org.apache.spark.sql.DataFrame = [make_date(-44, 3, 15): date]

scala> df.show
+---------------------+
|make_date(-44, 3, 15)|
+---------------------+
|          -0044-03-15|
+---------------------+


scala> df.collect.head.getDate(0)
res1: java.sql.Date = 0045-03-17
{code}


{code:sql}
spark-sql> select make_date(-44, 3, 15);
-0044-03-15
{code}

> Formatting dates with negative years
> 
>
> Key: SPARK-28471
> URL: https://issues.apache.org/jira/browse/SPARK-28471
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> While converting dates with negative years to strings, Spark skips the era 
> sub-field by default. That can confuse users, since years from the BC era are 
> mirrored into the current era. For example:
> {code}
> spark-sql> select make_date(-44, 3, 15);
> 0045-03-15
> {code}
> Even though negative years are out of the supported range of the DATE type, it 
> would be nice to indicate the era for such dates.
> PostgreSQL outputs the era for such inputs:
> {code}
> # select make_date(-44, 3, 15);
>make_date   
> ---
>  0044-03-15 BC
> (1 row)
> {code}






[jira] [Updated] (SPARK-28613) Spark SQL action collect judges only the compressed RDD's size, which is not accurate enough

2019-08-04 Thread angerszhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-28613:
--
Description: 
When we run the action DataFrame.collect(), the check against the configuration 
*spark.driver.maxResultSize* uses the size of the compressed byte array, which 
is not accurate.

When we fetch data through the Spark Thrift Server without incremental 
collection, it collects all data of the DataFrame for each partition.

Returned data goes through the following process:
 # compress the data into a byte array
 # package it as a ResultSet
 # return it to the driver and check it against *spark.driver.maxResultSize*
 # *decode (uncompress) the data as Array[Row]*

The compressed size differs significantly from the uncompressed size; the 
difference can be more than tenfold.

 

 

  was:When we run action DataFrame.collect() , for the configuration 
*spark.driver.maxResultSize  ,*when determine if the returned data exceeds the 
limit, it will use the compressed byte array's size, it it not 


> Spark SQL action collect judges only the compressed RDD's size, which is not 
> accurate enough
> --
>
> Key: SPARK-28613
> URL: https://issues.apache.org/jira/browse/SPARK-28613
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When we run the action DataFrame.collect(), the check against the configuration 
> *spark.driver.maxResultSize* uses the size of the compressed byte array, which 
> is not accurate.
> When we fetch data through the Spark Thrift Server without incremental 
> collection, it collects all data of the DataFrame for each partition.
> Returned data goes through the following process:
>  # compress the data into a byte array
>  # package it as a ResultSet
>  # return it to the driver and check it against *spark.driver.maxResultSize*
>  # *decode (uncompress) the data as Array[Row]*
> The compressed size differs significantly from the uncompressed size; the 
> difference can be more than tenfold.
>  
>  
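
A minimal sketch of the configuration being discussed (illustrative only; the 
limit is checked against the serialized, compressed task results, so the decoded 
rows on the driver can be much larger):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.driver.maxResultSize", "1g")  # checked against compressed result bytes
         .getOrCreate())

rows = spark.range(0, 1000000).collect()  # decoded Array[Row] may far exceed the compressed size
{code}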






[jira] [Updated] (SPARK-28613) Spark SQL action collect judges only the compressed RDD's size, which is not accurate enough

2019-08-04 Thread angerszhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-28613:
--
Description: When we run action DataFrame.collect() , for the configuration 
*spark.driver.maxResultSize  ,*when determine if the returned data exceeds the 
limit, it will use the compressed byte array's size, it it not 

> Spark SQL action collect judges only the compressed RDD's size, which is not 
> accurate enough
> --
>
> Key: SPARK-28613
> URL: https://issues.apache.org/jira/browse/SPARK-28613
> Project: Spark
>  Issue Type: Wish
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When we run action DataFrame.collect() , for the configuration 
> *spark.driver.maxResultSize  ,*when determine if the returned data exceeds 
> the limit, it will use the compressed byte array's size, it it not 






[jira] [Created] (SPARK-28613) Spark SQL action collect judges only the compressed RDD's size, which is not accurate enough

2019-08-04 Thread angerszhu (JIRA)
angerszhu created SPARK-28613:
-

 Summary: Spark SQL action collect judges only the compressed 
RDD's size, which is not accurate enough
 Key: SPARK-28613
 URL: https://issues.apache.org/jira/browse/SPARK-28613
 Project: Spark
  Issue Type: Wish
  Components: SQL
Affects Versions: 2.4.0
Reporter: angerszhu









[jira] [Commented] (SPARK-28611) Histogram's height is different

2019-08-04 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899585#comment-16899585
 ] 

Marco Gaido commented on SPARK-28611:
-

Mmmh... that's weird! How can you get a different result from what was obtained on 
the CI? How are you running the code?

> Histogram's height is different
> --
>
> Key: SPARK-28611
> URL: https://issues.apache.org/jira/browse/SPARK-28611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;
> -- Test output for histogram statistics
> SET spark.sql.statistics.histogram.enabled=true;
> SET spark.sql.statistics.histogram.numBins=2;
> INSERT INTO desc_col_table values 1, 2, 3, 4;
> ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
> DESC EXTENDED desc_col_table key;
> {code}
> {noformat}
> spark-sql> DESC EXTENDED desc_col_table key;
> col_name  key
> data_type int
> comment   column_comment
> min   1
> max   4
> num_nulls 0
> distinct_count4
> avg_col_len   4
> max_col_len   4
> histogram height: 4.0, num_of_bins: 2
> bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
> bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
> {noformat}
> But our result is:
> https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242






[jira] [Commented] (SPARK-24091) Internally used ConfigMap prevents use of user-specified ConfigMaps carrying Spark configs files

2019-08-04 Thread Michael Gendelman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899578#comment-16899578
 ] 

Michael Gendelman commented on SPARK-24091:
---

[~tmckay] We are experiencing the same issue and your solution looks great but 
I'm not sure how to implement it. You write:
{quote}One way to handle this is allow a user-specified ConfigMap to be mounted 
at another location
{quote}
How can we mount a user-specified ConfigMap when we can't control the Pod 
specification?

> Internally used ConfigMap prevents use of user-specified ConfigMaps carrying 
> Spark configs files
> 
>
> Key: SPARK-24091
> URL: https://issues.apache.org/jira/browse/SPARK-24091
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Yinan Li
>Priority: Major
>
> The recent PR [https://github.com/apache/spark/pull/20669] for removing the 
> init-container introduced an internally used ConfigMap carrying Spark 
> configuration properties in a file for the driver. This ConfigMap gets 
> mounted under {{$SPARK_HOME/conf}} and the environment variable 
> {{SPARK_CONF_DIR}} is set to point to the mount path. This pretty much 
> prevents users from mounting their own ConfigMaps that carry custom Spark 
> configuration files, e.g., {{log4j.properties}} and {{spark-env.sh}}, and 
> leaves users with only the option of building custom images. IMO, it is very 
> useful to support mounting user-specified ConfigMaps for custom Spark 
> configuration files. This warrants further discussion.






[jira] [Updated] (SPARK-28411) insertInto with overwrite inconsistent behaviour Python/Scala

2019-08-04 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-28411:

Labels: release-notes  (was: )

> insertInto with overwrite inconsistent behaviour Python/Scala
> -
>
> Key: SPARK-28411
> URL: https://issues.apache.org/jira/browse/SPARK-28411
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.1, 2.4.0
>Reporter: Maria Rebelka
>Assignee: Huaxin Gao
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The df.write.mode("overwrite").insertInto("table") has inconsistent behaviour 
> between Scala and Python. In Python, insertInto ignores the "mode" parameter and 
> appends by default. Only when changing the syntax to df.write.insertInto("table", 
> overwrite=True) do we get the expected behaviour.
> This is native Spark syntax, expected to be the same between languages... 
> Also, in other write methods, like saveAsTable or write.parquet, "mode" seems 
> to be respected.
> Reproduce, Python, ignore "overwrite":
> {code:java}
> df = spark.createDataFrame(sc.parallelize([(1, 2),(3,4)]),['i','j'])
> # create the table and load data
> df.write.saveAsTable("spark_overwrite_issue")
> # insert overwrite, expected result - 2 rows
> df.write.mode("overwrite").insertInto("spark_overwrite_issue")
> spark.sql("select * from spark_overwrite_issue").count()
> # result - 4 rows, insert appended data instead of overwrite{code}
> Reproduce, Scala, works as expected:
> {code:java}
> val df = Seq((1, 2),(3,4)).toDF("i","j")
> df.write.mode("overwrite").insertInto("spark_overwrite_issue")
> spark.sql("select * from spark_overwrite_issue").count()
> # result - 2 rows{code}
> Tested on Spark 2.2.1 (EMR) and 2.4.0 (Databricks)


