[jira] [Commented] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16240331#comment-16240331
 ] 

Michael Styles commented on SPARK-22456:


This functionality is part of the SQL standard and supported by a number of 
relational database systems including Oracle, DB2, SQL Server, MySQL, and 
Teradata, to name a few.
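
For illustration, a minimal sketch of how the proposed function could be used from PySpark, assuming it is exposed in pyspark.sql.functions under the name dayofweek as described below:

{noformat}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2017-11-05",), ("2017-11-06",)], ["d"])

# Hypothetical usage of the proposed function: 2017-11-05 is a Sunday (1) and
# 2017-11-06 is a Monday (2) under the 1 = Sunday convention described below.
df.select("d", F.dayofweek(F.to_date("d")).alias("dow")).show()
{noformat}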

> Add new function dayofweek
> --
>
> Key: SPARK-22456
> URL: https://issues.apache.org/jira/browse/SPARK-22456
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> Add new function *dayofweek* to return the day of the week of the given 
> argument as an integer value in the range 1-7, where 1 represents Sunday.






[jira] [Created] (SPARK-22456) Add new function dayofweek

2017-11-06 Thread Michael Styles (JIRA)
Michael Styles created SPARK-22456:
--

 Summary: Add new function dayofweek
 Key: SPARK-22456
 URL: https://issues.apache.org/jira/browse/SPARK-22456
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Michael Styles


Add new function *dayofweek* to return the day of the week of the given 
argument as an integer value in the range 1-7, where 1 represents Sunday.






[jira] [Commented] (SPARK-21692) Modify PythonUDF to support nullability

2017-08-11 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16123461#comment-16123461
 ] 

Michael Styles commented on SPARK-21692:


https://github.com/apache/spark/pull/18906

> Modify PythonUDF to support nullability
> ---
>
> Key: SPARK-21692
> URL: https://issues.apache.org/jira/browse/SPARK-21692
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Michael Styles
>
> When creating or registering Python UDFs, a user may know whether null values 
> can be returned by the function. PythonUDF and related classes should be 
> modified to support nullability.






[jira] [Created] (SPARK-21692) Modify PythonUDF to support nullability

2017-08-10 Thread Michael Styles (JIRA)
Michael Styles created SPARK-21692:
--

 Summary: Modify PythonUDF to support nullability
 Key: SPARK-21692
 URL: https://issues.apache.org/jira/browse/SPARK-21692
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Michael Styles


When creating or registering Python UDFs, a user may know whether null values 
can be returned by the function. PythonUDF and related classes should be 
modified to support nullability.
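
To illustrate the motivation with today's API, here is a small sketch; the nullable declaration mentioned in the comments is the proposed capability, not an existing udf() parameter:

{noformat}
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# This function never returns None, but Spark cannot know that today:
# the output column of a Python UDF is always marked nullable.
first_word = F.udf(lambda s: (s or "").split(" ")[0], StringType())

df = spark.createDataFrame([("hello world",)], ["text"])
df.select(first_word("text").alias("w")).printSchema()
# w: string (nullable = true), even though None is impossible here.
# The proposal is to let the author declare this, e.g. udf(..., nullable=False)
# -- a hypothetical signature, not current PySpark API.
{noformat}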






[jira] [Commented] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-28 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16066363#comment-16066363
 ] 

Michael Styles commented on SPARK-17091:


In Parquet 1.7, there was a bug involving corrupt statistics on binary columns 
(https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier 
versions of Spark from generating Parquet filters on any string columns. Spark 
2.1 has moved up to Parquet 1.8.2, so this issue no longer exists.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Resolved] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles resolved SPARK-21218.

Resolution: Duplicate

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Comment Edited] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065413#comment-16065413
 ] 

Michael Styles edited comment on SPARK-17091 at 6/27/17 8:26 PM:
-

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement. See attachments.


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Updated] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17091:
---
Summary: Convert IN predicate to equivalent Parquet filter  (was: 
ParquetFilters rewrite IN to OR of Eq)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Updated] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17091:
---
Attachment: IN Predicate.png
OR Predicate.png

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Commented] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065413#comment-16065413
 ] 

Michael Styles commented on SPARK-17091:


By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.
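
As a side note, one way to investigate is to compare the physical plans produced by the two forms (the FileScan node lists pushed source filters under PushedFilters); a rough sketch against the same df:

{noformat}
# Comparing the two physical plans (and their run times) shows the effect of
# rewriting IN as a chain of OR'ed equality predicates by hand.
df.filter((df['C1'] == 42) | (df['C1'] == 139)).explain()
df.filter(df['C1'].isin([42, 139])).explain()
{noformat}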

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Reopened] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles reopened SPARK-17091:


> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.






[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065210#comment-16065210
 ] 

Michael Styles commented on SPARK-21218:


In Parquet 1.7, there was a bug involving corrupt statistics on binary columns 
(https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier 
versions of Spark from generating Parquet filters on any string columns. Spark 
2.1 has moved up to Parquet 1.8.2, so this issue no longer exists.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16065142#comment-16065142
 ] 

Michael Styles commented on SPARK-21218:


[~hyukjin.kwon] Not sure I understand what you want me to do with my PR, 
assuming this JIRA is resolved as a duplicate?

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 12:17 PM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 10:30 AM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 10:25 AM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16064608#comment-16064608
 ] 

Michael Styles commented on SPARK-21218:


By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: (was: Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
06-02-53.png)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: (was: Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 
06-03-29.png)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: IN Predicate.png
OR Predicate.png

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png, Starscream Console 
> on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 
> 2017-06-27 06-03-29.png, Starscream Console on 
> OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
> 06-02-53.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
06-02-53.png
Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 
06-03-29.png

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: Starscream Console on 
> OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 
> 06-03-29.png, Starscream Console on 
> OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
> 06-02-53.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)






[jira] [Created] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-26 Thread Michael Styles (JIRA)
Michael Styles created SPARK-21218:
--

 Summary: Convert IN predicate to equivalent Parquet filter
 Key: SPARK-21218
 URL: https://issues.apache.org/jira/browse/SPARK-21218
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.1
Reporter: Michael Styles


Convert IN predicate to equivalent expression involving equality conditions to 
allow the filter to be pushed down to Parquet.

For instance,

C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
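
Until the optimizer performs this rewrite, the same transformation can be applied by hand on the DataFrame side; a minimal sketch, assuming a Parquet-backed DataFrame df with a column C1:

{noformat}
from functools import reduce
from pyspark.sql import functions as F

values = [10, 20]

# Build (C1 = 10) OR (C1 = 20) explicitly instead of using isin(), so the
# resulting chain of OR'ed equality filters can be pushed down to Parquet.
pred = reduce(lambda a, b: a | b, [F.col("C1") == v for v in values])
df.filter(pred).explain()
{noformat}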






[jira] [Created] (SPARK-20636) Eliminate unnecessary shuffle with adjacent Window expressions

2017-05-08 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20636:
--

 Summary: Eliminate unnecessary shuffle with adjacent Window 
expressions
 Key: SPARK-20636
 URL: https://issues.apache.org/jira/browse/SPARK-20636
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.1
Reporter: Michael Styles


Consider the following example:

{noformat}
w1 = Window.partitionBy("sno")
w2 = Window.partitionBy("sno", "pno")

supply \
    .select('sno', 'pno', 'qty', F.sum('qty').over(w2).alias('sum_qty_2')) \
    .select('sno', 'pno', 'qty', F.col('sum_qty_2'),
            F.sum('qty').over(w1).alias('sum_qty_1')) \
    .explain()

== Optimized Logical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1112L], [sno#980]
+- Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1105L], [sno#980, 
pno#981]
   +- Relation[sno#980,pno#981,qty#982L] parquet

== Physical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1112L], [sno#980]
+- *Sort [sno#980 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(sno#980, 200)
  +- Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS 
BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1105L], 
[sno#980, pno#981]
 +- *Sort [sno#980 ASC NULLS FIRST, pno#981 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(sno#980, pno#981, 200)
   +- *FileScan parquet [sno#980,pno#981,qty#982L] ...
{noformat}

A more efficient query plan can be achieved by flipping the Window expressions 
to eliminate an unnecessary shuffle as follows:

{noformat}
== Optimized Logical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1087L], [sno#980, 
pno#981]
+- Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED 
PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1085L], [sno#980]
   +- Relation[sno#980,pno#981,qty#982L] parquet

== Physical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1087L], [sno#980, 
pno#981]
+- *Sort [sno#980 ASC NULLS FIRST, pno#981 ASC NULLS FIRST], false, 0
   +- Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN 
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1085L], [sno#980]
  +- *Sort [sno#980 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(sno#980, 200)
+- *FileScan parquet [sno#980,pno#981,qty#982L] ...
{noformat}
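
Until the optimizer performs this flip automatically, an equivalent plan without the second exchange can be obtained by hand by computing the w1 aggregate first, since hash partitioning on sno also satisfies clustering on (sno, pno); a sketch against the same supply DataFrame and window definitions as above:

{noformat}
supply \
    .select('sno', 'pno', 'qty', F.sum('qty').over(w1).alias('sum_qty_1')) \
    .select('sno', 'pno', 'qty', 'sum_qty_1',
            F.sum('qty').over(w2).alias('sum_qty_2')) \
    .explain()
{noformat}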







[jira] [Updated] (SPARK-20463) Add support for IS [NOT] DISTINCT FROM to SPARK SQL

2017-04-30 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-20463:
---
Description: 
Add support for the SQL standard distinct predicate to SPARK SQL.

{noformat}
<expression> IS [NOT] DISTINCT FROM <expression>
{noformat}

{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df = sc.parallelize(data).toDF(["c1", "c2"])
df.createTempView("df")
spark.sql("select c1, c2 from df where c1 is not distinct from c2").collect()
[Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}


  was:
Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
*isNotDistinctFrom*. For example:

{panel}
{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df2 = sc.parallelize(data).toDF("c1", "c2")
df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
[Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}
{panel}


> Add support for IS [NOT] DISTINCT FROM to SPARK SQL
> ---
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for the SQL standard distinct predicate to SPARK SQL.
> {noformat}
> <expression> IS [NOT] DISTINCT FROM <expression>
> {noformat}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df = sc.parallelize(data).toDF(["c1", "c2"])
> df.createTempView("df")
> spark.sql("select c1, c2 from df where c1 is not distinct from c2").collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}






[jira] [Updated] (SPARK-20463) Add support for IS [NOT] DISTINCT FROM to SPARK SQL

2017-04-30 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-20463:
---
Component/s: (was: PySpark)
 SQL
Summary: Add support for IS [NOT] DISTINCT FROM to SPARK SQL  (was: 
Expose SPARK SQL <=> operator in PySpark)

> Add support for IS [NOT] DISTINCT FROM to SPARK SQL
> ---
>
> Key: SPARK-20463
> URL: https://issues.apache.org/jira/browse/SPARK-20463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
> *isNotDistinctFrom*. For example:
> {panel}
> {noformat}
> data = [(10, 20), (30, 30), (40, None), (None, None)]
> df2 = sc.parallelize(data).toDF(["c1", "c2"])
> df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect()
> [Row(c1=30, c2=30), Row(c1=None, c2=None)]
> {noformat}
> {panel}






[jira] [Created] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark

2017-04-25 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20463:
--

 Summary: Expose SPARK SQL <=> operator in PySpark
 Key: SPARK-20463
 URL: https://issues.apache.org/jira/browse/SPARK-20463
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Michael Styles


Expose the SPARK SQL '<=>' operator in Pyspark as a column function called 
*isNotDistinctFrom*. For example:

{panel}
{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df2 = sc.parallelize(data).toDF("c1", "c2")
df2.where(df2["c1"].isNotDistinctFrom(df2["c2"]).collect())
[Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}
{panel}
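
For reference, a sketch of what already works today through Spark SQL's existing null-safe equality operator; the column-function spelling above is the proposed API, not something PySpark currently exposes:

{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df2 = sc.parallelize(data).toDF(["c1", "c2"])
df2.createTempView("df2")

# The existing <=> operator gives null-safe equality; isNotDistinctFrom would be
# an alternative spelling of the same comparison.
spark.sql("select c1, c2 from df2 where c1 <=> c2").collect()
# [Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}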






[jira] [Updated] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections

2017-04-20 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-20413:
---
Description: 
I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

{noformat}
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 
{noformat}

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this 
purpose. Consider the following example and generated query plan. 

{noformat}
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = F.no_collapse(df1.withColumn("ua", 
user_agent_details(df1["user_agent"]))) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Optimized Logical Plan == 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Physical Plan == 
*Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- *Project [UDF(user_agent#65) AS ua#69] 
   +- Scan ExistingRDD[id#64L,user_agent#65] 
{noformat}

As can be seen from the query plan, the user-defined function is now evaluated 
once per row.


  was:
I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

{noformat}
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 
{noformat}

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this purpose. 

[jira] [Updated] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections

2017-04-20 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-20413:
---
Description: 
I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

{noformat}
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 
{noformat}

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this 
purpose. Consider the following example and generated query plan. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = F.no_collapse(df1.withColumn("ua", 
user_agent_details(df1["user_agent"]))) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Optimized Logical Plan == 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Physical Plan == 
*Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- *Project [UDF(user_agent#65) AS ua#69] 
   +- Scan ExistingRDD[id#64L,user_agent#65] 

As can be seen from the query plan, the user-defined function is now evaluated 
once per row.


  was:
I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this 
purpose. 

[jira] [Updated] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections

2017-04-20 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-20413:
---
Description: 
I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this 
purpose. Consider the following example and generated query plan. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = F.no_collapse(df1.withColumn("ua", 
user_agent_details(df1["user_agent"]))) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Optimized Logical Plan == 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Physical Plan == 
*Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- *Project [UDF(user_agent#65) AS ua#69] 
   +- Scan ExistingRDD[id#64L,user_agent#65] 

As can be seen from the query plan, the user-defined function is now evaluated 
once per row.


  was:
I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this 
purpose. Consider the following example and generated query plan. 

[jira] [Created] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections

2017-04-20 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20413:
--

 Summary: New Optimizer Hint to prevent collapsing of adjacent 
projections
 Key: SPARK-20413
 URL: https://issues.apache.org/jira/browse/SPARK-20413
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer, PySpark, SQL
Affects Versions: 2.1.0
Reporter: Michael Styles


I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This 
hint is essentially identical to Oracle's NO_MERGE hint. 

Let me first give an example of why I am proposing this. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] 
+- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] 
   +- LogicalRDD [id#80L, user_agent#81] 

== Optimized Logical Plan == 
Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- LogicalRDD [id#80L, user_agent#81] 

== Physical Plan == 
*Project [UDF(user_agent#81).device_form_factor AS c1#90, 
UDF(user_agent#81).browser_version AS c2#91] 
+- Scan ExistingRDD[id#80L,user_agent#81] 

user_agent_details is a user-defined function that returns a struct. As can be 
seen from the generated query plan, the function is being executed multiple 
times which could lead to performance issues. This is due to the 
CollapseProject optimizer rule that collapses adjacent projections. 

I'm proposing a hint that will prevent the optimizer from collapsing adjacent 
projections. A new function called 'no_collapse' would be introduced for this 
purpose. Consider the following example and generated query plan. 

df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) 
df2 = F.no_collapse(df1.withColumn("ua", 
user_agent_details(df1["user_agent"]))) 
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), 
df2["ua"].browser_version.alias("c2")) 
df3.explain(True) 

== Parsed Logical Plan == 
'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Analyzed Logical Plan == 
c1: string, c2: string 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Optimized Logical Plan == 
Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- NoCollapseHint 
   +- Project [UDF(user_agent#65) AS ua#69] 
  +- LogicalRDD [id#64L, user_agent#65] 

== Physical Plan == 
*Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] 
+- *Project [UDF(user_agent#65) AS ua#69] 
   +- Scan ExistingRDD[id#64L,user_agent#65] 

As can be seen from the query plan, the user-defined function is now evaluated 
once per row.
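
For completeness, one workaround sometimes used today, sketched under the same setup (user_agent_details being the user-defined function above): materialize the intermediate projection with cache() so the UDF result is computed once per row and reused, at the cost of keeping the intermediate data in memory.

{noformat}
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"])
df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])).cache()
df3 = df2.select(df2["ua"].device_form_factor.alias("c1"),
                 df2["ua"].browser_version.alias("c2"))
df3.explain(True)
# The plan now reads the ua column from an InMemoryRelation instead of
# re-evaluating the UDF once per projected field.
{noformat}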







[jira] [Created] (SPARK-20350) Apply Complementation Laws during boolean expression simplification

2017-04-16 Thread Michael Styles (JIRA)
Michael Styles created SPARK-20350:
--

 Summary: Apply Complementation Laws during boolean expression 
simplification
 Key: SPARK-20350
 URL: https://issues.apache.org/jira/browse/SPARK-20350
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 2.1.0
Reporter: Michael Styles


Apply Complementation Laws during boolean expression simplification.

* A AND NOT(A) == FALSE
* A OR NOT(A) == TRUE
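
A small sketch of the intended effect, using a non-nullable column (the laws as stated assume two-valued boolean inputs):

{noformat}
df = spark.range(10)  # id is non-nullable

# A AND NOT(A) == FALSE: with the rule, this filter should simplify to false
# and the query should return no rows without scanning.
df.filter((df.id > 5) & ~(df.id > 5)).explain(True)

# A OR NOT(A) == TRUE: with the rule, this filter should be eliminated entirely.
df.filter((df.id > 5) | ~(df.id > 5)).explain(True)
{noformat}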






[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-19851:
---
Description: 
Add support for EVERY and ANY (SOME) aggregates.

- EVERY returns true if all input values are true.
- ANY returns true if at least one input value is true.
- SOME is equivalent to ANY.

Both aggregates are part of the SQL standard.
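
Pending native support, both aggregates can be emulated over boolean columns with min and max, since false orders before true; a rough sketch:

{noformat}
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, True), (1, False), (2, True)], ["k", "b"])
df.groupBy("k").agg(
    F.min("b").alias("every_b"),  # EVERY(b): true only if all values are true
    F.max("b").alias("any_b")     # ANY(b) / SOME(b): true if at least one value is true
).show()
{noformat}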

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>
> Add support for EVERY and ANY (SOME) aggregates.
> - EVERY returns true if all input values are true.
> - ANY returns true if at least one input value is true.
> - SOME is equivalent to ANY.
> Both aggregates are part of the SQL standard.






[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899988#comment-15899988
 ] 

Michael Styles commented on SPARK-19851:


https://github.com/apache/spark/pull/17194

> Add support for EVERY and ANY (SOME) aggregates
> ---
>
> Key: SPARK-19851
> URL: https://issues.apache.org/jira/browse/SPARK-19851
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL
>Affects Versions: 2.1.0
>Reporter: Michael Styles
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates

2017-03-07 Thread Michael Styles (JIRA)
Michael Styles created SPARK-19851:
--

 Summary: Add support for EVERY and ANY (SOME) aggregates
 Key: SPARK-19851
 URL: https://issues.apache.org/jira/browse/SPARK-19851
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core, SQL
Affects Versions: 2.1.0
Reporter: Michael Styles






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-14 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420534#comment-15420534
 ] 

Michael Styles commented on SPARK-17035:


I have a fix for this issue if you would like to assign the problem to me.




-- 
Michael Styles
Senior Data Platform Engineer Lead
Shopify


> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(2534023187,)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17037) distinct() operator fails on Dataframe with column names containing periods

2016-08-12 Thread Michael Styles (JIRA)
Michael Styles created SPARK-17037:
--

 Summary: distinct() operator fails on Dataframe with column names 
containing periods
 Key: SPARK-17037
 URL: https://issues.apache.org/jira/browse/SPARK-17037
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Michael Styles


Using the distinct() operator on a DataFrame with column names containing 
periods results in an AnalysisException. For example:

{noformat}
d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}]
df = sqlContext.createDataFrame(d)
df.distinct()
{noformat}

results in the following error:
pyspark.sql.utils.AnalysisException: u'Cannot resolve column name 
"pageview.count" among (exit_page, pageview.count);'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-12 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17035:
---
Description: 
Conversion of datetime.max to microseconds produces incorrect value. For 
example,

{noformat}
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("dt", TimestampType(), False)])
data = [{"dt": datetime.max}]

# convert python objects to sql data
sql_data = [schema.toInternal(row) for row in data]

# Value is wrong.
sql_data
[(2.534023188e+17,)]
{noformat}

This value should be [(2534023187,)].

  was:
Conversion of datetime.max to microseconds produces incorrect value. For 
example,

from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("dt", TimestampType(), False)])
data = [{"dt": datetime.max}]

# convert python objects to sql data
sql_data = [schema.toInternal(row) for row in data]

# Value is wrong.
sql_data
[(2.534023188e+17,)]

This value should be [(2534023187,)].


> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> {noformat}
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> {noformat}
> This value should be [(2534023187,)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-12 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17035:
---
Priority: Minor  (was: Major)

> Conversion of datetime.max to microseconds produces incorrect value
> ---
>
> Key: SPARK-17035
> URL: https://issues.apache.org/jira/browse/SPARK-17035
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Michael Styles
>Priority: Minor
>
> Conversion of datetime.max to microseconds produces incorrect value. For 
> example,
> from datetime import datetime
> from pyspark.sql import Row
> from pyspark.sql.types import StructType, StructField, TimestampType
> schema = StructType([StructField("dt", TimestampType(), False)])
> data = [{"dt": datetime.max}]
> # convert python objects to sql data
> sql_data = [schema.toInternal(row) for row in data]
> # Value is wrong.
> sql_data
> [(2.534023188e+17,)]
> This value should be [(2534023187,)].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value

2016-08-12 Thread Michael Styles (JIRA)
Michael Styles created SPARK-17035:
--

 Summary: Conversion of datetime.max to microseconds produces 
incorrect value
 Key: SPARK-17035
 URL: https://issues.apache.org/jira/browse/SPARK-17035
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Michael Styles


Conversion of datetime.max to microseconds produces incorrect value. For 
example,

{noformat}
from datetime import datetime
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, TimestampType

schema = StructType([StructField("dt", TimestampType(), False)])
data = [{"dt": datetime.max}]

# convert python objects to sql data
sql_data = [schema.toInternal(row) for row in data]

# Value is wrong.
sql_data
[(2.534023188e+17,)]
{noformat}

This value should be [(2534023187,)].
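
A pure-Python sketch of the suspected cause (illustrative only; the actual PySpark 
conversion also depends on whether the datetime is timezone-aware): computing the 
microsecond count with float arithmetic loses the low-order digits at this 
magnitude, while integer arithmetic preserves the exact value.

{noformat}
import calendar
from datetime import datetime

dt = datetime.max  # 9999-12-31 23:59:59.999999
seconds = calendar.timegm(dt.utctimetuple())

as_float = seconds * 1e6 + dt.microsecond    # float arithmetic: low-order digits are lost
as_int = seconds * 1000000 + dt.microsecond  # integer arithmetic: exact microsecond count

print(as_float)
print(as_int)
{noformat}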



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org