[jira] [Commented] (SPARK-22456) Add new function dayofweek
[ https://issues.apache.org/jira/browse/SPARK-22456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240331#comment-16240331 ] Michael Styles commented on SPARK-22456: This functionality is part of the SQL standard and supported by a number of relational database systems including Oracle, DB2, SQL Server, MySQL, and Teradata, to name a few. > Add new function dayofweek > -- > > Key: SPARK-22456 > URL: https://issues.apache.org/jira/browse/SPARK-22456 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles > > Add new function *dayofweek* to return the day of the week of the given > argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22456) Add new function dayofweek
Michael Styles created SPARK-22456: -- Summary: Add new function dayofweek Key: SPARK-22456 URL: https://issues.apache.org/jira/browse/SPARK-22456 Project: Spark Issue Type: New Feature Components: PySpark, SQL Affects Versions: 2.2.0 Reporter: Michael Styles Add new function *dayofweek* to return the day of the week of the given argument as an integer value in the range 1-7, where 1 represents Sunday. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
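For a concrete sense of the proposed semantics, here is a minimal PySpark sketch. It assumes the function is exposed as pyspark.sql.functions.dayofweek (the name eventually added in Spark 2.3); the session setup is boilerplate.
{noformat}
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# 2017-11-05 was a Sunday, so the proposed function should return 1
df = spark.createDataFrame([("2017-11-05",)], ["d"])
df.select(F.dayofweek(F.to_date(df["d"])).alias("dow")).show()
# +---+
# |dow|
# +---+
# |  1|
# +---+
{noformat}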
[jira] [Commented] (SPARK-21692) Modify PythonUDF to support nullability
[ https://issues.apache.org/jira/browse/SPARK-21692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123461#comment-16123461 ] Michael Styles commented on SPARK-21692: https://github.com/apache/spark/pull/18906 > Modify PythonUDF to support nullability > --- > > Key: SPARK-21692 > URL: https://issues.apache.org/jira/browse/SPARK-21692 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Michael Styles > > When creating or registering Python UDFs, a user may know whether null values > can be returned by the function. PythonUDF and related classes should be > modified to support nullability. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21692) Modify PythonUDF to support nullability
Michael Styles created SPARK-21692: -- Summary: Modify PythonUDF to support nullability Key: SPARK-21692 URL: https://issues.apache.org/jira/browse/SPARK-21692 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 2.2.0 Reporter: Michael Styles When creating or registering Python UDFs, a user may know whether null values can be returned by the function. PythonUDF and related classes should be modified to support nullability. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
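One possible shape for such an API is sketched below; the nullable flag is hypothetical, not an existing Spark parameter, and only illustrates where the nullability declaration would live.
{noformat}
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical 'nullable' parameter: declares that this UDF never returns
# null, so the resulting column could be marked non-nullable in the schema
normalize = udf(lambda s: (s or "").strip().lower(), StringType(), nullable=False)
{noformat}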
[jira] [Commented] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066363#comment-16066363 ] Michael Styles commented on SPARK-17091: In Parquet 1.7, there was a bug involving corrupt statistics on binary columns (https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier versions of Spark from generating Parquet filters on any string columns. Spark 2.1 has moved up to Parquet 1.8.2, so this issue no longer exists. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles resolved SPARK-21218. Resolution: Duplicate > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065413#comment-16065413 ] Michael Styles edited comment on SPARK-17091 at 6/27/17 8:26 PM: - By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} I'm seeing about a 50-75% improvement. See attachments. was (Author: ptkool): By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} I'm seeing about a 50-75% improvement. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
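A quick way to check whether a predicate of this kind actually reaches Parquet is to inspect the physical plan; a sketch, assuming a DataFrame df with a column C1 read from Parquet:
{noformat}
df.filter(df['C1'].isin([42, 139])).explain()
# The FileScan parquet node prints a PushedFilters list; whether a given
# source filter is then evaluated by Parquet itself depends on ParquetFilters
# being able to translate it, which is what this issue is about
{noformat}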
[jira] [Updated] (SPARK-17091) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-17091: --- Summary: Convert IN predicate to equivalent Parquet filter (was: ParquetFilters rewrite IN to OR of Eq) > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-17091: --- Attachment: IN Predicate.png OR Predicate.png > ParquetFilters rewrite IN to OR of Eq > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > Attachments: IN Predicate.png, OR Predicate.png > > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065413#comment-16065413 ] Michael Styles commented on SPARK-17091: By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} I'm seeing about a 50-75% improvement. > ParquetFilters rewrite IN to OR of Eq > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq
[ https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles reopened SPARK-17091: > ParquetFilters rewrite IN to OR of Eq > - > > Key: SPARK-17091 > URL: https://issues.apache.org/jira/browse/SPARK-17091 > Project: Spark > Issue Type: Bug >Reporter: Andrew Duffy > > Past attempts at pushing down the InSet operation for Parquet relied on > user-defined predicates. It would be simpler to rewrite an IN clause into the > corresponding OR union of a set of equality conditions. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065210#comment-16065210 ] Michael Styles commented on SPARK-21218: In Parquet 1.7, there was a bug involving corrupt statistics on binary columns (https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier versions of Spark from generating Parquet filters on any string columns. Spark 2.1 has moved up to Parquet 1.8.2, so this issue no longer exists. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065142#comment-16065142 ] Michael Styles commented on SPARK-21218: [~hyukjin.kwon] Not sure I understand what you want me to do with my PR, assuming this JIRA is resolved as a duplicate? > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608 ] Michael Styles edited comment on SPARK-21218 at 6/27/17 12:17 PM: -- By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} I'm seeing about a 50-75% improvement. was (Author: ptkool): By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} Notice the difference in the number of output rows for the scans (see attachments). Also, the IN predicate test took about 1.1 minutes, while the OR predicate test took about 16 seconds. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608 ] Michael Styles edited comment on SPARK-21218 at 6/27/17 10:30 AM: -- By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} Notice the difference in the number of output rows for the scans (see attachments). Also, the IN predicate test took about 1.1 minutes, while the OR predicate test took about 16 seconds. was (Author: ptkool): By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} !IN Predicate.png|thumbnail! *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} !OR Predicate.png|thumbnail! Notice the difference in the number of output rows for the scan. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608 ] Michael Styles edited comment on SPARK-21218 at 6/27/17 10:25 AM: -- By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} !IN Predicate.png|thumbnail! *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} !OR Predicate.png|thumbnail! Notice the difference in the number of output rows for the scan. was (Author: ptkool): By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} !IN Predicate.png|thumbnail! *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} !OR Pedicate.png|thumbnail! Notice the difference in the number of output rows for the scan. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608 ] Michael Styles commented on SPARK-21218: By not pushing the filter to Parquet, are we not preventing Parquet from skipping blocks during read operations? I have tests that show big improvements when applying this transformation. For instance, I have a Parquet file with 162,456,394 rows which is sorted on column C1. *IN Predicate* {noformat} df.filter(df['C1'].isin([42, 139])).collect() {noformat} !IN Predicate.png|thumbnail! *OR Predicate* {noformat} df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect() {noformat} !OR Pedicate.png|thumbnail! Notice the difference in the number of output rows for the scan. > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-21218: --- Attachment: (was: Starscream Console on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 06-02-53.png) > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-21218: --- Attachment: (was: Starscream Console on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 06-03-29.png) > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-21218: --- Attachment: IN Predicate.png OR Predicate.png > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: IN Predicate.png, OR Predicate.png, Starscream Console > on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 > 2017-06-27 06-03-29.png, Starscream Console on > OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 > 06-02-53.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
[ https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-21218: --- Attachment: Starscream Console on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 06-02-53.png Starscream Console on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 06-03-29.png > Convert IN predicate to equivalent Parquet filter > - > > Key: SPARK-21218 > URL: https://issues.apache.org/jira/browse/SPARK-21218 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.1 >Reporter: Michael Styles > Attachments: Starscream Console on > OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 > 06-03-29.png, Starscream Console on > OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 > 06-02-53.png > > > Convert IN predicate to equivalent expression involving equality conditions > to allow the filter to be pushed down to Parquet. > For instance, > C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21218) Convert IN predicate to equivalent Parquet filter
Michael Styles created SPARK-21218: -- Summary: Convert IN predicate to equivalent Parquet filter Key: SPARK-21218 URL: https://issues.apache.org/jira/browse/SPARK-21218 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 2.1.1 Reporter: Michael Styles Convert IN predicate to equivalent expression involving equality conditions to allow the filter to be pushed down to Parquet. For instance, C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
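The rewrite is easy to express at the DataFrame level as well; a sketch of the equivalence, assuming a DataFrame df with a column C1:
{noformat}
from functools import reduce

values = [10, 20]

# IN form
in_pred = df['C1'].isin(values)

# Equivalent OR-of-equalities form, the shape this issue proposes to
# generate so the filter can be pushed down to Parquet
or_pred = reduce(lambda a, b: a | b, [df['C1'] == v for v in values])

df.filter(or_pred).explain()
{noformat}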
[jira] [Created] (SPARK-20636) Eliminate unnecessary shuffle with adjacent Window expressions
Michael Styles created SPARK-20636: -- Summary: Eliminate unnecessary shuffle with adjacent Window expressions Key: SPARK-20636 URL: https://issues.apache.org/jira/browse/SPARK-20636 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 2.1.1 Reporter: Michael Styles Consider the following example:
{noformat}
w1 = Window.partitionBy("sno")
w2 = Window.partitionBy("sno", "pno")

supply \
  .select('sno', 'pno', 'qty', F.sum('qty').over(w2).alias('sum_qty_2')) \
  .select('sno', 'pno', 'qty', F.col('sum_qty_2'), F.sum('qty').over(w1).alias('sum_qty_1')) \
  .explain()

== Optimized Logical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1112L], [sno#980]
+- Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1105L], [sno#980, pno#981]
   +- Relation[sno#980,pno#981,qty#982L] parquet

== Physical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1112L], [sno#980]
+- *Sort [sno#980 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(sno#980, 200)
      +- Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1105L], [sno#980, pno#981]
         +- *Sort [sno#980 ASC NULLS FIRST, pno#981 ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(sno#980, pno#981, 200)
               +- *FileScan parquet [sno#980,pno#981,qty#982L] ...
{noformat}
A more efficient query plan can be achieved by flipping the Window expressions to eliminate an unnecessary shuffle as follows:
{noformat}
== Optimized Logical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1087L], [sno#980, pno#981]
+- Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1085L], [sno#980]
   +- Relation[sno#980,pno#981,qty#982L] parquet

== Physical Plan ==
Window [sum(qty#982L) windowspecdefinition(sno#980, pno#981, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_2#1087L], [sno#980, pno#981]
+- *Sort [sno#980 ASC NULLS FIRST, pno#981 ASC NULLS FIRST], false, 0
   +- Window [sum(qty#982L) windowspecdefinition(sno#980, ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS sum_qty_1#1085L], [sno#980]
      +- *Sort [sno#980 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(sno#980, 200)
            +- *FileScan parquet [sno#980,pno#981,qty#982L] ...
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
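Until such a rule exists, the same plan can be obtained by hand: apply the window with the coarser partitioning first, since an exchange on sno already satisfies the clustering required by (sno, pno), as the optimized plan above shows. A sketch against the same supply DataFrame:
{noformat}
from pyspark.sql import Window, functions as F

w1 = Window.partitionBy("sno")
w2 = Window.partitionBy("sno", "pno")

supply \
  .select('sno', 'pno', 'qty', F.sum('qty').over(w1).alias('sum_qty_1')) \
  .select('sno', 'pno', 'qty', F.col('sum_qty_1'), F.sum('qty').over(w2).alias('sum_qty_2')) \
  .explain()
# Only one Exchange (hashpartitioning on sno) should appear in the plan
{noformat}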
[jira] [Updated] (SPARK-20463) Add support for IS [NOT] DISTINCT FROM to SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-20463: --- Description: Add support for the SQL standard distinct predicate to SPARK SQL. {noformat} IS [NOT] DISTINCT FROM {noformat} {noformat} data = [(10, 20), (30, 30), (40, None), (None, None)] df = sc.parallelize(data).toDF(["c1", "c2"]) df.createTempView("df") spark.sql("select c1, c2 from df where c1 is not distinct from c2").collect() [Row(c1=30, c2=30), Row(c1=None, c2=None)] {noformat} was: Expose the SPARK SQL '<=>' operator in PySpark as a column function called *isNotDistinctFrom*. For example: {panel} {noformat} data = [(10, 20), (30, 30), (40, None), (None, None)] df2 = sc.parallelize(data).toDF(["c1", "c2"]) df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect() [Row(c1=30, c2=30), Row(c1=None, c2=None)] {noformat} {panel} > Add support for IS [NOT] DISTINCT FROM to SPARK SQL > --- > > Key: SPARK-20463 > URL: https://issues.apache.org/jira/browse/SPARK-20463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for the SQL standard distinct predicate to SPARK SQL. > {noformat} > IS [NOT] DISTINCT FROM > {noformat} > {noformat} > data = [(10, 20), (30, 30), (40, None), (None, None)] > df = sc.parallelize(data).toDF(["c1", "c2"]) > df.createTempView("df") > spark.sql("select c1, c2 from df where c1 is not distinct from c2").collect() > [Row(c1=30, c2=30), Row(c1=None, c2=None)] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
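For reference, the same null-safe comparison is also reachable from the DataFrame API; a sketch assuming Column.eqNullSafe (added to PySpark in 2.3) and the same data as above:
{noformat}
data = [(10, 20), (30, 30), (40, None), (None, None)]
df = sc.parallelize(data).toDF(["c1", "c2"])

# <=> semantics: equal values match, and two NULLs are considered equal
df.where(df["c1"].eqNullSafe(df["c2"])).collect()
# [Row(c1=30, c2=30), Row(c1=None, c2=None)]
{noformat}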
[jira] [Updated] (SPARK-20463) Add support for IS [NOT] DISTINCT FROM to SPARK SQL
[ https://issues.apache.org/jira/browse/SPARK-20463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-20463: --- Component/s: (was: PySpark) SQL Summary: Add support for IS [NOT] DISTINCT FROM to SPARK SQL (was: Expose SPARK SQL <=> operator in PySpark) > Add support for IS [NOT] DISTINCT FROM to SPARK SQL > --- > > Key: SPARK-20463 > URL: https://issues.apache.org/jira/browse/SPARK-20463 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Expose the SPARK SQL '<=>' operator in PySpark as a column function called > *isNotDistinctFrom*. For example: > {panel} > {noformat} > data = [(10, 20), (30, 30), (40, None), (None, None)] > df2 = sc.parallelize(data).toDF(["c1", "c2"]) > df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect() > [Row(c1=30, c2=30), Row(c1=None, c2=None)] > {noformat} > {panel} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20463) Expose SPARK SQL <=> operator in PySpark
Michael Styles created SPARK-20463: -- Summary: Expose SPARK SQL <=> operator in PySpark Key: SPARK-20463 URL: https://issues.apache.org/jira/browse/SPARK-20463 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 2.1.0 Reporter: Michael Styles Expose the SPARK SQL '<=>' operator in PySpark as a column function called *isNotDistinctFrom*. For example: {panel} {noformat} data = [(10, 20), (30, 30), (40, None), (None, None)] df2 = sc.parallelize(data).toDF(["c1", "c2"]) df2.where(df2["c1"].isNotDistinctFrom(df2["c2"])).collect() [Row(c1=30, c2=30), Row(c1=None, c2=None)] {noformat} {panel} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections
[ https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-20413: --- Description: I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. {noformat} df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] {noformat} user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that will prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following example and generated query plan. {noformat} df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_agent"]))) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Analyzed Logical Plan == c1: string, c2: string Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Optimized Logical Plan == Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Physical Plan == *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- *Project [UDF(user_agent#65) AS ua#69] +- Scan ExistingRDD[id#64L,user_agent#65] {noformat} As can be seen from the query plan, the user-defined function is now evaluated once per row. was: I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. 
{noformat} df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] {noformat} user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that will prevent the optimizer from collapsing adjacent projections. A new function called 'no_colla
[jira] [Updated] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections
[ https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-20413: --- Description: I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. {noformat} df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] {noformat} user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that will prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following example and generated query plan. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_agent"]))) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Analyzed Logical Plan == c1: string, c2: string Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Optimized Logical Plan == Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Physical Plan == *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- *Project [UDF(user_agent#65) AS ua#69] +- Scan ExistingRDD[id#64L,user_agent#65] As can be seen from the query plan, the user-defined function is now evaluated once per row. was: I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. 
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that will prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose.
[jira] [Updated] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections
[ https://issues.apache.org/jira/browse/SPARK-20413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-20413: --- Description: I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that will prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following example and generated query plan. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_agent"]))) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Analyzed Logical Plan == c1: string, c2: string Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Optimized Logical Plan == Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Physical Plan == *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- *Project [UDF(user_agent#65) AS ua#69] +- Scan ExistingRDD[id#64L,user_agent#65] As can be seen from the query plan, the user-defined function is now evaluated once per row. was: I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. 
df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following exam
[jira] [Created] (SPARK-20413) New Optimizer Hint to prevent collapsing of adjacent projections
Michael Styles created SPARK-20413: -- Summary: New Optimizer Hint to prevent collapsing of adjacent projections Key: SPARK-20413 URL: https://issues.apache.org/jira/browse/SPARK-20413 Project: Spark Issue Type: Improvement Components: Optimizer, PySpark, SQL Affects Versions: 2.1.0 Reporter: Michael Styles I am proposing that a new optimizer hint called NO_COLLAPSE be introduced. This hint is essentially identical to Oracle's NO_MERGE hint. Let me first give an example of why I am proposing this. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = df1.withColumn("ua", user_agent_details(df1["user_agent"])) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#85[device_form_factor] AS c1#90, ua#85[browser_version] AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Analyzed Logical Plan == c1: string, c2: string Project [ua#85.device_form_factor AS c1#90, ua#85.browser_version AS c2#91] +- Project [id#80L, user_agent#81, UDF(user_agent#81) AS ua#85] +- LogicalRDD [id#80L, user_agent#81] == Optimized Logical Plan == Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- LogicalRDD [id#80L, user_agent#81] == Physical Plan == *Project [UDF(user_agent#81).device_form_factor AS c1#90, UDF(user_agent#81).browser_version AS c2#91] +- Scan ExistingRDD[id#80L,user_agent#81] user_agent_details is a user-defined function that returns a struct. As can be seen from the generated query plan, the function is being executed multiple times which could lead to performance issues. This is due to the CollapseProject optimizer rule that collapses adjacent projections. I'm proposing a hint that prevent the optimizer from collapsing adjacent projections. A new function called 'no_collapse' would be introduced for this purpose. Consider the following example and generated query plan. df1 = sc.sql.createDataFrame([(1, "abc")], ["id", "user_agent"]) df2 = F.no_collapse(df1.withColumn("ua", user_agent_details(df1["user_agent"]))) df3 = df2.select(df2["ua"].device_form_factor.alias("c1"), df2["ua"].browser_version.alias("c2")) df3.explain(True) == Parsed Logical Plan == 'Project [ua#69[device_form_factor] AS c1#75, ua#69[browser_version] AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Analyzed Logical Plan == c1: string, c2: string Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [id#64L, user_agent#65, UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Optimized Logical Plan == Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- NoCollapseHint +- Project [UDF(user_agent#65) AS ua#69] +- LogicalRDD [id#64L, user_agent#65] == Physical Plan == *Project [ua#69.device_form_factor AS c1#75, ua#69.browser_version AS c2#76] +- *Project [UDF(user_agent#65) AS ua#69] +- Scan ExistingRDD[id#64L,user_agent#65] As can be seen from the query plan, the user-defined function is now evaluated once per row. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
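As an aside, later Spark versions allow a related workaround without any hint: CollapseProject will not merge projections when doing so would duplicate a non-deterministic expression, so marking the UDF non-deterministic keeps it evaluated once per row. A sketch, assuming PySpark 2.3+ (asNondeterministic) and a stand-in parser:
{noformat}
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType

ua_schema = StructType([
    StructField("device_form_factor", StringType()),
    StructField("browser_version", StringType()),
])

# Stand-in parsing logic, for illustration only
def parse_ua(s):
    return ("desktop", "1.0")

# A non-deterministic UDF is not duplicated when projections are collapsed
user_agent_details = F.udf(parse_ua, ua_schema).asNondeterministic()
{noformat}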
[jira] [Created] (SPARK-20350) Apply Complementation Laws during boolean expression simplification
Michael Styles created SPARK-20350: -- Summary: Apply Complementation Laws during boolean expression simplification Key: SPARK-20350 URL: https://issues.apache.org/jira/browse/SPARK-20350 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 2.1.0 Reporter: Michael Styles Apply Complementation Laws during boolean expression simplification. * A AND NOT(A) == FALSE * A OR NOT(A) == TRUE -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
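A simple way to observe the rule once it is in place; a sketch (note that for a nullable A, A AND NOT(A) yields NULL when A is NULL, which is equivalent to FALSE under filter semantics):
{noformat}
df = spark.range(5)

# With the rule applied, the optimizer should fold this predicate to false
df.filter((df['id'] > 1) & ~(df['id'] > 1)).explain(True)
{noformat}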
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-19851: --- Description: Add support for EVERY and ANY (SOME) aggregates. - EVERY returns true if all input values are true. - ANY returns true if at least one input value is true. - SOME is equivalent to ANY. Both aggregates are part of the SQL standard. > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
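In the meantime, both aggregates can be emulated on boolean columns, since Spark orders false before true; a sketch assuming a DataFrame df with a boolean column b:
{noformat}
import pyspark.sql.functions as F

df.agg(
    F.min('b').alias('every_b'),  # true only if every (non-null) value is true
    F.max('b').alias('any_b'),    # true if at least one value is true
)
{noformat}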
[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899988#comment-15899988 ] Michael Styles commented on SPARK-19851: https://github.com/apache/spark/pull/17194 > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
Michael Styles created SPARK-19851: -- Summary: Add support for EVERY and ANY (SOME) aggregates Key: SPARK-19851 URL: https://issues.apache.org/jira/browse/SPARK-19851 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL Affects Versions: 2.1.0 Reporter: Michael Styles -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15420534#comment-15420534 ] Michael Styles commented on SPARK-17035: I have a fix for this issue if you would like to assign the problem to me. -- Michael Styles, Senior Data Platform Engineer Lead, Shopify > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(253402318799999999,)]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17037) distinct() operator fails on Dataframe with column names containing periods
Michael Styles created SPARK-17037: -- Summary: distinct() operator fails on Dataframe with column names containing periods Key: SPARK-17037 URL: https://issues.apache.org/jira/browse/SPARK-17037 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Michael Styles Using the distinct() operator on a Dataframe with column names containing periods results in an AnalysisException. For example: {noformat} d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}] df = sqlContext.createDataFrame(d) df.distinct() {noformat} results in the following error: pyspark.sql.utils.AnalysisException: u'Cannot resolve column name "pageview.count" among (exit_page, pageview.count);' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
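One workaround until this is fixed is to rename the columns so that no name contains a period; a sketch:
{noformat}
d = [{'pageview.count': 100, 'exit_page': 'example.com/landing'}]
df = sqlContext.createDataFrame(d)

# Renaming sidesteps the dotted-name resolution failure in distinct()
df.toDF('exit_page', 'pageview_count').distinct().collect()
{noformat}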
[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-17035: --- Description: Conversion of datetime.max to microseconds produces incorrect value. For example, {noformat} from datetime import datetime from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, TimestampType schema = StructType([StructField("dt", TimestampType(), False)]) data = [{"dt": datetime.max}] # convert python objects to sql data sql_data = [schema.toInternal(row) for row in data] # Value is wrong. sql_data [(2.534023188e+17,)] {noformat} This value should be [(253402318799999999,)]. was: Conversion of datetime.max to microseconds produces incorrect value. For example, from datetime import datetime from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, TimestampType schema = StructType([StructField("dt", TimestampType(), False)]) data = [{"dt": datetime.max}] # convert python objects to sql data sql_data = [schema.toInternal(row) for row in data] # Value is wrong. sql_data [(2.534023188e+17,)] This value should be [(253402318799999999,)]. > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > {noformat} > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > {noformat} > This value should be [(253402318799999999,)]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
[ https://issues.apache.org/jira/browse/SPARK-17035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-17035: --- Priority: Minor (was: Major) > Conversion of datetime.max to microseconds produces incorrect value > --- > > Key: SPARK-17035 > URL: https://issues.apache.org/jira/browse/SPARK-17035 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Michael Styles >Priority: Minor > > Conversion of datetime.max to microseconds produces incorrect value. For > example, > from datetime import datetime > from pyspark.sql import Row > from pyspark.sql.types import StructType, StructField, TimestampType > schema = StructType([StructField("dt", TimestampType(), False)]) > data = [{"dt": datetime.max}] > # convert python objects to sql data > sql_data = [schema.toInternal(row) for row in data] > # Value is wrong. > sql_data > [(2.534023188e+17,)] > This value should be [(253402318799999999,)]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17035) Conversion of datetime.max to microseconds produces incorrect value
Michael Styles created SPARK-17035: -- Summary: Conversion of datetime.max to microseconds produces incorrect value Key: SPARK-17035 URL: https://issues.apache.org/jira/browse/SPARK-17035 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Michael Styles Conversion of datetime.max to microseconds produces incorrect value. For example, from datetime import datetime from pyspark.sql import Row from pyspark.sql.types import StructType, StructField, TimestampType schema = StructType([StructField("dt", TimestampType(), False)]) data = [{"dt": datetime.max}] # convert python objects to sql data sql_data = [schema.toInternal(row) for row in data] # Value is wrong. sql_data [(2.534023188e+17,)] This value should be [(253402318799999999,)]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
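The root cause is float promotion in the conversion arithmetic; a minimal sketch, assuming seconds are derived via time.mktime for naive datetimes (so the exact digits depend on the local timezone):
{noformat}
import time
from datetime import datetime

dt = datetime.max  # 9999-12-31 23:59:59.999999
seconds = time.mktime(dt.timetuple())

# Buggy pattern: the float literal 1e6 promotes the result to float,
# which cannot represent 18-digit integers exactly
print(seconds * 1e6 + dt.microsecond)            # 2.534023188e+17

# Integer arithmetic preserves full microsecond precision
print(int(seconds) * 1000000 + dt.microsecond)   # 253402318799999999
{noformat}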