[jira] [Commented] (SPARK-28304) FileFormatWriter introduces an unconditional sort, even when all attributes are constants

2019-07-20 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889436#comment-16889436
 ] 

Eyal Farago commented on SPARK-28304:
-

[~joshrosen], thanks for your comment.

I think this subject is broader than just FileFormatWriter: any sort 
operator can either simplify its ordering or be eliminated completely when 
some or all of the sort columns are known to be constant.

Furthermore, in some cases one ordering can satisfy several orderings (or, 
conversely, the ordering requirements of downstream operators can be 
relaxed), so I believe this is best handled by the optimizer/planner, 
essentially by making EnsureRequirements aware of these kinds of cases. As a 
short-term fix, the SortExec operator can filter out constant ordering columns 
in the execute() method and, if it is left with no ordering columns, simply 
bypass the sort altogether. 
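
To make the short-term fix concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: `SortKey`, `prune_constant_keys` and `sort_rows` are hypothetical names, and the set of constant columns is assumed to be known up front (in Spark it would have to be derived from the plan).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Set


@dataclass(frozen=True)
class SortKey:
    """Hypothetical stand-in for one entry of a sort ordering."""
    column: str
    ascending: bool = True


def prune_constant_keys(ordering: Sequence[SortKey],
                        constant_columns: Set[str]) -> List[SortKey]:
    """Drop keys on columns known to be constant; they cannot affect row order."""
    return [key for key in ordering if key.column not in constant_columns]


def sort_rows(rows: Sequence[Dict[str, object]],
              ordering: Sequence[SortKey],
              constant_columns: Set[str]) -> List[Dict[str, object]]:
    """Sort rows by the ordering, skipping the sort if only constant keys remain."""
    effective = prune_constant_keys(ordering, constant_columns)
    result = list(rows)
    if not effective:
        # Nothing left to sort on: bypass the sort and keep the input order.
        return result
    # Python's sort is stable, so applying the keys from least to most
    # significant yields a correct multi-key sort.
    for key in reversed(effective):
        result.sort(key=lambda row, col=key.column: row[col],
                    reverse=not key.ascending)
    return result


# Usage: batch_id is constant within a single load, so it is pruned.
rows = [
    {"batch_id": "b1", "value": 3},
    {"batch_id": "b1", "value": 1},
    {"batch_id": "b1", "value": 2},
]
unsorted_copy = sort_rows(rows, [SortKey("batch_id")], {"batch_id"})
by_value = sort_rows(rows, [SortKey("batch_id"), SortKey("value")], {"batch_id"})
```

In this sketch the first call returns the rows in their input order without sorting, while the second sorts on `value` only.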

BTW, is there any reason the FileFormatWriter doesn't take the regular 
optimizing/planning path?

 

> FileFormatWriter introduces an unconditional sort, even when all attributes 
> are constants
> 
>
> Key: SPARK-28304
> URL: https://issues.apache.org/jira/browse/SPARK-28304
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2
> Reporter: Eyal Farago
> Priority: Major
> Labels: performance
>
> FileFormatWriter derives a required sort order based on the partition 
> columns, bucketing columns and explicitly required ordering. However, in some 
> use cases some (or even all) of these fields are constant; in these cases the 
> sort can be skipped.
> E.g., in my use case we add a GUID column identifying a specific 
> (incremental) load; this can be thought of as a batch id. Since we run one 
> batch at a time, this column is always constant, which means there's no need 
> to sort on it. Since we don't use bucketing or require an explicit ordering, 
> the entire sort can be skipped in our case.
>  
> I suggest:
>  # filtering constant columns out of the required ordering calculated by 
> FileFormatWriter.
>  # generalizing this to any Sort operator in a Spark plan.
>  # introducing optimizer rules to remove constants from sort orderings, 
> potentially eliminating the sort operator altogether.
>  # modifying EnsureRequirements to be aware of constant fields when deciding 
> whether to introduce a sort.
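
As a rough illustration of the last suggestion, the check below decides whether an existing ordering already satisfies a required one once constant columns are ignored. It is a toy model in plain Python, not Spark's EnsureRequirements: the function name `ordering_satisfies`, the `(column, ascending)` tuple representation and the explicit `constant_columns` set are all assumptions made for the example.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical representation of one ordering entry: (column name, ascending?)
OrderKey = Tuple[str, bool]


def ordering_satisfies(existing: Sequence[OrderKey],
                       required: Sequence[OrderKey],
                       constant_columns: Set[str]) -> bool:
    """Return True if data sorted by `existing` is also sorted by `required`,
    given that every column in `constant_columns` holds a single value."""
    def strip(ordering: Sequence[OrderKey]) -> List[OrderKey]:
        # Constant columns never change the relative order of rows.
        return [key for key in ordering if key[0] not in constant_columns]

    needed = strip(required)
    have = strip(existing)
    # The required keys must form a prefix of the keys the data is sorted by.
    return have[:len(needed)] == needed


# Usage: the writer requires (batch_id, day); the child is sorted by day only.
needs_sort_without_hint = not ordering_satisfies(
    [("day", True)], [("batch_id", True), ("day", True)], set())
needs_sort_with_hint = not ordering_satisfies(
    [("day", True)], [("batch_id", True), ("day", True)], {"batch_id"})
```

Without the constant-column information a sort would be introduced in this example; with it, the existing ordering is enough and the sort can be skipped.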



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28304) FileFormatWriter introduces an unconditional sort, even when all attributes are constants

2019-07-14 Thread Josh Rosen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884801#comment-16884801
 ] 

Josh Rosen commented on SPARK-28304:


I'm linking SPARK-21317 as a related issue: that older ticket was about 
avoiding unnecessary sorts in FileFormatWriter when data is pre-bucketed. 
That's not _quite_ the same as the issue proposed here (which deals with the 
special case where the partition columns are constant), but it seems pretty 
closely related.

/cc [~pwoody] (who submitted a PR for that other ticket)




[jira] [Commented] (SPARK-28304) FileFormatWriter introduces an unconditional sort, even when all attributes are constants

2019-07-08 Thread Eyal Farago (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880402#comment-16880402
 ] 

Eyal Farago commented on SPARK-28304:
-

cc [~cloud_fan],[~hvanhovell]
