Hans Zeller created TRAFODION-2392:
--------------------------------------

             Summary: Avoid a costly sort for highly reducing TMUDFs
                 Key: TRAFODION-2392
                 URL: https://issues.apache.org/jira/browse/TRAFODION-2392
             Project: Apache Trafodion
          Issue Type: Improvement
          Components: sql-cmp
    Affects Versions: 2.0-incubating
         Environment: Any
            Reporter: Hans Zeller
            Assignee: Hans Zeller


When a TMUDF is invoked with an input table that specifies PARTITION BY, the 
Trafodion optimizer ensures that the input rows are sorted on (a permutation 
of) the PARTITION BY columns, so that each parallel TMUDF instance sees the 
rows of each logical partition contiguously. This way the TMUDF can process 
each group separately.
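
To illustrate why contiguity matters, here is a minimal sketch in Python. It 
is not the Trafodion UDR API; the function name, the tuple row layout and the 
SUM aggregate are assumptions made for illustration. With rows sorted on the 
PARTITION BY columns, a reducer can finish a group as soon as the key changes 
and only needs to hold state for one group at a time.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(rows, key_cols, value_col):
    """Sum value_col per group, assuming rows arrive sorted on key_cols.

    Hypothetical helper: rows are plain tuples, key_cols are column indexes.
    """
    key = itemgetter(*key_cols)
    # groupby only merges *adjacent* rows with equal keys, which is exactly
    # the contiguity guarantee the sort provides.
    for k, group in groupby(rows, key=key):
        yield k, sum(r[value_col] for r in group)


rows = [("a", 1, 10), ("a", 1, 5), ("b", 2, 7)]  # already sorted on cols 0, 1
result = list(reduce_sorted(rows, key_cols=(0, 1), value_col=2))
```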

This is usually a good way to process the data, except when we are dealing with 
a large input table and a TMUDF that greatly reduces the input data, producing 
far fewer output rows than input rows. In that case it may be better to 
maintain a hash table of groups in the TMUDF and to avoid the costly sort of 
the input table.
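
The hash-table alternative could look roughly like the following Python 
sketch. Again this is not Trafodion code; the function name, row layout and 
SUM aggregate are illustrative assumptions. The input may arrive in any 
order, at the cost of keeping one entry per group in memory and emitting 
results only after the last input row has been read.

```python
def reduce_hashed(rows, key_cols, value_col):
    """Sum value_col per group without requiring any input order.

    Hypothetical helper: rows are plain tuples, key_cols are column indexes.
    """
    groups = {}  # one running total per distinct PARTITION BY key
    for r in rows:
        k = tuple(r[c] for c in key_cols)
        groups[k] = groups.get(k, 0) + r[value_col]
    return groups


# Rows of the same group are deliberately non-contiguous here.
unsorted_rows = [("a", 1, 10), ("b", 2, 7), ("a", 1, 5)]
totals = reduce_hashed(unsorted_rows, key_cols=(0, 1), value_col=2)
```

This pays off only when the number of distinct groups is small enough for 
the table to fit in memory, which is the "highly reducing" case.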

My proposal is to add a new function type to UDRInvocationInfo.FunctionType, 
called REDUCER_NC (for Non-Contiguous). Setting the function type to this new 
type would indicate to the optimizer not to request a sort order on the 
partitioning columns.

The table below shows how the function type and PARTITION BY and ORDER BY 
clauses would determine the effective sort order produced by the optimizer:

||Function type||PARTITION BY||ORDER BY||Data is sorted by||
|REDUCER (existing)|a,b|c,d|a,b,c,d|
|REDUCER (existing)|a,b|<empty>|a,b|
|REDUCER_NC (proposed)|a,b|c,d|c,d|
|REDUCER_NC (proposed)|a,b|<empty>|<no sort>|

In all other aspects, REDUCER and REDUCER_NC function types would behave the 
same.
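
The rule in the table can be restated as a small Python function. This is 
only a sketch of the proposed behavior, not optimizer code; the function name 
and the use of strings for function types are assumptions. It also ignores 
that the optimizer may choose a permutation of the PARTITION BY columns.

```python
def effective_sort_order(function_type, partition_by, order_by):
    """Return the list of columns the input would be sorted by.

    An empty list means no sort is requested.
    """
    if function_type == "REDUCER":
        # Existing behavior: partitioning columns first, then ORDER BY.
        return list(partition_by) + list(order_by)
    if function_type == "REDUCER_NC":
        # Proposed behavior: no sort on the partitioning columns.
        return list(order_by)
    raise ValueError("unsupported function type: " + function_type)


rows_of_table = [
    effective_sort_order("REDUCER", ["a", "b"], ["c", "d"]),
    effective_sort_order("REDUCER", ["a", "b"], []),
    effective_sort_order("REDUCER_NC", ["a", "b"], ["c", "d"]),
    effective_sort_order("REDUCER_NC", ["a", "b"], []),
]
```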



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
