[ 
https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569935#comment-17569935
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-17183:
--------------------------------------------------

Specifically about this case, we assume that this Fetch and Sort operations are 
the most external relations in a Substrait plan. Meaning the Sort or Fetch 
operation is called at the end of the other operations. This is not a very 
accurate representation. First we need to understand if this is the general 
case. cc [~westonpace] [~jvanstraten] 

There can be a few settings where these two operations are applied. 

*Sort Only*

If the query only has a Sorting operation instead of adding the 
`SinkNodeConsumer`, we need to add a `OrderBySinkNodeConsumer`. 

*Fetch Only*

If the query only has a Fetch operation, we can include a node with fetch 
capability. At the moment we don't have a node with Fetch capability, so this 
may need to be included where we could be able to use a siimilar logic used in 
`SelectK` node.

*SortAndFetch or FetchAndSort*

If the query contains both sort and fetch in a given order, there has to be a 
single Sink node which can do this operation by the given order. When scanning 
the plan components, when we find a sort we just add a `OrderBySink` and keep 
adding other relations. If we find a Fetch operation, this needs to be replaced 
with a SortAndFetch operation where sorting is done first and fetching is done 
next. And this can go vice-versa. 

*Another Approach:*

Another approach is that we define a sink node which can execute a function 
which does the expected operation. In some of the defined Sink nodes (KSelect, 
Sort) there is a function called {*}`{*}DoFinish{*}`.{*} We should be able to 
call a custom function within this call. So from Substrait end when we extract 
the plan, then we can write the required `std::function` which would be an 
option for this custom sink node. And we assume that a table as input and write 
the logic. This way we don't have to introduce new nodes. And what if there are 
different capabilities users need and ACERO has a limitation, can we always 
keep adding nodes to fullfil that? I am not so sure. This is just a high level 
thought process.

Although I have implemented a _SortAndFetch_ node which can perform the fetch 
followed by a sort just by following what is being done in Sort and SelectK 
nodes. But I am not exactly sure any of these approaches are optimized or the 
best way to solve the problem.  

This is the doubtful component which I am not quite clear what would be the 
most optimize way to include this capability. Or if there is a better way to do 
this. Appreciate your thoughts. 

cc [~westonpace] [~jvanstraten] [~bkietz] [~icook] 

> [C++] Adding ExecNode with Sort and Fetch capability
> ----------------------------------------------------
>
>                 Key: ARROW-17183
>                 URL: https://issues.apache.org/jira/browse/ARROW-17183
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Vibhatha Lakmal Abeykoon
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>
> In Substrait integrations with ACERO, a functionality required is the ability 
> to fetch records sorted and unsorted.
> Fetch operation is defined as selecting `K` number of records with an offset. 
> For instance pick 10 records skipping the first 5 elements. Here we can 
> define this as a Slice operation and records can be easily extracted in a 
> sink-node. 
> Sort and Fetch operation applies when we need to execute a Fetch operation on 
> sorted data. The main issue is we cannot have a sort node followed by a 
> fetch. The reason is that all existing node definitions supporting sort are 
> based on sink nodes. Since there cannot be a node followed by sink, this 
> functionality has to take place in a single node. 
> But this is not a perfect solution for fetch and sort, but one way to do this 
> is define a sink node where the records are sorted and then a set of items 
> are fetched. 
> Another dilema is what if sort is followed by a fetch. In that case, there 
> has to be a flag to enable the order of the operations. 
> The objective of this ticket is to discuss a viable efficient solution and 
> include new nodes or a method to execute such a logic.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to