[ https://issues.apache.org/jira/browse/ARROW-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17569935#comment-17569935 ]
Vibhatha Lakmal Abeykoon commented on ARROW-17183: -------------------------------------------------- Specifically about this case, we assume that this Fetch and Sort operations are the most external relations in a Substrait plan. Meaning the Sort or Fetch operation is called at the end of the other operations. This is not a very accurate representation. First we need to understand if this is the general case. cc [~westonpace] [~jvanstraten] There can be a few settings where these two operations are applied. *Sort Only* If the query only has a Sorting operation instead of adding the `SinkNodeConsumer`, we need to add a `OrderBySinkNodeConsumer`. *Fetch Only* If the query only has a Fetch operation, we can include a node with fetch capability. At the moment we don't have a node with Fetch capability, so this may need to be included where we could be able to use a siimilar logic used in `SelectK` node. *SortAndFetch or FetchAndSort* If the query contains both sort and fetch in a given order, there has to be a single Sink node which can do this operation by the given order. When scanning the plan components, when we find a sort we just add a `OrderBySink` and keep adding other relations. If we find a Fetch operation, this needs to be replaced with a SortAndFetch operation where sorting is done first and fetching is done next. And this can go vice-versa. *Another Approach:* Another approach is that we define a sink node which can execute a function which does the expected operation. In some of the defined Sink nodes (KSelect, Sort) there is a function called {*}`{*}DoFinish{*}`.{*} We should be able to call a custom function within this call. So from Substrait end when we extract the plan, then we can write the required `std::function` which would be an option for this custom sink node. And we assume that a table as input and write the logic. This way we don't have to introduce new nodes. And what if there are different capabilities users need and ACERO has a limitation, can we always keep adding nodes to fullfil that? I am not so sure. This is just a high level thought process. Although I have implemented a _SortAndFetch_ node which can perform the fetch followed by a sort just by following what is being done in Sort and SelectK nodes. But I am not exactly sure any of these approaches are optimized or the best way to solve the problem. This is the doubtful component which I am not quite clear what would be the most optimize way to include this capability. Or if there is a better way to do this. Appreciate your thoughts. cc [~westonpace] [~jvanstraten] [~bkietz] [~icook] > [C++] Adding ExecNode with Sort and Fetch capability > ---------------------------------------------------- > > Key: ARROW-17183 > URL: https://issues.apache.org/jira/browse/ARROW-17183 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ > Reporter: Vibhatha Lakmal Abeykoon > Assignee: Vibhatha Lakmal Abeykoon > Priority: Major > > In Substrait integrations with ACERO, a functionality required is the ability > to fetch records sorted and unsorted. > Fetch operation is defined as selecting `K` number of records with an offset. > For instance pick 10 records skipping the first 5 elements. Here we can > define this as a Slice operation and records can be easily extracted in a > sink-node. > Sort and Fetch operation applies when we need to execute a Fetch operation on > sorted data. The main issue is we cannot have a sort node followed by a > fetch. The reason is that all existing node definitions supporting sort are > based on sink nodes. Since there cannot be a node followed by sink, this > functionality has to take place in a single node. > But this is not a perfect solution for fetch and sort, but one way to do this > is define a sink node where the records are sorted and then a set of items > are fetched. > Another dilema is what if sort is followed by a fetch. In that case, there > has to be a flag to enable the order of the operations. > The objective of this ticket is to discuss a viable efficient solution and > include new nodes or a method to execute such a logic. > -- This message was sent by Atlassian Jira (v8.20.10#820010)