sundy-li commented on pull request #9602: URL: https://github.com/apache/arrow/pull/9602#issuecomment-792216219
> For example, using a priority queue to keep only the top k values in memory.

Yes, but much of that code would duplicate the sort kernel. partial_sort already uses a priority queue internally.

It may be good to do the sorting in stages, as pipelined OLAP systems do. In ClickHouse, the pipeline is PartialSortingTransform (each block in each thread) --> MergeSortingTransform (blocks merged into one block in each thread) --> MergingSortedTransform (N blocks from N threads merged into one block).

```
┌─explain────────────────────────────────┐
│ (Expression)                           │
│ ExpressionTransform                    │
│ (Limit)                                │
│ Limit                                  │
│ (MergingSorted)                        │
│ MergingSortedTransform 16 → 1          │
│ (MergeSorting)                         │
│ MergeSortingTransform × 16             │
│ (PartialSorting)                       │
│ LimitsCheckingTransform × 16           │
│ PartialSortingTransform × 16           │
│ (Expression)                           │
│ ExpressionTransform × 16               │
│ (SettingQuotaAndLimits)                │
│ (ReadFromStorage)                      │
│ NumbersMt × 16 0 → 1                   │
└────────────────────────────────────────┘
```

@alamb @jorgecarleitao Thanks for all your reviews. I am also concerned that the unsafe code in partial_sort may break `arrow`, because it was only just written and has not been used in production (BTW, I am new to Rust). We can keep this MR open for now, until you think it is safe enough or that it is really needed.
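To make the staged idea above concrete, here is a minimal single-threaded sketch in safe Rust. It is not the code in this PR and not how ClickHouse implements its transforms. The function names `partial_sort_block`, `merge_sorted_limit` and `top_k` are hypothetical, and the two merge stages are collapsed into a single k-way merge for brevity.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stage 1 (hypothetical helper): keep only the k smallest values of one
/// block in a bounded max-heap, then return them in ascending order.
fn partial_sort_block(block: &[i64], k: usize) -> Vec<i64> {
    let mut heap: BinaryHeap<i64> = BinaryHeap::new();
    for &v in block {
        if heap.len() < k {
            heap.push(v);
        } else if let Some(&max) = heap.peek() {
            // The heap is full: replace its largest element if v is smaller.
            if v < max {
                heap.pop();
                heap.push(v);
            }
        }
    }
    heap.into_sorted_vec()
}

/// Stages 2 and 3 collapsed (hypothetical helper): k-way merge of already
/// sorted runs, stopping as soon as k values have been produced.
fn merge_sorted_limit(runs: &[Vec<i64>], k: usize) -> Vec<i64> {
    // Min-heap of (value, run index, position within that run).
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter().enumerate() {
        if let Some(&v) = run.first() {
            heap.push(Reverse((v, i, 0usize)));
        }
    }
    let mut out = Vec::new();
    while out.len() < k {
        match heap.pop() {
            Some(Reverse((v, i, pos))) => {
                out.push(v);
                // Advance within the run the value came from.
                if let Some(&next) = runs[i].get(pos + 1) {
                    heap.push(Reverse((next, i, pos + 1)));
                }
            }
            None => break, // every run is exhausted
        }
    }
    out
}

/// Partial-sort each block independently, then merge with a limit.
fn top_k(blocks: &[Vec<i64>], k: usize) -> Vec<i64> {
    let runs: Vec<Vec<i64>> = blocks
        .iter()
        .map(|block| partial_sort_block(block, k))
        .collect();
    merge_sorted_limit(&runs, k)
}
```

For example, `top_k(&[vec![5, 1, 9], vec![3, 7], vec![2, 8, 4]], 4)` yields `[1, 2, 3, 4]`. Each block contributes at most k values to the merge, so memory stays bounded by k times the number of blocks. In a real pipeline the per-block step would run in parallel across threads.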
