[ https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395006#comment-17395006 ]
Weston Pace commented on ARROW-12873:
-------------------------------------

I thought the original proposal was tagging record batches with arbitrary void* pointers. It's possible I'm not explaining myself well. If you'll allow some pseudocode here to avoid the complexity of the exec plan...

What we have today is:

{code:python}
exec_batch_with_order_at_back = order_by_node(in_batch)
grouped_output = group_by_node(exec_batch_with_order_at_back, kernel_that_can_use_order)

def group_by_node(batch, agg_kernel):
    group_ids = grouper(batch)
    mashed_together_batch = {group_ids, batch}
    if can_use_order(agg_kernel):
        agg_kernel(mashed_together_batch, mashed_together_batch[-1])
    else:
        agg_kernel(mashed_together_batch)
{code}

I'm proposing (and this may not make any sense at all):

{code:python}
exec_batch, order = order_by_node(in_batch)
grouped_output = group_by_node(exec_batch, kernel_that_can_use_order, extra_inputs=[order])

def group_by_node(batch, agg_kernel, extra_inputs=[]):
    group_ids = grouper(batch)
    mashed_together_batch = {group_ids, batch}
    agg_kernel(mashed_together_batch, *extra_inputs)
{code}

The kernels still need different arities, which I think is OK, but you don't have to do the branching.

Also, is there a reason (there probably is) we don't require aggregate kernels to be 2+ arity:

{code:python}
def group_by_node(batch, agg_kernel, extra_inputs=[]):
    group_ids = grouper(batch)
    agg_kernel(batch, group_ids, *extra_inputs)
{code}

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is,
> since they may not originate from the arrow library. For an example within
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and
> WriteNodes and it's useful to tag scanned batches with their {{Fragment}}
> of origin. However adding {{ExecBatch::fragment}} would result in a cyclic
> dependency.
> To facilitate this tagging capability, we would need a type erased container
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
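
For anyone who wants to poke at the 2+ arity variant from the comment, here is a rough runnable sketch in plain Python. Everything in it is an assumption made for illustration: a batch is modelled as a dict of equal-length lists, the column names ({{key}}, {{ts}}, {{value}}) are made up, and {{first_in_order}} and {{count}} are hypothetical kernels. None of this is Arrow API. It only shows that the node never needs to branch on whether a kernel can use the order.

```python
def order_by_node(in_batch):
    """Return the batch untouched plus the ordering as a separate value.

    The ordering is a list of row indices sorted by the (assumed) "ts" column.
    """
    ts = in_batch["ts"]
    order = sorted(range(len(ts)), key=lambda i: ts[i])
    return in_batch, order


def grouper(batch):
    """Assign a dense integer id to each distinct value of the "key" column."""
    ids = {}
    return [ids.setdefault(k, len(ids)) for k in batch["key"]]


def count(batch, group_ids):
    """Arity-2 kernel: number of rows per group. Ignores any ordering."""
    out = {}
    for g in group_ids:
        out[g] = out.get(g, 0) + 1
    return out


def first_in_order(batch, group_ids, order):
    """Arity-3 kernel: first "value" per group when rows are visited in `order`."""
    out = {}
    for row in order:
        out.setdefault(group_ids[row], batch["value"][row])
    return out


def group_by_node(batch, agg_kernel, extra_inputs=()):
    """Forward whatever extra inputs were supplied; no can_use_order branch."""
    group_ids = grouper(batch)
    return agg_kernel(batch, group_ids, *extra_inputs)


in_batch = {
    "key": ["a", "b", "a", "b"],
    "ts": [3, 1, 2, 4],
    "value": [10, 20, 30, 40],
}
exec_batch, order = order_by_node(in_batch)

ordered_result = group_by_node(exec_batch, first_in_order, extra_inputs=[order])
plain_result = group_by_node(exec_batch, count)
```

The caller that wires up the plan decides which extra inputs a kernel receives, so {{group_by_node}} stays the same for both kernels. The cost is the one already noted in the comment: kernels end up with different arities.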