[ https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394991#comment-17394991 ]
David Li commented on ARROW-12873:
----------------------------------

So I tried implementing an arg_min_max node as part of ARROW-13540, using virtual columns to tag batches with their relative order (though it's not finished yet, I'll try to have it pushed on Monday). This is an aggregation which needs to know its inputs are ordered in some way.

Having the metadata as actual columns makes some things difficult: we need to branch when looking up the aggregate kernels, we have to register kernels with different arities (and hence duplicate or refactor some of the utilities used there), and we need to branch when feeding data into the kernels. Also, the GroupByNode and OrderByNode have to hardcode the position of the virtual column and ensure that they are consistent with each other (and any possible intermediate node needs to pass it forward) - this is *very* brittle, especially if/when we want to go back and add more metadata. In effect, virtual columns mean all node implementations are tightly coupled.

However, I don't think having separate metadata improves this much, as it instead carries the risk that the metadata isn't correctly updated as batches are manipulated. In effect, I don't think the metadata can feasibly be runtime extensible, and having any metadata will end up coupling node implementations in some way. So I agree with Antoine and Weston, and might go further and say that we necessarily have to know all the possible categories of metadata ahead of time, since nodes have to know what to do with it anyways, and this discussion might be moot (everything can just be explicit fields in ExecBatch).

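To make the coupling concrete, here is a minimal sketch of the two options for carrying batch order. It does not use Arrow's actual classes: `ToyBatch`, `kTagColumn`, `ArgMinState` and the helper functions are all hypothetical names invented for illustration, and the "arg min" here is a simplified stand-in for the real kernel. It only shows where the shared assumption lives in each design.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy stand-in for a batch. This is NOT arrow::compute::ExecBatch.
struct ToyBatch {
  std::vector<std::vector<int64_t>> columns;
  int64_t index = -1;  // hypothetical explicit ordering field (option B only)
};

// Option A: the order tag travels as an extra "virtual" column. Producer and
// consumer must both hardcode this position (an assumed convention), and any
// node in between has to forward the column untouched.
constexpr std::size_t kTagColumn = 1;

ToyBatch TagWithVirtualColumn(std::vector<int64_t> values, int64_t order) {
  ToyBatch batch;
  std::vector<int64_t> tag(values.size(), order);
  batch.columns.push_back(std::move(values));
  batch.columns.push_back(std::move(tag));
  return batch;
}

// Option B: the order is an explicit field on the batch itself.
ToyBatch TagWithField(std::vector<int64_t> values, int64_t order) {
  ToyBatch batch;
  batch.columns.push_back(std::move(values));
  batch.index = order;
  return batch;
}

// Running arg-min; ties go to the earliest (batch order, row) pair, which is
// why the aggregation needs to know the order of its inputs.
struct ArgMinState {
  bool has_value = false;
  int64_t min_value = 0;
  int64_t batch_order = -1;
  std::size_t row = 0;
};

void Update(ArgMinState* state, int64_t value, int64_t order, std::size_t row) {
  const bool earlier = order < state->batch_order ||
                       (order == state->batch_order && row < state->row);
  if (!state->has_value || value < state->min_value ||
      (value == state->min_value && earlier)) {
    state->has_value = true;
    state->min_value = value;
    state->batch_order = order;
    state->row = row;
  }
}

// Consumer for option A: takes two columns and relies on kTagColumn.
void ConsumeVirtual(const ToyBatch& batch, ArgMinState* state) {
  const std::vector<int64_t>& values = batch.columns[0];
  const std::vector<int64_t>& tags = batch.columns[kTagColumn];
  for (std::size_t i = 0; i < values.size(); ++i) {
    Update(state, values[i], tags[i], i);
  }
}

// Consumer for option B: takes one column and reads the explicit field.
void ConsumeField(const ToyBatch& batch, ArgMinState* state) {
  const std::vector<int64_t>& values = batch.columns[0];
  for (std::size_t i = 0; i < values.size(); ++i) {
    Update(state, values[i], batch.index, i);
  }
}

// Feeds the same two batches, arriving out of order, through both consumers
// and reports whether they pick the same row.
bool BothOptionsAgree() {
  ArgMinState a, b;
  ConsumeVirtual(TagWithVirtualColumn({5, 2}, 1), &a);
  ConsumeVirtual(TagWithVirtualColumn({2, 9}, 0), &a);
  ConsumeField(TagWithField({5, 2}, 1), &b);
  ConsumeField(TagWithField({2, 9}, 0), &b);
  return a.min_value == b.min_value && a.batch_order == b.batch_order &&
         a.row == b.row;
}
```

Both consumers compute the same answer. The difference is that option A needs a second, two-column code path plus an agreed column position, while option B keeps the single-column path and moves the shared assumption into the batch type.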
If there is format- or backend-specific information to carry (Felipe's point), an extension point in ExecBatch might still be useful, but it's hard to imagine how a generic ExecNode could know what to do with that metadata safely - I would say that custom metadata needs to be accompanied by custom ExecNodes, which pass the metadata around themselves and can't 'leak' it outside a subgraph consisting solely of these nodes.

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is,
> since they may not originate from the arrow library. For an example within
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and
> WriteNodes and it's useful to tag scanned batches with their {{Fragment}}
> of origin. However, adding {{ExecBatch::fragment}} would result in a cyclic
> dependency.
> To facilitate this tagging capability, we would need a type-erased container,
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
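For reference, the type-erased container proposed in the issue description could be fleshed out roughly as follows. This is a sketch under assumptions, not code from the Arrow tree: `tag_t` is assumed to be the address of a per-tag static object, `std::function` is swapped in for Arrow's `FnOnce`, and `kFragmentTag` and `RoundTripExample` are made-up names.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Assumption: a tag is identified by the address of some static object.
using tag_t = const void*;

class AnySet {
 public:
  // Returns the stored value for `tag`, or nullptr if nothing was set.
  void* Get(tag_t tag) const {
    auto it = entries_.find(tag);
    return it == entries_.end() ? nullptr : it->second.get();
  }

  // Takes ownership of `value`; `destructor` runs when the entry is replaced
  // or when the AnySet itself is destroyed.
  void Set(tag_t tag, void* value, std::function<void(void*)> destructor) {
    entries_[tag] = Entry(value, std::move(destructor));
  }

 private:
  using Entry = std::unique_ptr<void, std::function<void(void*)>>;
  std::unordered_map<tag_t, Entry> entries_;
};

// Hypothetical tag for "fragment of origin"; only its address matters.
static const char kFragmentTag = 0;

// Stores a string under the tag, reads it back, and checks that the
// destructor ran exactly once when the set went out of scope.
bool RoundTripExample() {
  int destroyed = 0;
  bool read_back_ok = false;
  {
    AnySet set;
    if (set.Get(&kFragmentTag) != nullptr) return false;
    set.Set(&kFragmentTag, new std::string("fragment-0"),
            [&destroyed](void* p) {
              delete static_cast<std::string*>(p);
              ++destroyed;
            });
    auto* stored = static_cast<std::string*>(set.Get(&kFragmentTag));
    read_back_ok = stored != nullptr && *stored == "fragment-0";
  }
  return read_back_ok && destroyed == 1;
}
```

Note that the caller has to know the concrete type behind each tag in order to cast the `void*` back. That is the same coupling discussed in the comment above: a generic node can forward such an entry, but only a node that already knows the tag can use it.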