[ 
https://issues.apache.org/jira/browse/ARROW-12873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17394991#comment-17394991
 ] 

David Li commented on ARROW-12873:
----------------------------------

So I tried implementing an arg_min_max node as part of ARROW-13540, using 
virtual columns to tag batches with their relative order (it's not finished 
yet; I'll try to have it pushed on Monday). This is an aggregation which needs 
to know that its inputs are ordered in some way.

Having the metadata as actual columns makes some things difficult: we need to 
branch when looking up the aggregate kernels, we have to register kernels with 
different arities (and hence duplicate or refactor some of the utilities used 
there), and we need to branch when feeding data into the kernels. Also, the 
GroupByNode and OrderByNode have to hardcode the position of the virtual column 
and ensure that they are consistent with each other (and any possible 
intermediate node needs to pass it forward) - this is *very* brittle, 
especially if/when we want to go back and add more metadata. In effect, virtual 
columns mean all node implementations are tightly coupled.
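
To make the position-coupling problem concrete, here is a minimal sketch in 
plain C++. It does not use Arrow; ToyBatch, TagWithIndex, BatchIndexOf and 
AppendDoubled are hypothetical names invented for illustration, and the "last 
column is the virtual column" convention is an assumption, not what the actual 
patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a batch of columns. NOT Arrow's ExecBatch.
using Column = std::vector<int64_t>;
struct ToyBatch {
  std::vector<Column> columns;
};

// Producer side: appends a "virtual" column holding the batch's position in
// the stream. The "last column" convention lives only here and in
// BatchIndexOf below; nothing in the type system enforces it.
ToyBatch TagWithIndex(ToyBatch batch, int64_t batch_index) {
  const std::size_t num_rows =
      batch.columns.empty() ? 0 : batch.columns[0].size();
  batch.columns.push_back(Column(num_rows, batch_index));
  return batch;
}

// Consumer side: has to hardcode the same position as the producer.
int64_t BatchIndexOf(const ToyBatch& batch) {
  assert(!batch.columns.empty() && !batch.columns.back().empty());
  return batch.columns.back().front();
}

// An intermediate node that knows nothing about the convention: it appends a
// derived column, which silently displaces the virtual column from the end.
ToyBatch AppendDoubled(ToyBatch batch) {
  assert(!batch.columns.empty());
  Column doubled = batch.columns[0];
  for (int64_t& v : doubled) v *= 2;
  batch.columns.push_back(doubled);
  return batch;
}
```

If a batch tagged with index 3 passes through AppendDoubled, BatchIndexOf 
returns a data value instead of 3, and nothing fails loudly. Every node between 
producer and consumer has to know about the convention to avoid this.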

However, I don't think having separate metadata improves this much, as it 
instead carries the risk that the metadata isn't correctly updated as batches 
are manipulated. In practice, I don't think the metadata can feasibly be 
runtime-extensible, and having any metadata will end up coupling node 
implementations in some way. So I agree with Antoine and Weston, and might go 
further and say that we necessarily have to know all the possible categories of 
metadata ahead of time, since nodes have to know what to do with it anyway. If 
so, this discussion might be moot (everything can just be explicit fields in 
ExecBatch). If there is format- or backend-specific information to carry 
(Felipe's point), 
an extension point in ExecBatch might be useful still, but it's hard to imagine 
how a generic ExecNode could know what to do with that metadata safely - I 
would say that custom metadata needs to be accompanied by custom ExecNodes, 
which pass the metadata around themselves and can't 'leak' it outside a 
subgraph consisting solely of these nodes.
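
As a rough illustration of the "explicit fields" option, the sketch below 
carries the batch order as a typed member and uses it in a streaming arg-min 
that tolerates out-of-order delivery. It is plain C++17 with no Arrow 
dependency; ToyExecBatch, its index field, RowPosition and ArgMinState are all 
hypothetical names, not the real ExecBatch layout or the ARROW-13540 
implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical batch type. `index` is an explicit, typed field rather than a
// column or an entry in a type-erased container. NOT Arrow's ExecBatch.
struct ToyExecBatch {
  std::vector<int64_t> values;
  int64_t index = 0;  // position of this batch in the ordered stream
};

// Position of a row in the overall stream: (batch index, row within batch).
struct RowPosition {
  int64_t batch_index;
  std::size_t row;
};

// Streaming arg-min that accepts batches in any arrival order. Ties on the
// value are resolved in favor of the earliest position in stream order.
class ArgMinState {
 public:
  void Consume(const ToyExecBatch& batch) {
    assert(batch.index >= 0);
    for (std::size_t row = 0; row < batch.values.size(); ++row) {
      const int64_t value = batch.values[row];
      const RowPosition pos{batch.index, row};
      if (!best_ || value < best_value_ ||
          (value == best_value_ && Before(pos, *best_))) {
        best_value_ = value;
        best_ = pos;
      }
    }
  }

  // Empty until at least one row has been consumed.
  const std::optional<RowPosition>& result() const { return best_; }

 private:
  static bool Before(const RowPosition& a, const RowPosition& b) {
    return a.batch_index != b.batch_index ? a.batch_index < b.batch_index
                                          : a.row < b.row;
  }

  int64_t best_value_ = 0;
  std::optional<RowPosition> best_;
};
```

The consumer reads batch.index by name, so there is no column position to agree 
on and no second kernel arity to register. The cost is the one described above: 
the set of fields is fixed at compile time, and any node that builds a new 
batch still has to remember to copy or update index.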

> [C++][Compute] Support tagging ExecBatches with arbitrary extra information
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-12873
>                 URL: https://issues.apache.org/jira/browse/ARROW-12873
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Ben Kietzman
>            Priority: Major
>
> Ideally, ExecBatches could be tagged with arbitrary optional objects for 
> tracing purposes and to transmit execution hints from one ExecNode to another.
> These should *not* be explicit members like ExecBatch::selection_vector is, 
> since they may not originate from the arrow library. For an example within 
> the arrow project: {{libarrow_dataset}} will be used to produce ScanNodes and 
> WriteNodes and it's useful to tag scanned batches with their {{Fragment}} 
> of origin. However adding {{ExecBatch::fragment}} would result in a cyclic 
> dependency.
> To facilitate this tagging capability, we would need a type-erased container, 
> something like
> {code}
> struct AnySet {
>   void* Get(tag_t tag);
>   void Set(tag_t tag, void* value, FnOnce<void(void*)> destructor);
> };
> {code}



