Dandandan commented on pull request #8983: URL: https://github.com/apache/arrow/pull/8983#issuecomment-749744053
@jorgecarleitao Do you have an idea of how this could be solved? The _most efficient_ way I can think of (and have read about) would be to keep a boolean / bit vector with one entry per element or key on the left side, then at the end scan for the entries that were never marked and produce the extra rows for those. An easier intermediate solution might be to iterate over the batches and update a `HashSet<Vec<u8>>` or something similar, like what is currently done _per batch_. But I'm not sure how either option fits into the current design (with `SendableRecordBatchStream`, poll next, etc.), so any hints would be much appreciated :)
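
To make the bit-vector idea concrete, here is a minimal standalone sketch. It is not DataFusion code: `Row`, `LeftSide`, `probe`, and `finish` are hypothetical names, rows are plain tuples rather than `RecordBatch`es, and the `SendableRecordBatchStream` polling is left out entirely. It only assumes the left side is collected once and the right side arrives batch by batch.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a row of a record batch: (join key, payload).
type Row = (u32, &'static str);

/// Left side of the join, collected once, plus one "matched" flag per left row.
struct LeftSide {
    rows: Vec<Row>,
    /// Join key -> positions in `rows` that carry this key.
    index: HashMap<u32, Vec<usize>>,
    /// One flag per left row; a packed bitmap would work the same way.
    matched: Vec<bool>,
}

impl LeftSide {
    fn new(rows: Vec<Row>) -> Self {
        let mut index: HashMap<u32, Vec<usize>> = HashMap::new();
        for (i, (key, _)) in rows.iter().enumerate() {
            index.entry(*key).or_default().push(i);
        }
        let matched = vec![false; rows.len()];
        LeftSide { rows, index, matched }
    }

    /// Probe with one right-side batch: emit the matching pairs and
    /// mark every left row that was hit.
    fn probe(&mut self, right: &[Row]) -> Vec<(Row, Option<Row>)> {
        let mut out = Vec::new();
        for r in right {
            if let Some(positions) = self.index.get(&r.0) {
                for &i in positions {
                    self.matched[i] = true;
                    out.push((self.rows[i], Some(*r)));
                }
            }
        }
        out
    }

    /// After the last right-side batch: emit the left rows that never
    /// matched, padded with `None` on the right.
    fn finish(&self) -> Vec<(Row, Option<Row>)> {
        self.rows
            .iter()
            .zip(&self.matched)
            .filter(|(_, m)| !**m)
            .map(|(row, _)| (*row, None))
            .collect()
    }
}
```

In a stream-based design, `probe` would presumably run once per polled right-side batch, and `finish` would produce one extra batch when the right stream reports that it is exhausted. The `matched` flags are the only state that has to survive across batches.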