Dandandan commented on pull request #8983:
URL: https://github.com/apache/arrow/pull/8983#issuecomment-749744053


   @jorgecarleitao 
   
   Do you have an idea of how this could be solved?
   The _most efficient_ way I can think of / have read about would be to keep a 
boolean / bit vector per element or key on the left, then scan / filter the 
entries that are still unmarked at the end and produce the extra rows for them.
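
   To make the bit vector idea concrete, here is a minimal sketch using plain 
std collections rather than Arrow arrays. The function name 
`left_join_with_bitmap`, the `u32` keys and the `&str` values are all made up 
for illustration; none of this is DataFusion's actual API.

```rust
use std::collections::HashMap;

/// Hypothetical helper: left outer join of `left_keys` against a sequence of
/// right-side batches. Returns (left row index, optional right value) pairs.
fn left_join_with_bitmap(
    left_keys: &[u32],
    right_batches: &[Vec<(u32, &str)>],
) -> Vec<(usize, Option<String>)> {
    // Build side: map each key to the indices of the left rows that carry it.
    let mut index: HashMap<u32, Vec<usize>> = HashMap::new();
    for (i, k) in left_keys.iter().enumerate() {
        index.entry(*k).or_default().push(i);
    }

    // One flag per left row; it lives across all right-side batches.
    let mut matched = vec![false; left_keys.len()];
    let mut out = Vec::new();

    // Probe side: emit matches batch by batch and mark the left rows as seen.
    for batch in right_batches {
        for (k, v) in batch {
            if let Some(rows) = index.get(k) {
                for &i in rows {
                    matched[i] = true;
                    out.push((i, Some(v.to_string())));
                }
            }
        }
    }

    // After the last batch, scan the flags and emit the unmatched left rows.
    for (i, seen) in matched.iter().enumerate() {
        if !*seen {
            out.push((i, None));
        }
    }
    out
}
```

   The important part is that `matched` has to outlive every probe batch, so 
in a streaming design the final scan could only run once the right-side 
stream is exhausted.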
   
   Maybe an easier intermediate solution would be to iterate over the batches 
and update a `HashSet<Vec<u8>>` or something similar across all of them, the 
way it is currently done _per batch_.
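
   A rough sketch of that intermediate option, again with std types only. 
`unmatched_left_keys` is a hypothetical name, and the byte-vector keys stand 
in for whatever encoded key the join uses; this is not the existing per-batch 
code.

```rust
use std::collections::HashSet;

/// Hypothetical helper: collect the keys seen on the right side across all
/// batches, then return the left keys that never appeared.
fn unmatched_left_keys(
    left_keys: &[Vec<u8>],
    right_batches: &[Vec<Vec<u8>>],
) -> Vec<Vec<u8>> {
    // A single set shared by every batch instead of a fresh one per batch.
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    for batch in right_batches {
        for key in batch {
            seen.insert(key.clone());
        }
    }

    // Left keys that were never seen need extra rows with nulls on the right.
    left_keys
        .iter()
        .filter(|k| !seen.contains(*k))
        .cloned()
        .collect()
}
```

   This clones and hashes every key, so it should cost more than the bit 
vector, but it only needs one piece of state that is carried between batches.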
   
   But I'm not sure how either option fits into the current design (with 
`SendableRecordBatchStream`, `poll_next`, etc.).
   
   So if you have any hints, they would be much appreciated :) 
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

