GitHub user severinson edited a discussion: Acero streaming join support
Hey. Thanks again for the help with my last question, @amoeba @westonpace. I worked around the issue by converting my problem to a join on a single column.

However, I have another question: when doing an inner join on two datasets, Acero appears to load both the entire left and the entire right table into memory. In what circumstances (if any) could I expect Acero to do a streaming join, where only one of the two datasets is loaded into memory and the other is iterated over in chunks?

I’m assuming it tried to load the entire dataset into memory because my program crashes while trying to allocate more than 32 GB. The failing allocation comes from within the executeSerializedPlan native method.

I’m calling Acero from Java using its JNI bindings and the Acero Substrait consumer, with Arrow 18. I’m joining a left dataset of about 200 MB with a right dataset of about 200 GB on a UInt8 column. I provide both datasets as Java ArrowReader objects to the Acero Substrait consumer.

I appreciate any help. Thanks :)

GitHub link: https://github.com/apache/arrow/discussions/46370
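
For illustration, the "streaming join" behavior asked about above can be sketched as a build-then-probe hash join in plain Java. This is only a sketch of the concept and makes no claim about how Acero implements joins or what its API looks like; the names StreamingJoinSketch, Row, build and probe are hypothetical, and rows are modeled as simple (key, payload) pairs rather than Arrow record batches.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class StreamingJoinSketch {

    /** Hypothetical row type: a join key plus an opaque payload. */
    static final class Row {
        final int key;
        final String payload;

        Row(int key, String payload) {
            this.key = key;
            this.payload = payload;
        }
    }

    /** Build phase: the small side is loaded fully into a hash table on the join key. */
    static Map<Integer, List<Row>> build(List<Row> smallSide) {
        Map<Integer, List<Row>> table = new HashMap<>();
        for (Row r : smallSide) {
            table.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }
        return table;
    }

    /**
     * Probe phase: the large side arrives one chunk at a time. Only the hash
     * table and the current chunk are held; matches go straight to the sink.
     */
    static void probe(Map<Integer, List<Row>> table,
                      Iterator<List<Row>> largeSideChunks,
                      Consumer<String> sink) {
        while (largeSideChunks.hasNext()) {
            List<Row> chunk = largeSideChunks.next();
            for (Row right : chunk) {
                for (Row left : table.getOrDefault(right.key, List.of())) {
                    sink.accept(left.payload + "|" + right.payload);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Small (left) side: fits in memory.
        List<Row> left = List.of(new Row(1, "a"), new Row(2, "b"));

        // Large (right) side: modeled as an iterator over chunks.
        Iterator<List<Row>> rightChunks = List.of(
                List.of(new Row(1, "x"), new Row(3, "y")),
                List.of(new Row(1, "z"))
        ).iterator();

        probe(build(left), rightChunks, System.out::println);
    }
}
```

In this model, peak memory is bounded by the size of the small side plus one chunk of the large side, provided the sink does not buffer the output. Note that with a low-cardinality key, the join output itself can be far larger than either input.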
