Re: [D] Acero streaming join support [arrow]

via GitHub Fri, 09 May 2025 01:30:07 -0700


GitHub user severinson edited a discussion: Acero streaming join support


Hey. Thanks again for the help with my last question, @amoeba @westonpace. I 
worked around the issue by converting my problem to a join on a single column.

However, I have another question: when doing an inner join on two datasets, 
Acero appears to load both the left and the right table entirely into memory. 
Under what circumstances (if any) can I expect Acero to do a streaming join, 
where only one of the two datasets is loaded into memory and the other is 
iterated over in chunks?
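
To make the question concrete, the streaming behaviour described above can be sketched in plain Java. This is only an illustration of the idea, not Acero's implementation: the class `StreamingJoinSketch`, the `Row` type, and the `innerJoin` method are hypothetical names, and no Arrow classes are used. The small side is materialized in a hash map and the large side is consumed one chunk at a time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration; none of these names come from Arrow or Acero. */
public class StreamingJoinSketch {

    /** A row reduced to a join key and a payload, for illustration only. */
    public static final class Row {
        final int key;
        final String value;

        public Row(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /**
     * Inner join where only the build side is held in memory.
     * The probe side arrives as an iterator of chunks, so at most one
     * probe chunk is resident at a time.
     */
    public static List<String> innerJoin(List<Row> buildSide,
                                         Iterator<List<Row>> probeChunks) {
        // Build phase: index every row of the small side by its join key.
        Map<Integer, List<Row>> table = new HashMap<>();
        for (Row r : buildSide) {
            table.computeIfAbsent(r.key, k -> new ArrayList<>()).add(r);
        }

        // Probe phase: stream the large side chunk by chunk.
        // A real engine would pass matches downstream instead of
        // collecting them; the list here just keeps the sketch short.
        List<String> out = new ArrayList<>();
        while (probeChunks.hasNext()) {
            for (Row probe : probeChunks.next()) {
                List<Row> matches = table.get(probe.key);
                if (matches == null) {
                    continue; // no build row with this key: inner join drops it
                }
                for (Row build : matches) {
                    out.add(build.value + "|" + probe.value);
                }
            }
        }
        return out;
    }
}
```

In this sketch, memory use is bounded by the build side plus one probe chunk, provided the output is consumed incrementally rather than accumulated.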

I’m assuming it tries to load both datasets into memory because my program 
crashes while trying to allocate more than 32 GB. The failing allocation comes 
from within the native executeSerializedPlan method.

I’m calling Acero from Java using its JNI bindings and the Acero Substrait 
consumer. I’m using Arrow 18. I’m joining a left dataset of about 200 MB with a 
right dataset of about 200 GB on a UInt8 column. I provide both datasets as Java 
ArrowReader objects to the Acero Substrait consumer.

I appreciate any help. Thanks :)

GitHub link: https://github.com/apache/arrow/discussions/46370

