Yes, I think you mean this post by Weston <https://lists.apache.org/thread/llfm5dfh2988w2w4j6off417w9szp1tg>. I'll look into adding this "sequential" option to the source node and report back.

Yaron.
________________________________
From: Li Jin <ice.xell...@gmail.com>
Sent: Monday, July 25, 2022 11:39 AM
To: dev@arrow.apache.org <dev@arrow.apache.org>
Subject: Re: [C++] Clarifying the behavior of source node and executor

Now that I think about it more, Weston has probably answered this in another mailing thread: this is not guaranteed, and our observation of batches coming out in order from file reader + source node happened by chance.

Perhaps we can look into adding an option to Source node to ensure "sequential".

Li

On Mon, Jul 25, 2022 at 11:18 AM Yaron Gvili <rt...@hotmail.com> wrote:

> I've also been using source node with a generator, but observed batches in
> random order (in a 1-to-2-months old version of Arrow). So, I'd be
> surprised if ordering is guaranteed, and I'm also interested in how to
> obtain such a guarantee.
>
>
> Yaron.
> ________________________________
> From: Li Jin <ice.xell...@gmail.com>
> Sent: Monday, July 25, 2022 11:10 AM
> To: dev@arrow.apache.org <dev@arrow.apache.org>
> Subject: Re: [C++] Clarifying the behavior of source node and executor
>
> Sorry the link to the generator above is wrong - We traced into the code
> and found it uses BackgroundGenerator:
>
> https://github.com/apache/arrow/blob/78fb2edd30b602bd54702896fa78d36ec6fefc8c/cpp/src/arrow/util/async_generator.h#L1581
>
> On Mon, Jul 25, 2022 at 11:07 AM Li Jin <ice.xell...@gmail.com> wrote:
>
> > Hi,
> >
> > Ivan and I are debugging some behavior of the source node this morning and
> > I was hoping to clarify that our understanding is correct.
> >
> > We observed that when using source node with a generator:
> >
> > https://github.com/apache/arrow/blob/66c66d040bbf81a4819b276aee306625dc02837c/cpp/src/arrow/compute/exec/options.h#L54
> >
> > The source node becomes "sequential" (batches come out in order one at a
> > time) even with a GetCpuThreadPool() attached to the plan.
> >
> > We traced the code into this class:
> >
> > https://github.com/apache/arrow/blob/78fb2edd30b602bd54702896fa78d36ec6fefc8c/cpp/src/arrow/util/async_generator.h#L316
> >
> > And it seems like because of the synchronization of this class, it
> > generates batches sequentially. Is this the correct understanding, and is it
> > intentional that the source node is sequential when backed by a
> > generator? (This is actually the behavior that we want)
> >
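
One generic way to obtain the "sequential" guarantee discussed in this thread, regardless of how the upstream generator or thread pool schedules work, is to tag each batch with its position in the stream and release batches downstream strictly in index order. The sketch below illustrates that idea in plain C++. `IndexedBatch` and `SequencingQueue` are hypothetical names invented for this example; they are not part of Arrow's API, and this is not a description of how the source node is implemented.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a batch tagged with its position in the source
// stream (not an Arrow type).
struct IndexedBatch {
  int64_t index;
  std::string payload;
};

// Hypothetical helper: accepts batches in any order from any thread and
// forwards them to `sink` strictly in increasing index order (0, 1, 2, ...).
class SequencingQueue {
 public:
  using Sink = std::function<void(const IndexedBatch&)>;

  explicit SequencingQueue(Sink sink) : sink_(std::move(sink)) {}

  void Push(IndexedBatch batch) {
    std::lock_guard<std::mutex> lock(mutex_);
    const int64_t index = batch.index;
    pending_.emplace(index, std::move(batch));
    // Drain every batch that is now contiguous with what was already emitted.
    while (!pending_.empty() && pending_.begin()->first == next_index_) {
      sink_(pending_.begin()->second);
      pending_.erase(pending_.begin());
      ++next_index_;
    }
  }

 private:
  Sink sink_;
  std::mutex mutex_;
  std::map<int64_t, IndexedBatch> pending_;  // early arrivals, keyed by index
  int64_t next_index_ = 0;                   // next index allowed downstream
};
```

Pushing batches with indices 2, 0, 1 would deliver them to the sink as 0, 1, 2: index 2 is held back until 0 and 1 have arrived. The sketch calls the sink while holding the lock, which keeps delivery serialized but also means a slow consumer blocks producers. It does not handle duplicate or missing indices.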