> Can anyone guide what's the best practice here and if my
below understandings are correct:

Here's an example:
https://gist.github.com/westonpace/6f7fdbdc0399501418101851d75091c4

> I receive streaming data via a callback function which gives me data row
by row. To my best knowledge, Subclassing RecordBatchReader is preferred?

RecordBatchReader is pull-based.  A friendly push-based source node for
Acero would be a good idea, but one doesn't exist yet.  You can build one
using a PushGenerator, but that might be complicated.  In the meantime,
subclassing RecordBatchReader is probably the simplest option.

> Should I batch a fixed number rows in some in memory data structure
first, then flush them to acero?

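Yes, for performance.  Per-batch overhead is amortized across the rows in a
batch, so handing Acero one row at a time would be slow.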
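
As a rough illustration of the batching step, here is a minimal sketch that
buffers rows column-wise and flushes once a fixed row count is reached.  It
deliberately uses only the standard library: Row, ColumnBatch and RowBatcher
are hypothetical names invented for this example, not Arrow types.  In a real
program the flushed batch would more likely be a
std::shared_ptr<arrow::RecordBatch> assembled with Arrow's array builders.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical row type delivered by the streaming callback.
struct Row {
  std::int64_t id;
  std::string key;
};

// Hypothetical column-oriented batch; a stand-in for an Arrow record batch.
struct ColumnBatch {
  std::vector<std::int64_t> ids;
  std::vector<std::string> keys;
  std::size_t num_rows() const { return ids.size(); }
};

class RowBatcher {
 public:
  using FlushFn = std::function<void(ColumnBatch)>;

  RowBatcher(std::size_t batch_size, FlushFn on_flush)
      : batch_size_(batch_size), on_flush_(std::move(on_flush)) {}

  // Called once per incoming row, e.g. from the streaming callback.
  void Append(const Row& row) {
    current_.ids.push_back(row.id);
    current_.keys.push_back(row.key);
    if (current_.num_rows() >= batch_size_) Flush();
  }

  // Emits any buffered rows; call once more at end of stream so the
  // final, partially filled batch is not lost.
  void Flush() {
    if (current_.num_rows() == 0) return;
    ColumnBatch full = std::move(current_);
    current_ = ColumnBatch{};  // reset the moved-from buffer
    on_flush_(std::move(full));
  }

 private:
  std::size_t batch_size_;
  FlushFn on_flush_;
  ColumnBatch current_;
};
```

With a batch size of 3, appending 7 rows and then calling Flush() hands the
callback three batches of 3, 3 and 1 rows.  The right batch size depends on
your row width and latency needs, so treat the number as a tuning knob.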

> Then how could acero know it's time to push data in ReadNext
<https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow17RecordBatchReader8ReadNextEPNSt10shared_ptrI11RecordBatchEE>
 function?

ReadNext is pull-based.  Acero will call ReadNext repeatedly.  Your
implementation should block until sufficient data is available.
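
To make "block until sufficient data is available" concrete, here is a
minimal standard-library sketch of the blocking pattern.  Batch and
BlockingBatchSource are hypothetical stand-ins made up for this example.  In
real code the class would subclass arrow::RecordBatchReader, ReadNext would
return an arrow::Status and fill a std::shared_ptr<arrow::RecordBatch>, and
you would also need to implement schema().  The sketch keeps the same
convention that a null batch signals end of stream.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for arrow::RecordBatch.
struct Batch {
  std::vector<int> values;
};

class BlockingBatchSource {
 public:
  // Producer side: called from the streaming callback once a batch is full.
  void Push(std::shared_ptr<Batch> batch) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(batch));
    }
    ready_.notify_one();
  }

  // Producer side: no more data will arrive.
  void Finish() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      finished_ = true;
    }
    ready_.notify_all();
  }

  // Consumer side: blocks until a batch is queued or the stream is finished.
  void ReadNext(std::shared_ptr<Batch>* out) {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return !queue_.empty() || finished_; });
    if (queue_.empty()) {
      out->reset();  // null batch: end of stream
      return;
    }
    *out = std::move(queue_.front());
    queue_.pop_front();
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::deque<std::shared_ptr<Batch>> queue_;
  bool finished_ = false;
};

// Pushes num_batches batches from the calling thread while a second thread
// (standing in for whatever drives the plan) pulls them.  Returns the number
// of batches the consumer saw.
std::size_t RunDemo(int num_batches) {
  BlockingBatchSource source;
  std::size_t consumed = 0;
  std::thread consumer([&] {
    std::shared_ptr<Batch> batch;
    while (true) {
      source.ReadNext(&batch);
      if (!batch) break;  // end of stream
      ++consumed;
    }
  });
  for (int i = 0; i < num_batches; ++i) {
    auto batch = std::make_shared<Batch>();
    batch->values = {i, i + 1, i + 2};
    source.Push(std::move(batch));
  }
  source.Finish();
  consumer.join();
  return consumed;
}
```

The queue here is unbounded, which keeps the sketch short.  If the callback
can outpace the writer, a bounded queue that also blocks in Push would give
you backpressure.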

> I have a question on how to use the Acero push model to write streaming
data as hive partitioning Parquet in a single thread program

The example I gave will use one thread in addition to the main thread.
Doing something without any additional threads at all would be possible,
but would probably require knowing more about the structure of your program.

On Thu, May 18, 2023 at 3:27 PM Haocheng Liu <lbtin...@gmail.com> wrote:

> Hi,
>
> I have a question on how to use the Acero push model to write streaming
> data as hive partitioning Parquet in a single thread program. Can anyone
> guide what's the best practice here and if my below understandings are
> correct:
>
>    - I receive streaming data via a callback function which gives me data
>    row by row. To my best knowledge, Subclassing RecordBatchReader is
>    preferred?
>    - Should I batch a fixed number rows in some in memory data structure
>    first, then flush them to acero? Then how could acero know it's time to
>    push data in ReadNext
>    
> <https://arrow.apache.org/docs/cpp/api/table.html#_CPPv4N5arrow17RecordBatchReader8ReadNextEPNSt10shared_ptrI11RecordBatchEE>
>     function?
>
> I'm not clear on how to connect a callback function from streaming data
> with the Acero push model. Any suggestions will be appreciated.
>
>
> Thanks.
>
> Best,
> Haocheng
>
