Hi Rares, sorry for the delay in writing back. To your questions:
- Does this look good/decent? Could the API be used in more efficient
  ways in order to achieve the same goal?

This seems perfectly reasonable with the API as it is now.

- On each node side, do steps #3 and #7 copy data?

Step 3 does not copy data, but step 7 does. You could do better by
writing directly to the network protocol instead of using
io::BufferOutputStream (which accumulates data in an in-memory buffer).
It would be useful to have a buffered writer to limit the number of
writes to the socket (or whatever protocol is being used). Feel free to
open a JIRA about this.

- On the coordinator side, do steps #3.2 and #3.3 copy data?

Per ARROW-2189 -- do you get these problems if you build from source?
I'm trying to understand if this is a packaging issue or a code issue.
Step 3.2 does not copy data. Step 3.3 does write data to the
FileOutputStream.

- On the coordinator side, do I really need to read and write a record
  batch? Could I copy the buffer directly somehow?

No, you don't necessarily need to. The idea of the Message* classes in
arrow::ipc is to facilitate transporting messages while being agnostic
to their contents. This would be a useful test case to flesh out these
APIs. Could you please open some JIRAs about this? There are already
some Message-related JIRAs open, so take a look at what is there
already.

Thanks
Wes

On Mon, Feb 26, 2018 at 10:22 AM, Rares Vernica <rvern...@gmail.com> wrote:
> Hello,
>
> I am using the C++ API to serialize and centralize data over the
> network. I am wondering if I am using the API in an efficient way.
>
> I have multiple nodes and a coordinator communicating over the
> network. I do not have fine control over the network communication.
> Individual nodes write one chunk of data to the network. The
> coordinator will receive all the chunks and can loop over them.
>
> On each node I do the following (the code is here
> <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L512>):
>
> 1. Append data to Builders
> 2. Finish Builders and get Arrays
> 3. Create Record Batch from Arrays
> 4. Create Pool Buffer
> 5. Create Buffer Output Stream using Pool Buffer
> 6. Open Record Batch Stream Writer using Buffer Output Stream
> 7. Write Record Batch to writer
> 8. Write Buffer data to network
>
> On the coordinator I do the following (the code is here
> <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L1000>):
>
> 1. Open File Output Stream
> 2. Open Record Batch Stream Writer using File Output Stream
> 3. For each chunk retrieved from the network
>    1. Create Buffer Reader using data retrieved from the network (the
>       code is here
>       <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L1054>)
>    2. Create Record Batch Stream Reader using Buffer Reader and read
>       Record Batch (I plan to use ReadRecordBatch, but I'm having
>       issues like in ARROW-2189)
>    3. Write Record Batch
>
> A few questions:
>
> - Does this look good/decent? Could the API be used in more efficient
>   ways in order to achieve the same goal?
> - On each node side, do steps #3 and #7 copy data?
> - On the coordinator side, do steps #3.2 and #3.3 copy data?
> - On the coordinator side, do I really need to read and write a record
>   batch? Could I copy the buffer directly somehow?
>
> Thank you so much!
> Rares
>
> (Could the same Pool Buffer be reused across calls?)