Hi Rares, sorry for the delay in writing back. To your questions:
- Does this look good/decent? Could the API be used in more efficient
  ways in order to achieve the same goal?

This seems perfectly reasonable with the API as it is now.

- On each node side, do steps #3 and #7 copy data?

Step 3 does not copy data, but step 7 does. You could do better by
writing directly to the network protocol instead of using
io::BufferOutputStream (which accumulates data in an in-memory buffer).
It would be useful to have a buffered writer to limit the number of
writes to the socket (or whatever protocol is being used). Feel free to
open a JIRA about this.

- On the coordinator side, do steps #3.2 and #3.3 copy data?

Per ARROW-2189 -- do you get these problems if you build from source?
I'm trying to understand if this is a packaging issue or a code issue.
Step 3.2 does not copy data. Step 3.3 does write data to the
FileOutputStream.

- On the coordinator side, do I really need to read and write a record
  batch? Could I copy the buffer directly somehow?

No, you don't necessarily need to. The idea of the Message* classes in
arrow::ipc is to facilitate transporting messages while being agnostic
to their contents. This would be a useful test case to flesh out these
APIs. Could you please open some JIRAs about this? There are already
some Message-related JIRAs open, so take a look at what is there
already.

Thanks
Wes

On Mon, Feb 26, 2018 at 10:22 AM, Rares Vernica <rvern...@gmail.com> wrote:
> Hello,
>
> I am using the C++ API to serialize and centralize data over the
> network. I am wondering if I am using the API in an efficient way.
>
> I have multiple nodes and a coordinator communicating over the
> network. I do not have fine control over the network communication.
> Individual nodes write one chunk of data to the network. The
> coordinator will receive all the chunks and can loop over them.
>
> On each node I do the following (the code is here
> <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L512>):
>
> 1. Append data to Builders
> 2. Finish Builders and get Arrays
> 3. Create Record Batch from Arrays
> 4. Create Pool Buffer
> 5. Create Buffer Output Stream using Pool Buffer
> 6. Open Record Batch Stream Writer using Buffer Output Stream
> 7. Write Record Batch to writer
> 8. Write Buffer data to network
>
> On the coordinator I do the following (the code is here
> <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L1000>):
>
> 1. Open File Output Stream
> 2. Open Record Batch Stream Writer using File Output Stream
> 3. For each chunk retrieved from the network
>    1. Create Buffer Reader using data retrieved from the network (the
>       code is here
>       <https://github.com/Paradigm4/accelerated_io_tools/blob/e02aa37eb464d2eae501a36e4297adb28467f311/src/PhysicalAioSave.cpp#L1054>)
>    2. Create Record Batch Stream Reader using Buffer Reader and read
>       Record Batch (I plan to use ReadRecordBatch, but I'm having
>       issues like in ARROW-2189)
>    3. Write Record Batch
>
> A few questions:
>
> - Does this look good/decent? Could the API be used in more efficient
>   ways in order to achieve the same goal?
> - On each node side, do steps #3 and #7 copy data?
> - On the coordinator side, do steps #3.2 and #3.3 copy data?
> - On the coordinator side, do I really need to read and write a record
>   batch? Could I copy the buffer directly somehow?
>
> Thank you so much!
> Rares
>
> (Could the same Pool Buffer be reused across calls?)