Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-07-08 Thread Jayjeet Chakraborty
Thanks for the perf experiments, Weston! On 2021/06/14 20:24:07, Weston Pace wrote: > Returning to the main thread... > > From: jayjeetchakrabort...@gmail.com > > > Hi Wes, Gosh, Weston, > > > > Sorry if you are receiving this message redundantly, but I tried sending > > this message via pony…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-14 Thread Weston Pace
Returning to the main thread... From: jayjeetchakrabort...@gmail.com > Hi Wes, Gosh, Weston, > > Sorry if you are receiving this message redundantly, but I tried sending this > message via ponymail twice, but the message didn't go through for some > reason. But anyway, thanks a lot for the v…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Gosh Arzumanyan
This might help to get the size of the output buffer upfront: https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96 Though with "standard" allocators there is a risk of running into KiPageFaults when going for buffers over 1 MB. This might be es…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Wes McKinney
From this, it seems like seeding the RecordBatchStreamWriter's output stream with a much larger preallocated buffer would improve performance (depending on the allocator used, of course). On Thu, Jun 10, 2021 at 5:40 PM Weston Pace wrote: > > Just for some reference times from my system I created a…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Weston Pace
Just for some reference times from my system, I created a quick test to dump a ~1.7GB table to buffer(s). Going to many buffers (just collecting the buffers): ~11,000ns. Going to one preallocated buffer: ~160,000,000ns. Going to one dynamically allocated buffer (using a grow factor of 2x): ~2,000,000…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Wes McKinney
To be clear, we would like to help make this faster. I don't recall much effort being invested in optimizing this code path in the last couple of years, so there may be some low-hanging fruit to improve the performance. Changing the in-memory data layout (the chunking) is one of the most likely thi…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Gosh Arzumanyan
Hi Jayjeet, I wonder if you really need to serialize the whole table into a single buffer, as you will end up with twice the memory, while you could be sending chunks as they are generated by the RecordBatchStreamWriter. Also, is the buffer resized beforehand? I'd suspect there might be reallocations…

Re: Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Wes McKinney
Hi Jayjeet — have you run perf to see where those 1000 ms are being spent? How many arrays (the sum of the number of chunks across all columns) are there in total? I would guess that the problem is all the little Buffer memcopies. I don't think that the C Interface is going to help you. - Wes On T…

Discuss a very fast way to serialize a large in-memory Arrow IPC table to a void* buffer for sending over the network

2021-06-10 Thread Jayjeet Chakraborty
Hello Arrow Community, I am a student working on a project where I need to serialize an in-memory Arrow Table of around 700 MB to a uint8_t* buffer. I am currently using the arrow::ipc::RecordBatchStreamWriter API to serialize the table to an arrow::Buffer, but it is taking nearly 1000 ms to…