Thanks for the perf experiments, Weston!
On 2021/06/14 20:24:07, Weston Pace <weston.p...@gmail.com> wrote:
> Returning to the main thread...
>
> From: jayjeetchakrabort...@gmail.com
> > Hi Wes, Gosh, Weston,
> >
> > Sorry if you are receiving this message redundantly, but I tried sending
> > it via ponymail twice and it didn't go through for some reason. Anyway,
> > thanks a lot for the valuable discussion. I experimented a little with
> > the pre-allocation strategy on my end; although it worked, I was not
> > able to reproduce the results that Weston got. I used a 1.3+ GB Table
> > (containing NYC taxi data). I serialized it to an arrow::Buffer once
> > dynamically and once using a 1.4 GB preallocated
> > arrow::BufferOutputStream. I use the CMake flags here [2] to build Arrow
> > from source.
> >
> > My results are as follows:
> >
> > pre-allocated (1400 * 1024 * 1024 bytes): ~860 ms
> > dynamically allocated (starting from 4096 bytes): ~2300 ms
> >
> > I saw a >12x improvement in Weston's experiments, and since I am only
> > getting a ~3x improvement, I am wondering what I am doing wrong on my
> > end. I am sharing my benchmark code here [1]. It would be great if
> > someone could take a look at it (mainly the Serialize function). Looking
> > forward to hearing back from you. Thanks again.
> >
> > Best regards,
> > Jayjeet Chakraborty
>
> From: gosh...@gmail.com
> > Hi Jayjeet,
> >
> > Abstracting from the real data shape, the code looks reasonable to me at
> > first glance from the point of view of Arrow. However, I'd expect that
> > kind of measurement to be done in a loop (as in Google Benchmark etc.).
> > Also, as mentioned on the main mail thread (at least I heard no concerns
> > about that :) ), you might be observing some lazy page warm-up effects.
> >
> > At this point I'd suggest profiling that code, as the stack trace will
> > probably tell you much better where the problem is.
> >
> > Also, we should probably return to the main thread for the public record
> > and expanded discussion.
> >
> > Cheers,
> > Gosh
>
> Lazy page warm-up effects are exactly what I suspect the culprit to be.
> Keep in mind that these can be hard to measure, as they don't happen at
> allocation time but when you first write to the memory (only at that
> point does Linux allocate some real RAM). As a fun experiment, running
> your program in a loop, I get this perf output (measuring cycles) from
> the first iteration...
>
> 48.98%  serialize  libc-2.31.so       [.] __memmove_sse2_unaligned_erms
>  9.14%  serialize  [kernel.kallsyms]  [k] clear_page_erms
>  5.60%  serialize  [kernel.kallsyms]  [k] native_irq_return_iret
>  3.91%  serialize  libc-2.31.so       [.] __memset_avx2_erms
>  3.90%  serialize  [kernel.kallsyms]  [k] sync_regs
>  1.98%  serialize  [kernel.kallsyms]  [k] rmqueue
>  1.80%  serialize  [kernel.kallsyms]  [k] __pagevec_lru_add_fn
>  1.41%  serialize  [kernel.kallsyms]  [k] handle_mm_fault
>  1.34%  serialize  [kernel.kallsyms]  [k] __handle_mm_fault
>  1.32%  serialize  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
>  1.26%  serialize  [kernel.kallsyms]  [k] try_charge
>  0.99%  serialize  [kernel.kallsyms]  [k] do_anonymous_page
>
> __memmove_sse2_unaligned_erms is inside of memcpy and is the actual
> workhorse in this example. Notice that it is only taking about 50% of the
> available cycles. The remaining time is spent in various functions that
> appear to be page allocation (page fault, assign page, clear page). Also,
> this first iteration takes ~3 billion total cycles.
>
> On the third iteration...
>
> 98.40%  serialize  libc-2.31.so        [.] __memmove_sse2_unaligned_erms
>  0.44%  :79205     [kernel.kallsyms]   [k] __mod_zone_page_state
>  0.44%  :79205     [kernel.kallsyms]   [k] zap_pte_range.isra.0
>  0.40%  serialize  libc-2.31.so        [.] __memset_avx2_erms
>  0.17%  serialize  libarrow.so.500.0.0 [.] arrow::StringArray::~StringArray
>  0.16%  serialize  [kernel.kallsyms]   [k] cpuacct_account_field
>
> I only get about 700 million total cycles, and the time is dominated by
> memcpy.
>
> On Thu, Jun 10, 2021 at 1:21 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
> >
> > This might help to get the size of the output buffer upfront:
> > https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96
> >
> > Though with "standard" allocators there is a risk of running into
> > KiPageFaults when going for buffers over 1 MB. This might be especially
> > painful in a multithreaded environment.
> >
> > A custom output stream with a configurable buffering parameter might
> > help to overcome that problem without dealing too much with the
> > allocators. Curious to hear community thoughts on this.
> >
> > Cheers,
> > Gosh
> >
> > On Fri., 11 Jun. 2021, 00:45 Wes McKinney, <wesmck...@gmail.com> wrote:
> > >
> > > From this, it seems like seeding the RecordBatchStreamWriter's output
> > > stream with a much larger preallocated buffer would improve
> > > performance (depending on the allocator used, of course).
> > >
> > > On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > Just for some reference times from my system, I created a quick
> > > > test to dump a ~1.7 GB table to buffer(s).
> > > >
> > > > Going to many buffers (just collecting the buffers): ~11,000 ns
> > > > Going to one preallocated buffer: ~160,000,000 ns
> > > > Going to one dynamically allocated buffer (grow factor of 2x):
> > > > ~2,000,000,000 ns
> > > >
> > > > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > To be clear, we would like to help make this faster. I don't
> > > > > recall much effort being invested in optimizing this code path in
> > > > > the last couple of years, so there may be some low-hanging fruit
> > > > > to improve the performance.
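[Editor's note: the "seed the output stream with a preallocated buffer" idea above might look roughly like the following Arrow C++ sketch. It is untested; the 1.4 GB capacity is an assumption sized to the table discussed in this thread, and error handling is delegated to Arrow's ARROW_* macros.]

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Sketch: serialize a table into a BufferOutputStream seeded with a large
// preallocated buffer, so the stream does not repeatedly grow and recopy.
arrow::Result<std::shared_ptr<arrow::Buffer>> SerializePreallocated(
    const std::shared_ptr<arrow::Table>& table) {
  // Assumed capacity; should be >= the expected serialized size.
  const int64_t initial_capacity = 1400LL * 1024 * 1024;
  ARROW_ASSIGN_OR_RAISE(
      auto stream, arrow::io::BufferOutputStream::Create(initial_capacity));
  // Note: even a preallocated buffer pays first-touch page faults on its
  // first write; a throwaway warm-up run separates that cost when measuring.
  ARROW_ASSIGN_OR_RAISE(
      auto writer, arrow::ipc::MakeStreamWriter(stream, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return stream->Finish();  // hands back the (possibly oversized) buffer
}
```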
> > > > > Changing the in-memory data layout (the chunking) is one of the
> > > > > most likely things to help.
> > > > >
> > > > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Jayjeet,
> > > > > >
> > > > > > I wonder if you really need to serialize the whole table into a
> > > > > > single buffer, as you will end up with twice the memory, while
> > > > > > you could be sending chunks as they are generated by the
> > > > > > RecordBatchStreamWriter. Also, is the buffer resized beforehand?
> > > > > > I'd suspect there might be relocations happening under the hood.
> > > > > >
> > > > > > Cheers,
> > > > > > Gosh
> > > > > >
> > > > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > > hi Jayjeet — have you profiled to see where those 1000 ms are
> > > > > > > being spent? How many arrays (the sum of the number of chunks
> > > > > > > across all columns) are there in total? I would guess that the
> > > > > > > problem is all the little Buffer memcopies. I don't think that
> > > > > > > the C Interface is going to help you.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > > > > <jayjeetchakrabort...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello Arrow Community,
> > > > > > > >
> > > > > > > > I am a student working on a project where I need to
> > > > > > > > serialize an in-memory Arrow Table of around 700 MB to a
> > > > > > > > uint8_t* buffer. I am currently using the
> > > > > > > > arrow::ipc::RecordBatchStreamWriter API to serialize the
> > > > > > > > table to an arrow::Buffer, but it is taking nearly 1000 ms
> > > > > > > > to serialize the whole table, and that is harming the
> > > > > > > > performance of my performance-critical application. I
> > > > > > > > basically want to get hold of the underlying memory of the
> > > > > > > > table as bytes and send it over the network. How do you
> > > > > > > > suggest I tackle this problem? I was thinking of using the
> > > > > > > > C Data Interface for this, so that I convert my
> > > > > > > > arrow::Table to ArrowArray and ArrowSchema structs and
> > > > > > > > serialize those to send over the network, but it seems like
> > > > > > > > serializing the structs is another complex problem on its
> > > > > > > > own. It would be great to have some suggestions on this.
> > > > > > > > Thanks a lot.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jayjeet