Thanks for the perf experiments, Weston!
On 2021/06/14 20:24:07, Weston Pace <weston.p...@gmail.com> wrote:
> Returning to the main thread...
>
> From: jayjeetchakrabort...@gmail.com
> > Hi Wes, Gosh, Weston,
> >
> > Sorry if you are receiving this message redundantly, but I tried sending
> > it via ponymail twice and it didn't go through for some reason. Anyway,
> > thanks a lot for the valuable discussion. I experimented a little with
> > the pre-allocation strategy on my end; although it worked, I was not
> > able to reproduce the results that Weston got. I used a 1.3+ GB Table
> > (containing NYC taxi data). I serialized it to an arrow::Buffer once
> > dynamically and once using a 1.4 GB preallocated
> > arrow::BufferOutputStream. I use the CMake flags here [2] to build Arrow
> > from source.
> >
> > My results are as follows:
> >
> > pre-allocated (1400 * 1024 * 1024 bytes): ~860 ms
> > dynamically allocated (starting from 4096 bytes): ~2300 ms
> >
> > I saw a >12x improvement in Weston's experiments, and since I am only
> > getting a ~3x improvement, I am wondering what I am doing wrong on my
> > end. I am sharing my benchmark code here [1]. It would be great if
> > someone could take a look at it (mainly the Serialize function). Looking
> > forward to hearing back from you. Thanks again.
> >
> > Best regards,
> > Jayjeet Chakraborty
>
> From: gosh...@gmail.com
> > Hi Jayjeet,
> >
> > Abstracting from the real data shape, the code looks reasonable to me at
> > first glance from the point of view of Arrow. However, I'd expect that
> > kind of measurement to be done in a loop (as in Google Benchmark etc.).
> > Also, as mentioned on the main mail thread (at least I heard no concerns
> > about that :) ), you might be observing some lazy page warm-up effects.
> >
> > At this point I'd suggest profiling that code, as the stack trace will
> > probably tell you much better where the problem is.
> >
> > Also, we should probably return to the main thread for the public record
> > and expanded discussion.
> >
> > Cheers,
> > Gosh
>
> Lazy page warm-up effects are exactly what I suspect the culprit to be.
> Keep in mind that these can be hard to measure, as they don't happen at
> allocation time but when you first write to the memory (only at that
> point does Linux allocate some real RAM). As a fun experiment, running
> your program in a loop, I get this perf output (measuring cycles) from
> the first iteration...
>
> 48.98%  serialize  libc-2.31.so       [.] __memmove_sse2_unaligned_erms
>  9.14%  serialize  [kernel.kallsyms]  [k] clear_page_erms
>  5.60%  serialize  [kernel.kallsyms]  [k] native_irq_return_iret
>  3.91%  serialize  libc-2.31.so       [.] __memset_avx2_erms
>  3.90%  serialize  [kernel.kallsyms]  [k] sync_regs
>  1.98%  serialize  [kernel.kallsyms]  [k] rmqueue
>  1.80%  serialize  [kernel.kallsyms]  [k] __pagevec_lru_add_fn
>  1.41%  serialize  [kernel.kallsyms]  [k] handle_mm_fault
>  1.34%  serialize  [kernel.kallsyms]  [k] __handle_mm_fault
>  1.32%  serialize  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
>  1.26%  serialize  [kernel.kallsyms]  [k] try_charge
>  0.99%  serialize  [kernel.kallsyms]  [k] do_anonymous_page
>
> __memmove_sse2_unaligned_erms is inside of memcpy and is the actual
> workhorse in this example. Notice that it is only taking about 50% of the
> available cycles. The remaining time is spent in various functions that
> appear to be page allocation (page fault, assign page, clear page). Also,
> this first iteration takes ~3 billion total cycles.
>
> On the third iteration...
>
> 98.40%  serialize  libc-2.31.so        [.] __memmove_sse2_unaligned_erms
>  0.44%  :79205     [kernel.kallsyms]   [k] __mod_zone_page_state
>  0.44%  :79205     [kernel.kallsyms]   [k] zap_pte_range.isra.0
>  0.40%  serialize  libc-2.31.so        [.] __memset_avx2_erms
>  0.17%  serialize  libarrow.so.500.0.0 [.] arrow::StringArray::~StringArray
>  0.16%  serialize  [kernel.kallsyms]   [k] cpuacct_account_field
>
> I only get about 700 million total cycles, and the time is dominated by
> memcpy.
>
> On Thu, Jun 10, 2021 at 1:21 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
> >
> > This might help to get the size of the output buffer upfront:
> > https://github.com/apache/arrow/blob/1830d1558be8741e7412f6af30582ff457f0f34f/cpp/src/arrow/io/memory.h#L96
> >
> > Though with "standard" allocators there is a risk of running into
> > KiPageFaults when going for buffers over 1 MB. This might be especially
> > painful in a multithreaded environment.
> >
> > A custom output stream with a configurable buffering parameter might
> > help to overcome that problem without dealing too much with the
> > allocators. Curious to hear community thoughts on this.
> >
> > Cheers,
> > Gosh
> >
> > On Fri., 11 Jun. 2021, 00:45 Wes McKinney, <wesmck...@gmail.com> wrote:
> > >
> > > From this, it seems like seeding the RecordBatchStreamWriter's output
> > > stream with a much larger preallocated buffer would improve
> > > performance (depending on the allocator used, of course).
> > >
> > > On Thu, Jun 10, 2021 at 5:40 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > Just for some reference times from my system, I created a quick
> > > > test to dump a ~1.7 GB table to buffer(s).
> > > >
> > > > Going to many buffers (just collecting the buffers): ~11,000 ns
> > > > Going to one preallocated buffer: ~160,000,000 ns
> > > > Going to one dynamically allocated buffer (grow factor of 2x):
> > > > ~2,000,000,000 ns
> > > >
> > > > On Thu, Jun 10, 2021 at 11:46 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > >
> > > > > To be clear, we would like to help make this faster. I don't
> > > > > recall much effort being invested in optimizing this code path in
> > > > > the last couple of years, so there may be some low-hanging fruit
> > > > > to improve the performance.
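[Editor's note: the "seed the output stream with a preallocated buffer" idea above might look roughly like the following Arrow C++ sketch. It is untested; the 1.4 GB capacity is an assumption sized to the table discussed in this thread, and error handling is delegated to Arrow's ARROW_* macros.]

```cpp
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>

// Sketch: serialize a table into a BufferOutputStream seeded with a large
// preallocated buffer, so the stream does not repeatedly grow and recopy.
arrow::Result<std::shared_ptr<arrow::Buffer>> SerializePreallocated(
    const std::shared_ptr<arrow::Table>& table) {
  // Assumed capacity; should be >= the expected serialized size.
  const int64_t initial_capacity = 1400LL * 1024 * 1024;
  ARROW_ASSIGN_OR_RAISE(
      auto stream, arrow::io::BufferOutputStream::Create(initial_capacity));
  // Note: even a preallocated buffer pays first-touch page faults on its
  // first write; a throwaway warm-up run separates that cost when measuring.
  ARROW_ASSIGN_OR_RAISE(
      auto writer, arrow::ipc::MakeStreamWriter(stream, table->schema()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return stream->Finish();  // hands back the (possibly oversized) buffer
}
```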
> > > > > Changing the in-memory data layout (the chunking) is one of the
> > > > > most likely things to help.
> > > > >
> > > > > On Thu, Jun 10, 2021 at 2:14 PM Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > > > >
> > > > > > Hi Jayjeet,
> > > > > >
> > > > > > I wonder if you really need to serialize the whole table into a
> > > > > > single buffer, as you will end up with twice the memory, while
> > > > > > you could be sending chunks as they are generated by the
> > > > > > RecordBatchStreamWriter. Also, is the buffer resized beforehand?
> > > > > > I'd suspect there might be relocations happening under the hood.
> > > > > >
> > > > > > Cheers,
> > > > > > Gosh
> > > > > >
> > > > > > On Thu., 10 Jun. 2021, 21:01 Wes McKinney, <wesmck...@gmail.com> wrote:
> > > > > > >
> > > > > > > hi Jayjeet — have you profiled to see where those 1000 ms are
> > > > > > > being spent? How many arrays (the sum of the number of chunks
> > > > > > > across all columns) are there in total? I would guess that the
> > > > > > > problem is all the little Buffer memcopies. I don't think that
> > > > > > > the C Interface is going to help you.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Thu, Jun 10, 2021 at 1:48 PM Jayjeet Chakraborty
> > > > > > > <jayjeetchakrabort...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello Arrow Community,
> > > > > > > >
> > > > > > > > I am a student working on a project where I need to
> > > > > > > > serialize an in-memory Arrow Table of around 700 MB to a
> > > > > > > > uint8_t* buffer. I am currently using the
> > > > > > > > arrow::ipc::RecordBatchStreamWriter API to serialize the
> > > > > > > > table to an arrow::Buffer, but it is taking nearly 1000 ms
> > > > > > > > to serialize the whole table, and that is harming the
> > > > > > > > performance of my performance-critical application. I
> > > > > > > > basically want to get hold of the underlying memory of the
> > > > > > > > table as bytes and send it over the network. How do you
> > > > > > > > suggest I tackle this problem? I was thinking of using the
> > > > > > > > C Data Interface for this, so that I convert my
> > > > > > > > arrow::Table to ArrowArray and ArrowSchema structs and
> > > > > > > > serialize those to send over the network, but it seems like
> > > > > > > > serializing the structs is another complex problem on its
> > > > > > > > own. It would be great to have some suggestions on this.
> > > > > > > > Thanks a lot.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jayjeet