mrocklin commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1823066197
@eeroel that's an interesting issue. Thank you for sharing and fixing it. The chart you shared makes the problem super clear. I'll be curious to see how it impacts observed performance.

I'm still curious about the high CPU utilization. I was chatting separately with Wes, and he mentioned the following (I don't think I'm sharing anything private):

> I would guess that collecting C++ profile data via perf/flamegraph would tell an interesting story about what's taking up the most time in the Parquet deserialization (you generally need to recompile things with -fno-omit-frame-pointer to get reasonable-looking profiles). Snappy is a common culprit

Maybe I should try with non-snappy-compressed data and see how things behave. That would probably help to bisect the problem space.

To summarize, a few possible directions to investigate came out of this discussion:

1. Try again after #38591
2. Try without snappy compression
3. Collect C++ profile data with perf/flamegraph
4. Look at changing the download chunk size for object-store filesystems from 2 MB to something larger, like 5 MB or 10 MB (I think I recall someone saying 2 MB was the default, but that it was probably chosen for local POSIX filesystems)

No obligation, of course, for anyone to do this work. This comment is as much for the people I work with as it is for Arrow maintainers, if they're interested (I hope you are!).

For convenience, [the notebook I was using above](https://gist.github.com/mrocklin/c1fd89575b40c055a9be77b2a47894df).

cc @fjetter
