mrocklin commented on issue #38389: URL: https://github.com/apache/arrow/issues/38389#issuecomment-1823066197
@eeroel that's an interesting issue. Thank you for sharing and fixing it. The chart you shared makes the problem super clear. I'll be curious to see how it impacts observed performance.

I'm still curious about the high CPU utilization. I was chatting separately with Wes, and he mentioned the following (I don't think I'm sharing anything private):

> I would guess that collecting C++ profile data via perf/flamegraph would tell an interesting story about what's taking up the most time in the Parquet deserialization (you generally need to recompile things with -fno-omit-frame-pointer to get reasonable-looking profiles). Snappy is a common culprit

Maybe I should try with non-snappy-compressed data and see how things behave. That would probably help to bisect the problem space.

To summarize, a few possible directions to investigate came out of this discussion:

1. Try again after #38591
2. Try without snappy compression
3. Collect C++ profile data with perf/flamegraph
4. Look at changing the download chunk size for object-store filesystems from 2 MB to something larger, like 5 MB or 10 MB (I think I recall someone saying 2 MB was the default, but that it was probably chosen for local POSIX filesystems)

No obligation, of course, for anyone to do this work. This comment is as much for the people I work with as it is for Arrow maintainers, if they're interested (I hope you are!).

For convenience, [the notebook I was using above](https://gist.github.com/mrocklin/c1fd89575b40c055a9be77b2a47894df).

cc @fjetter
