[GitHub] [arrow] jonkeane commented on pull request #9615: ARROW-3316: [R] Multi-threaded conversion from R data.frame to Arrow table / record batch

GitBox Fri, 14 May 2021 06:45:17 -0700


jonkeane commented on pull request #9615:
URL: https://github.com/apache/arrow/pull/9615#issuecomment-841253893

Ok, I've run these benchmarks again, and we're seeing a massive improvement
across the board (with one small exception). All of the simple types are
~50-75% faster.

The naturalistic datasets are where this really shines, though: those are
85-90% faster.

This is fantastic!

# 🎊🚀🤯🚀🎊

There is one oddity (that I will dig into in a second) in the single-core
tests: those *also* are faster (though this might be limited to the strings
case). The (relevant) way the benchmarks [set single versus multi
core](https://github.com/ursacomputing/arrowbench/blob/c6453c8ef18e1f6c03b1e1542149e8adf6bbde95/R/run.R#L206-L213)
is by calling `arrow:::SetCpuThreadPoolCapacity()` (the other values in there,
like the `Ncpus` option, are intended to catch other libraries and shouldn't
conflict, but I've highlighted them anyway). This is different from
`option_use_threads`, though I would have expected setting the thread pool
capacity to one to give performance similar to before. Like I said above,
I'll dig into this and see whether it's actually using more than one core or
whether these optimizations _also_ optimized the single-core case (though I'm a
bit skeptical they could have optimized it *this much*).
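
For anyone who wants to poke at the two settings separately, here is a minimal sketch (not the arrowbench code). It assumes the arrow R package is installed and uses the exported `set_cpu_count()` / `cpu_count()` helpers, which to my understanding adjust the same thread pool capacity as the internal `arrow:::SetCpuThreadPoolCapacity()`, plus the `arrow.use_threads` option, which I believe is what `option_use_threads` reads. The toy data frame and variable names are made up for illustration.

```r
# Sketch only: assumes the arrow R package is installed.
library(arrow)

# Hypothetical toy data; the real benchmarks use much larger datasets.
df <- data.frame(x = 1:5, y = letters[1:5], stringsAsFactors = FALSE)

# Knob 1: capacity of Arrow's CPU thread pool.
# (Assumption: set_cpu_count() changes the same setting the benchmarks
# change through arrow:::SetCpuThreadPoolCapacity().)
old_threads <- cpu_count()
set_cpu_count(1L)
tab_pool <- Table$create(df)   # conversion with a one-thread pool
set_cpu_count(old_threads)     # restore the previous capacity

# Knob 2: the R-level option asking arrow not to use threads at all.
# (Assumption: this is the option that option_use_threads consults.)
old_opts <- options(arrow.use_threads = FALSE)
tab_opt <- Table$create(df)    # conversion with threading switched off
options(old_opts)              # restore the previous option value
```

Timing `Table$create()` under each setting (and watching CPU usage while it runs) would be one way to tell whether the single-core runs really stay on one core.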
