westonpace commented on issue #36090:
URL: https://github.com/apache/arrow/issues/36090#issuecomment-1595586144
I investigated this today and I think I figured out the problem. When we
split out acero into it's own module we ended up introducing
`test_util_internal.cc` I believe this had originally been a part of the arrow
testing library but I didn't dig too much into where it came from.
To make it available to all the tests we created an `add_arrow_acero_test`
function which had...
```
add_arrow_test(${REL_TEST_NAME}
EXTRA_LINK_LIBS
${ARROW_ACERO_TEST_LINK_LIBS}
PREFIX
${PREFIX}
SOURCES
test_util_internal.cc
LABELS
${LABELS}
${ARG_UNPARSED_ARGUMENTS})
```
We also added `test_util_internal.cc` to all of the dataset tests in a
similar fashion.
Unfortunately, adding this common utility in this way actually means that
the file gets compiled again for every single test. Since this file is fairly
complex it ended up adding quite a bit to the overall compilation time.
The fix is straightforward but I may need some cmake help to figure out the
best way to do it in our existing infrastructure. I'd like to create a small
test library for acero that contains `src/arrow/acero/dummy_nodes.cc` and
`src/arrow/acero/test_util_internal.cc`. Also, I'd like to create a small test
library for datasets that contains `src/arrow/acero/test_util_internal.cc` and
`src/arrow/dataset/test_util_internal.cc`.
I don't believe we want to export these libraries (like we do with
arrow_testing). I think they can just be small static archives that get added
as dependencies to all of the tests.
As a hack to test this out I added `test_util_internal.cc` to the list of
acero sources. That appears to restore the build time to what I got in 11.0.0.
I did attempt to use ClangBuildAnalyzer to further reduce compile times. I
didn't find any clear wins. However, I was able to tune precompiled header
files to drop build times by an additional 35%. In other words, on my system:
Current Build Time: 2100
After fixing duplicate test files: 1900
With tuned precompiled headers: 1300
The files I needed to add to the precompiled headers were:
* datum.h - Since Datum is generally passed around by value this is an
inevitable include for anything remotely involved with compute.
* scalar.h - This is another header that was fairly ubiquitous. However,
we may be able to reduce this one with better forward declarations since
scalars are typically passed around by reference/pointer.
* compute/exec.h - This is needed for ExecBatch. Similar to Datum I think
this one is inevitable because ExecBatch is usually passed by value
* compute/expression.h - Same as the Datum and ExecBatch as Expression is
typically passed by value
* compute/function.h
* compute/kernel.h
Adding datum/exec/expression to the pch for acero and datasets may be
sufficient. I don't think we need it in the top-level pch.
These last two are not actually included excessively. However, they are
needed by all of the compute kernel files and they are very expensive includes
(kernel.h in particular). We could probably leave them out or, if we really
want them, find a way to make a dedicated pch file for just compute kernels.
For the testing pch I needed to add the gmock headers. gtest/gtest.h isn't
actually too bad but the gmock includes are very expensive.
Action items:
- [ ] Figure out how to make internal "common testing" libraries for acero
& dataset (maybe @kou can help?)
- [ ] See if we can reduce the number of times we include scalar.h
(@westonpace)
- [ ] Get `kernel.h` out of the compute api_xyz.h files (these only need
FunctionOptions which we can put in its own file) @westonpace
- [ ] Consider enabling precompiled headers on builds that don't seem to be
getting much benefit from ccache (I can look into this but I don't know much
about how well ccache is or isn't working on these CI jobs so I'd appreciate
any input)
I've prototyped some of these tasks in
https://github.com/westonpace/arrow/tree/experiment/build-times
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]