Github user cestella commented on the issue:
https://github.com/apache/metron/pull/624
This does look good. A couple of observations in no particular order of
importance; just wanted to get this out there for discussion.
# Considering the overhead
I want to consider, for a moment, the overhead that is *not* our tests. In the
last run, I count the following timings:
* build - 5:41
* unit tests - 2:59
* integration tests - 14:44
* metron-config - 2:17
* verify licenses - 0:16
That's 25:57 out of a total Travis run of 31:53, leaving 5:56 of overhead.
We should factor that in.
# Where to Focus
## Build Time
The natural conclusion is to focus on the long pole, those integration
tests, but we may be well served to also consider the build time. Our build
takes a long time, and we depend upon parallelization to make it return in a
sensible time (the user time for the build is 26 minutes!). Furthermore, our
build is extremely IO heavy due to the shading that we (necessarily) do. Since
we are on a shared system with the rest of the Apache projects, I think
reducing the IO burden of our build is worth pursuing.
While I think that shading is important, we have a very ham-fisted way of
doing it. We shade for two reasons:
* Relocation of dependencies that conflict
* Creating uber jars for Storm
One issue is that, if we consider the tree of projects induced by their
dependencies, we shade non-leaf projects for the purpose of relocation.
I propose we stop doing that. Take, for instance, `metron-common`: we
shade that project to relocate guava and beanutils. The consequence of
relocating 2 packages is 47M of dependencies, and those 47M of dependencies
also get bundled again into all of the leaf projects (e.g. `metron-parsers`,
etc.), thus shading twice.
I propose fixing this in one of two ways:
* Aggressively exclude ALL dependencies other than `org.apache.metron` and
the relocated dependencies in any project that needs shading purely for
relocation (sketched below)
* Extract the shaded/relocated dependencies across the project into a
separate project and make all of our non-leaf dependencies non-shaded
I think the first may be the easiest to achieve and most surgical.
Ultimately, it may even be advantageous to have a single jar created as the
deployable output of our process (or maybe a small handful representing the
independent subcomponents: `metron`, `MaaS` and `stellar`).
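To make the first option concrete, here is a rough sketch of what the shade
config in, say, `metron-common`'s pom might look like. The `shadedPattern`
names are illustrative, and the include list assumes guava and beanutils are
the only relocations in that project:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <artifactSet>
      <includes>
        <!-- only our own classes plus the dependencies we actually relocate -->
        <include>org.apache.metron:*</include>
        <include>com.google.guava:guava</include>
        <include>commons-beanutils:commons-beanutils</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <!-- illustrative target package -->
        <shadedPattern>org.apache.metron.guava</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.apache.commons.beanutils</pattern>
        <shadedPattern>org.apache.metron.beanutils</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

The leaf projects would then pull in everything else transitively and shade
once, rather than re-bundling 47M per non-leaf project.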
## Integration Tests
Obviously the integration tests are the long pole in the tent. A couple of
thoughts on these:
### `TaxiiIntegrationTest`
My impression was that it is slow because parsing taxii via the MITRE
library is downright arduous; it costs us ~2:30 as of the working build above.
We are passing in a relatively large blob of taxii data, so we should consider
trimming the example data down to something more manageable and see if
that drops the timing.
### `PcapIntegrationTest`
We currently test two modes in the `PcapIntegrationTest`: pulling the
timestamp from the key and pulling it from the message itself. We know that in
production we only want to support pulling the timestamp from the key, so we
could cut this test's time roughly in half by testing only the supported
approach (it's 81 seconds as of last count).
### `Parser Integration Tests`
We might want to reconsider what we integration test here. We currently
have an integration test for every parser, but we may get the same coverage by
mocking out the `ParserWriterBolt`, constructing a shim to pass data in,
executing against the real parser bolt, capturing the data written, and
evaluating the output (see the sketch below). This would drop the overhead for
each parser test dramatically (no Storm or Kafka) while keeping the semantics
of the tests. Admittedly, this may not be the best bang-for-buck focus,
because the total parser cost is around 86 seconds.
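As a sketch of what I mean, something like the following exercises a bolt's
`execute()` against a mocked collector and tuple, with no cluster at all. The
`StubParserBolt` here is a hypothetical stand-in; in practice we'd construct
the real parser bolt under test:

```java
import static org.junit.Assert.assertEquals;
import static org.mockito.Mockito.*;

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.junit.Test;
import org.mockito.ArgumentCaptor;

public class ParserShimTest {

  /** Hypothetical stand-in; the real test would use the actual parser bolt. */
  static class StubParserBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
      this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
      // Pretend "parse"; the real bolt would invoke the actual parser here.
      String parsed = new String(tuple.getBinary(0)).trim();
      collector.emit("messages", tuple, new Values(parsed));
      collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declareStream("messages", new Fields("message"));
    }
  }

  @Test
  public void parsesWithoutStormOrKafka() {
    // Mock the Storm plumbing instead of standing up a cluster.
    OutputCollector collector = mock(OutputCollector.class);
    Tuple tuple = mock(Tuple.class);
    when(tuple.getBinary(0)).thenReturn(" raw message ".getBytes());

    StubParserBolt bolt = new StubParserBolt();
    bolt.prepare(null, null, collector);
    bolt.execute(tuple);

    // Capture what would have been written downstream and assert on it.
    ArgumentCaptor<Values> emitted = ArgumentCaptor.forClass(Values.class);
    verify(collector).emit(eq("messages"), eq(tuple), emitted.capture());
    assertEquals("raw message", emitted.getValue().get(0));
  }
}
```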
# Reuse Integration Test Infrastructure
This seems to be the persistent conversation whenever our tests start to
push us over the edge. We incur quite a bit of overhead because we spin up and
tear down integration test infrastructure in our `InMemoryComponent`s. We
could consider correcting this in a couple of ways:
* Reusing the infrastructure
  * Either use the in-memory components or spin up lightweight versions of
the infrastructure and then run the tests against that (i.e. docker or
separate-process versions of the in-memory components).
  * We'd need to refactor each integration test to clean up after itself so
other tests are not splashed.
* Parallelizing the integration tests
  * Have the `InMemoryComponent`s be able to run in parallel.
  * This would require refactoring the components to seek out open ports
and use them (a sketch follows this list).
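The port-seeking piece is small: bind to port 0 and let the OS hand back a
free ephemeral port. A minimal sketch (assuming each component would take its
port as a start-time parameter, which is an assumption on my part):

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortUtil {
  /**
   * Ask the OS for a free ephemeral port by binding to port 0. Note there is
   * an inherent race (the port could be grabbed between release and reuse),
   * so components should retry on bind failure.
   */
  public static int findOpenPort() throws IOException {
    try (ServerSocket socket = new ServerSocket(0)) {
      socket.setReuseAddress(true);
      return socket.getLocalPort();
    }
  }
}
```

Each `InMemoryComponent` would call something like `findOpenPort()` on start
rather than assuming a fixed port, which is the piece that lets tests run
side by side.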
These are just my thoughts that I wanted to get out there and capture for
review and comment.