Github user cestella commented on the issue: https://github.com/apache/metron/pull/624

This does look good. A couple of observations, in no particular order of importance; I just wanted to get this out there for discussion.

# Considering the Overhead

For a moment, I want to consider the overhead *not* spent in our tests. In the last run, I count the following timings:
* build - 5:41
* unit tests - 2:59
* integration tests - 14:44
* metron-config - 2:17
* verify licenses - 0:16

That's 25:57 out of a total Travis run of 31:53, which leaves 5:56 of overhead. We should factor that in.

# Where to Focus

## Build Time

The natural conclusion is to focus on the long pole, the integration tests, but we may be well served to also consider the build time. Our build takes a long time, and we depend upon parallelization to make the build return in a sensible time (the user time for the build is 26 minutes!). Furthermore, our build is extremely IO-heavy due to the shading that we (necessarily) do. Since we are on a system shared with the rest of the Apache projects, I think reducing the IO burden of our build is worth pursuing.

While I think that shading is important, we have a very ham-fisted way of doing it. We shade for two reasons:
* relocation of dependencies that conflict
* creating uber jars for Storm

One issue is that, if we consider the tree of projects induced by their dependencies, we shade non-leaf projects purely for the purpose of relocation. I propose we stop doing that. Take, for instance, `metron-common`. We shade that project to relocate guava and beanutils. The consequence of relocating 2 packages is 47M of dependencies. Those 47M of dependencies also get bundled again into all of the leaf projects (e.g. `metron-parsers`, etc.), thus shading twice.
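For concreteness, relocation-style shading looks roughly like the `maven-shade-plugin` configuration below. This is a hedged sketch only; the actual `metron-common` pom may use different includes, relocation patterns, and shaded package names:

```xml
<!-- Sketch: relocation-style shading as described above.
     Patterns and shadedPattern names here are illustrative,
     not copied from the actual metron-common pom. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.google.common</pattern>
        <shadedPattern>org.apache.metron.guava</shadedPattern>
      </relocation>
      <relocation>
        <pattern>org.apache.commons.beanutils</pattern>
        <shadedPattern>org.apache.metron.beanutils</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```

The cost described above comes from the shade plugin rebuilding a full uber jar just to rewrite those two packages, which is why every relocation drags the whole dependency tree along with it.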
I propose fixing this in one of two ways:
* aggressively exclude ALL dependencies other than `org.apache.metron` and the relocated dependencies in any project that needs shading purely for relocation
* extract the shaded/relocated dependencies into a separate project and make all of our non-leaf dependencies non-shaded

I think the first may be the easiest to achieve and the most surgical. Ultimately, it may even be advantageous to have a single jar created as the deployable output of our process (or maybe a small handful representing the independent subcomponents: `metron`, `MaaS`, and `stellar`).

## Integration Tests

Obviously the integration tests are the long pole in the tent. A couple of thoughts on these:

### `TaxiiIntegrationTest`

My impression is that it is slow because parsing taxii via the MITRE library is downright arduous. It costs us ~2:30 as of the working build above. We are passing in a relatively large blob of taxii data; we should consider trimming the taxii example data down to something more manageable and see if that drops the timing.

### `PcapIntegrationTest`

We currently test two modes in `PcapIntegrationTest`: pulling the timestamp from the key and pulling the timestamp from the message itself. We know that in production we only want to support pulling the timestamp from the key. We might cut this test's time in half by only testing the supported approach (it's 81 seconds as of last count).

### Parser Integration Tests

We might want to reconsider what we integration test here. We currently have an integration test for every parser, and we may get the same coverage by mocking out the `ParserWriterBolt` and constructing a shim to pass data in, execute against the real parser bolt, capture the data written, and evaluate the output. This would drop the overhead for each parser test dramatically (no Storm or Kafka) and would keep the semantics of the tests.
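The shim idea above can be sketched with hypothetical stand-in types; `Parser`, `CapturingCollector`, and `csvParser` below are invented for illustration, and a real test would drive the actual parser bolt and capture what it hands to the writer:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the shim approach: hypothetical stand-ins, not Metron classes.
public class ParserShimSketch {
    // Stand-in for a Metron parser: one raw message in, parsed fields out.
    interface Parser {
        List<String> parse(String raw);
    }

    // Captures everything the "bolt" would have written downstream,
    // so assertions run without any Storm or Kafka infrastructure.
    static class CapturingCollector {
        final List<List<String>> emitted = new ArrayList<>();
        void emit(List<String> fields) { emitted.add(fields); }
    }

    // The shim: push raw data through the parser, capture the output.
    static void run(Parser parser, CapturingCollector collector, List<String> rawMessages) {
        for (String raw : rawMessages) {
            collector.emit(parser.parse(raw));
        }
    }

    public static void main(String[] args) {
        // A toy CSV "parser" standing in for a real Metron parser.
        Parser csvParser = raw -> List.of(raw.split(","));
        CapturingCollector collector = new CapturingCollector();
        run(csvParser, collector, List.of("a,b,c", "1,2,3"));
        // Evaluate the captured output in place of a full topology run.
        System.out.println("captured " + collector.emitted.size() + " messages");
    }
}
```

Each parser test would then assert directly on `emitted`, keeping the input/output semantics of the current integration tests while skipping cluster startup.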
Admittedly, this may not be a focus in terms of bang-for-buck, because the total parser test cost is around 86 seconds.

# Reuse Integration Test Infrastructure

This seems to be the persistent conversation whenever our tests start to push us over the edge. We incur quite a bit of overhead because we spin up and tear down integration test infrastructure in our `InMemoryComponent`s. We could consider correcting this in a couple of ways:
* Reusing the infrastructure
  * Either use the in-memory components or spin up lightweight versions of the infrastructure and run the tests against that (i.e. Docker or separate-process versions of the in-memory components).
  * We'd need to refactor each integration test to clean up after itself so other tests are not splashed.
* Parallelizing the integration tests
  * Have the `InMemoryComponent`s be able to run in parallel.
  * This would require refactoring the components to seek out open ports and use them.

These are just my thoughts; I wanted to get them out there and capture them for review and comment.
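The open-port-seeking refactor mentioned above can be done with the standard library alone; binding a `ServerSocket` to port 0 asks the OS for a free ephemeral port. The `findOpenPort` helper here is a hypothetical sketch, not existing Metron code:

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortFinder {
    // Ask the OS for a free ephemeral port by binding to port 0.
    // Each InMemoryComponent could call this instead of hardcoding a port,
    // allowing integration tests to run in parallel without collisions.
    static int findOpenPort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
            return socket.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findOpenPort();
        System.out.println("component could bind to port " + port);
    }
}
```

One caveat with probe-then-bind: there is a small race between closing the probe socket and the component binding the port, so a cleaner refactor would have each component bind port 0 itself and report back the port it actually got.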