[ https://issues.apache.org/jira/browse/BEAM-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086382#comment-16086382 ]
ASF GitHub Bot commented on BEAM-1799: -------------------------------------- Github user asfgit closed the pull request at: https://github.com/apache/beam/pull/2507 > IO ITs: simplify data loading design pattern > -------------------------------------------- > > Key: BEAM-1799 > URL: https://issues.apache.org/jira/browse/BEAM-1799 > Project: Beam > Issue Type: Improvement > Components: sdk-java-extensions > Reporter: Stephen Sisk > Assignee: Stephen Sisk > Fix For: 2.0.0 > > > Problems with the current solution > ============================= > * The IO IT data loading guidelines [1] are complicated & aren't "native > junit" - you end up working around junit rather than working with it (I was a > part of defining them[0], so I critique the rules with (heart) ) > * Doing data loading using external tools means we have additional > dependencies outside of the tests themselves. If we *must* use them, it's > worth the time, but I think we have another option. I find it especially > amusing since the data loading tools are things like ycsb which themselves > are benchmarking tools ("I heard you like performance benchmarking, so here's > a performance benchmarking tool to use before you use your performance > benchmarking tool"), and really are just solving the problem of "I want to > write data in parallel to this data store" - that sounds familiar :) > The current guidelines also don't scale well to performance tests: > * We want to write medium sized data for perf tests - doing data loading > using external tools means a minimum of 2 reads & writes. For the small scale > ITs, that's not a big deal, but for the large scale tests, if we assume we're > working with a fixed budget, more data transferred/stored ~= fewer tests. > * If you want to verify that large data sets are correct (or create them), > you need to actually read and write those large data sets - currently, the > plan is that data loading/testing infrastructure only runs on one machine, so > those operations are going to be slow. We aren't working with actual large > data sets, so it won't take too long, but it's always nice to have faster > tests. > New Proposed Solution > =================== > Instead of trying to test read and write separately, the test should be a > "write, then read back what you just wrote", all using the IO under test. To > support scenarios like "I want to run my read test repeatedly without > re-writing the data", tests would add flags for "skipCleanUp" and > "useExistingData". > Check out the example I wrote up [2] > I didn't want to invest much time on this before I opened a Jira/talked to > others, so I plan on expanding on this a bit more/formalizing it in the > testing docs. > A reminder of some context: > * The goals for the ITs & Perf tests are that they are *not* intended to be > the place where we exercise specific scenarios. Instead, they are tripwires > designed to find problems with code *we already believe works* (as proven by > the unit tests) when it runs against real data store instances/runners using > multiple nodes of both. > There are some definite disadvantages: > * There is a class of bugs that you can miss doing this. (namely: "I mangled > the data on the way into the data store, and then reverse-mangled it again on > the way back out so it looks fine, even though it is bad in the db") I assume > that many of us have tested storage code in the past, and so we've thought > about this trade-off. In this particular environment, where it's > expensive/tricky to do independent testing of the storage code, I think this > is the right trade off. > * The data loading scripts cannot be re-used between languages. I think this > will be a pretty small relative cost compared to the cost of writing the IO > in multiple languages, so it shouldn't matter too much. I think we'll save > more time in not needing to use external tools for loading data. > * Read-only or write-only data stores - in this case, we'll either need to > default to the old plan, or implement data loading or verification using beam > * This assumes the data store support parallelism - in the case where the > read or write cannot be split, we probably should limit the amount of data we > process in the tests to what we can reasonably do on a single worker anyway. > * It's harder to debug when this fails - I agree, and part of what I hope to > invest a little time in as I go forward is to make it easier to determine > what the actual failure is. Presumably folks debugging a particular IO's > failures have tools to look at that IO and will be able to quickly determine > if it's failing on the read or write. > * As with the previously before accepted proposal, we are relying on junit's > afterClass to do cleanups. I don't have a good answer for this - if it proves > to be a problem, we can investigate. > * This focuses the test exclusively on reading and writing. To address this, > if we wanted to write other types of tests, they could either piggy back off > the writeThenRead test, or it might be that they should be restricted to > smaller data sets and they should be tested independently from this test and > simply write their own data to the data store. > There are some really nice advantages: > * The test ends up being pretty simple and elegant. > * We have no external dependencies > * Read and write occurs the bare minimum number of times > * I believe we'll be able to create shared PTransforms for generating test > data & validating test data. > [0] Past discussion of IT guidelines - > https://lists.apache.org/thread.html/a8ea2507aee4a849cbb6cd7f3ae23fc8b47d447bd553fa01d6da6348@%3Cdev.beam.apache.org%3E > [1] Current data loading for IT guidelines - > https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m > [2] Example of writeThenRead test - > https://github.com/ssisk/beam/blob/jdbc-it-perf/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOIT.java#L147 -- This message was sent by Atlassian JIRA (v6.4.14#64029)