[jira] [Commented] (BEAM-1799) IO ITs: simplify data loading design pattern

ASF GitHub Bot (JIRA) Thu, 13 Jul 2017 13:52:29 -0700

    [ 
https://issues.apache.org/jira/browse/BEAM-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086382#comment-16086382
 ]


ASF GitHub Bot commented on BEAM-1799:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/2507


> IO ITs: simplify data loading design pattern
> --------------------------------------------
>
>                 Key: BEAM-1799
>                 URL: https://issues.apache.org/jira/browse/BEAM-1799
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-java-extensions
>            Reporter: Stephen Sisk
>            Assignee: Stephen Sisk
>             Fix For: 2.0.0
>
>
> Problems with the current solution
> =============================
> * The IO IT data loading guidelines [1] are complicated & aren't "native 
> junit" - you end up working around junit rather than working with it (I was a 
> part of defining them[0], so I critique the rules with (heart) )
> * Doing data loading using external tools means we have additional 
> dependencies outside of the tests themselves. If we *must* use them, it's 
> worth the time, but I think we have another option. I find it especially 
> amusing since the data loading tools are things like ycsb which themselves 
> are benchmarking tools ("I heard you like performance benchmarking, so here's 
> a performance benchmarking tool to use before you use your performance 
> benchmarking tool"), and really are just solving the problem of "I want to 
> write data in parallel to this data store" - that sounds familiar :) 
> The current guidelines also don't scale well to performance tests:
> * We want to write medium sized data for perf tests - doing data loading 
> using external tools means a minimum of 2 reads & writes. For the small scale 
> ITs, that's not a big deal, but for the large scale tests, if we assume we're 
> working with a fixed budget, more data transferred/stored ~= fewer tests.
> * If you want to verify that large data sets are correct (or create them), 
> you need to actually read and write those large data sets - currently, the 
> plan is that data loading/testing infrastructure only runs on one machine, so 
> those operations are going to be slow. We aren't working with actual large 
> data sets, so it won't take too long, but it's always nice to have faster 
> tests.
> New Proposed Solution
> ===================
> Instead of trying to test read and write separately, the test should be a 
> "write, then read back what you just wrote", all using the IO under test. To 
> support scenarios like "I want to run my read test repeatedly without 
> re-writing the data", tests would add flags for "skipCleanUp" and 
> "useExistingData".
> Check out the example I wrote up [2]
> I didn't want to invest much time on this before I opened a Jira/talked to 
> others, so I plan on expanding on this a bit more/formalizing it in the 
> testing docs.
> A reminder of some context:
> * The goals for the ITs & Perf tests are that they are *not* intended to be 
> the place where we exercise specific scenarios. Instead, they are tripwires 
> designed to find problems with code *we already believe works* (as proven by 
> the unit tests) when it runs against real data store instances/runners using 
> multiple nodes of both.
> There are some definite disadvantages: 
> * There is a class of bugs that you can miss doing this. (namely: "I mangled 
> the data on the way into the data store, and then reverse-mangled it again on 
> the way back out so it looks fine, even though it is bad in the db") I assume 
> that many of us have tested storage code in the past, and so we've thought 
> about this trade-off. In this particular environment, where it's 
> expensive/tricky to do independent testing of the storage code, I think this 
> is the right trade off.
> * The data loading scripts cannot be re-used between languages. I think this 
> will be a pretty small relative cost compared to the cost of writing the IO 
> in multiple languages, so it shouldn't matter too much. I think we'll save 
> more time in not needing to use external tools for loading data.
> * Read-only or write-only data stores - in this case, we'll either need to 
> default to the old plan, or implement data loading or verification using beam
> * This assumes the data store support parallelism - in the case where the 
> read or write cannot be split, we probably should limit the amount of data we 
> process in the tests to what we can reasonably do on a single worker anyway.
> * It's harder to debug when this fails - I agree, and part of what I hope to 
> invest a little time in as I go forward is to make it easier to determine 
> what the actual failure is. Presumably folks debugging a particular IO's 
> failures have tools to look at that IO and will be able to quickly determine 
> if it's failing on the read or write.
> * As with the previously before accepted proposal, we are relying on junit's 
> afterClass to do cleanups. I don't have a good answer for this - if it proves 
> to be a problem, we can investigate. 
> * This focuses the test exclusively on reading and writing. To address this, 
> if we wanted to write other types of tests, they could either piggy back off 
> the writeThenRead test, or it might be that they should be restricted to 
> smaller data sets and they should be tested independently from this test and 
> simply write their own data to the data store.
> There are some really nice advantages:
> * The test ends up being pretty simple and elegant.
> * We have no external dependencies
> * Read and write occurs the bare minimum number of times
> * I believe we'll be able to create shared PTransforms for generating test 
> data & validating test data.
> [0] Past discussion of IT guidelines - 
> https://lists.apache.org/thread.html/a8ea2507aee4a849cbb6cd7f3ae23fc8b47d447bd553fa01d6da6348@%3Cdev.beam.apache.org%3E
> [1] Current data loading for IT guidelines - 
> https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit#heading=h.uj505twpx0m
> [2] Example of writeThenRead test - 
> https://github.com/ssisk/beam/blob/jdbc-it-perf/sdks/java/io/jdbc/src/test/java/org/apache/beam/sdk/io/jdbc/JdbcIOIT.java#L147



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (BEAM-1799) IO ITs: simplify data loading design pattern

Reply via email to