On 28 May 2019, at 17:31, Łukasz Gajowy <[email protected]> wrote:

> > I'm not quite following what these sizes are needed for--aren't the
> > benchmarks already tuned to be specific, known sizes?
>
> Maybe I wasn't clear enough. Such a metric is useful mostly in IO tests -
> different IOs generate records of different sizes. It would be ideal for us
> to have a universal way to get the total size so that we could provide some
> throughput measurement (we can easily get the time). In the load tests we
> indeed have known sizes, but as I said above in point 2 - maybe it's worth
> looking at the other size as well (to compare)?
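To make the throughput idea above concrete, here is a minimal sketch in plain Python (not Beam API; `run_benchmark` and `record_digest` are hypothetical helpers invented for illustration). Given the total byte size of the load and the measured run time, throughput is just bytes per second; the per-record digests are combined with addition so the final checksum does not depend on the order in which a distributed runner processes the records.

```python
import hashlib
import time

def record_digest(record: bytes) -> int:
    # Per-record digest, combined by summation below so the total
    # checksum is independent of processing order.
    return int.from_bytes(hashlib.sha256(record).digest()[:8], "big")

def run_benchmark(records, process):
    # Hypothetical harness; `process` stands in for the pipeline under test.
    total_bytes = sum(len(r) for r in records)
    expected_hash = sum(record_digest(r) for r in records) % (1 << 64)

    start = time.monotonic()
    processed = [process(r) for r in records]
    elapsed = time.monotonic() - start

    actual_hash = sum(record_digest(r) for r in processed) % (1 << 64)
    throughput = total_bytes / elapsed if elapsed > 0 else float("inf")
    return throughput, actual_hash == expected_hash

# An identity pipeline must preserve the checksum.
data = [b"key1:value1", b"key2:value2", b"key3:value3"]
throughput, ok = run_benchmark(data, lambda r: r)
print(ok)  # True
```

In a real Beam test the checksum would be computed inside the pipeline (e.g. as a combine step), but the arithmetic is the same.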
Łukasz, I'm sorry, but it's still not clear to me - what is the point of
comparing these sizes? What I mean is: if we already have the size of the
generated load (i.e. the expected data size) and the processing time after
the test run, then we can calculate throughput. In addition, we compute a
hash of all processed data and compare it with the expected hash to make
sure that there is no data loss or corruption. Am I missing something?

> > especially for benchmarking purposes a 5x
> > overhead means you're benchmarking the sizing code, not the pipeline
> > itself.
>
> Exactly. We don't want to do this.
>
> > Beam computes estimates for PCollection sizes by using coder and
> > sampling and publishes these as counters. It'd be best IMHO to reuse
> > this. Are these counters not sufficient?
>
> I didn't know that - this should do the trick! Is such a counter available
> for all SDKs (or at least Python and Java)? Is it supported by all runners
> (or at least Flink and Dataflow)? Where can I find it to see if it fits?
>
> Thanks!
>
>
> On Tue, 28 May 2019 at 16:46, Robert Bradshaw <[email protected]> wrote:
> I'm not quite following what these sizes are needed for--aren't the
> benchmarks already tuned to be specific, known sizes? I agree that
> this can be expensive; especially for benchmarking purposes a 5x
> overhead means you're benchmarking the sizing code, not the pipeline
> itself.
>
> Beam computes estimates for PCollection sizes by using coder and
> sampling, and publishes these as counters. It'd be best IMHO to reuse
> this. Are these counters not sufficient?
>
> On Tue, May 28, 2019 at 12:55 PM Łukasz Gajowy <[email protected]> wrote:
> >
> > Hi all,
> >
> > part of our work while creating benchmarks for Beam is to collect the
> > total data size (in bytes) that was put inside the testing pipeline.
> > We need that in load tests of core Beam operations (to see how big the
> > load really was) and in IO tests (to calculate throughput). The "not so
> > good" way we're doing it right now is that we add a DoFn step called
> > "ByteMonitor" to the pipeline to get the size of every element using a
> > utility called "ObjectSizeCalculator" [1].
> >
> > Problems with this approach:
> > 1. It's computationally expensive. After introducing this change, tests
> > are 5x slower than before. This is due to the fact that the size of each
> > record is now calculated separately.
> > 2. Naturally, the size of a particular record measured this way is
> > greater than the size of the generated key+value itself. E.g. if a
> > synthetic source generates a key + value that has 10 bytes in total, the
> > collected total bytes metric is 8x greater (due to wrapping the value in
> > richer objects, allocating more memory than needed, etc.).
> >
> > The main question here is: which size of particular records is more
> > interesting in benchmarks? The, let's call it, "net" size (key + value
> > size and nothing else), or the "gross" size (including all the memory
> > allocated for a particular element in the PCollection and all the
> > overhead of wrapping it in richer objects)? Maybe both sizes are worth
> > measuring?
> >
> > For the "net" size we probably could (should?) do something similar to
> > what the Nexmark suites have: pre-define a size per element type and read
> > it once the element is spotted in the pipeline [3].
> >
> > What do you think? Is there any other (efficient + reliable) way of
> > measuring the total load size that I missed?
> >
> > Thanks for opinions!
> >
> > Best,
> > Łukasz
> >
> > [1] https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java
> > [2] https://issues.apache.org/jira/browse/BEAM-7431
> > [3] https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java
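The coder-plus-sampling idea discussed in this thread can be sketched roughly as follows. This is plain illustrative Python, not Beam's actual counter implementation: `encoded_size` stands in for a real coder's encode step, and `estimate_total_bytes` shows why sampling avoids the 5x overhead - only a small fraction of elements is ever encoded, and the total is extrapolated from the sample mean and the exact element count.

```python
import random

def encoded_size(element) -> int:
    # Stand-in for a Beam coder: here we just measure the UTF-8 encoding.
    # In a real pipeline this would be len(coder.encode(element)).
    return len(str(element).encode("utf-8"))

def estimate_total_bytes(elements, sample_rate=0.05, seed=42):
    # Encode only a random sample of elements and extrapolate, so the
    # sizing cost stays a small fraction of the pipeline's own work.
    count = 0
    sampled = 0
    sampled_bytes = 0
    rng = random.Random(seed)
    for e in elements:
        count += 1
        if rng.random() < sample_rate:
            sampled += 1
            sampled_bytes += encoded_size(e)
    if sampled == 0:
        return 0
    # Exact count times estimated mean encoded size.
    return int(sampled_bytes * count / sampled)

elements = [f"key{i}:value{i}" for i in range(100_000)]
exact = sum(encoded_size(e) for e in elements)
estimate = estimate_total_bytes(elements)
print(f"exact={exact} estimate={estimate}")
```

Because the element count is known exactly, the only error comes from estimating the mean encoded size, which converges quickly even at low sample rates. This measures the "net" (coder-encoded) size; the "gross" in-memory size would still need something like ObjectSizeCalculator.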
