The Go SDK doesn't yet have these counters implemented or published (sampling elements and counting between DoFns, etc.).
On Tue, May 28, 2019, 9:08 AM Alexey Romanenko <[email protected]> wrote:

> On 28 May 2019, at 17:31, Łukasz Gajowy <[email protected]> wrote:
>
> I'm not quite following what these sizes are needed for--aren't the benchmarks already tuned to be specific, known sizes?
>
> Maybe I wasn't clear enough. Such a metric is useful mostly in IO tests - different IOs generate records of different sizes. It would be ideal for us to have a universal way to get the total size so that we could provide some throughput measurement (we can easily get the time). In load tests we indeed have known sizes, but as I said above in point 2 - maybe it's worth looking at the other size as well (to compare)?
>
> Łukasz, I'm sorry but it's still not clear to me - what is the point of comparing these sizes? I mean that if we already have the size of the generated load (i.e. the expected data size) and the processing time after the test run, then we can calculate throughput. In addition, we compute a hash of all processed data and compare it with the expected hash to make sure that there is no data loss or corruption. Am I missing something?
>
> especially for benchmarking purposes a 5x overhead means you're benchmarking the sizing code, not the pipeline itself.
>
> Exactly. We don't want to do this.
>
> Beam computes estimates for PCollection sizes by using coder and sampling and publishes these as counters. It'd be best IMHO to reuse this. Are these counters not sufficient?
>
> I didn't know that, and this should do the trick! Is such a counter available for all SDKs (or at least Python and Java)? Is it supported by all runners (or at least Flink and Dataflow)? Where can I find it to see if it fits?
>
> Thanks!
>
> On Tue, May 28, 2019 at 16:46 Robert Bradshaw <[email protected]> wrote:
>
>> I'm not quite following what these sizes are needed for--aren't the benchmarks already tuned to be specific, known sizes? I agree that this can be expensive; especially for benchmarking purposes a 5x overhead means you're benchmarking the sizing code, not the pipeline itself.
>>
>> Beam computes estimates for PCollection sizes by using coder and sampling, and publishes these as counters. It'd be best IMHO to reuse this. Are these counters not sufficient?
>>
>> On Tue, May 28, 2019 at 12:55 PM Łukasz Gajowy <[email protected]> wrote:
>> >
>> > Hi all,
>> >
>> > Part of our work while creating benchmarks for Beam is to collect the total data size (in bytes) that was put through the testing pipeline. We need that in load tests of core Beam operations (to see how big the load really was) and in IO tests (to calculate throughput). The "not so good" way we're doing it right now is that we add a DoFn step called "ByteMonitor" to the pipeline to get the size of every element using a utility called "ObjectSizeCalculator" [1].
>> >
>> > Problems with this approach:
>> > 1. It's computationally expensive. After introducing this change, tests are 5x slower than before. This is because the size of each record is calculated separately.
>> > 2. Naturally, the size of a particular record measured this way is greater than the size of the generated key+value itself. E.g. if a synthetic source generates a key + value that has 10 bytes total, the collected total-bytes metric is 8x greater (due to wrapping the value in richer objects, allocating more memory than needed, etc).
>> >
>> > The main question here is: which size of particular records is more interesting in benchmarks? The, let's call it, "net" size (key + value size, and nothing else), or the "gross" size (including all memory allocated for a particular element in the PCollection and all the overhead of wrapping it in richer objects)? Maybe both sizes are worth measuring?
>> >
>> > For the "net" size we probably could (should?) do something similar to what the Nexmark suites have: pre-define a size per element type and read it once the element is spotted in the pipeline [3].
>> >
>> > What do you think? Is there any other (efficient + reliable) way of measuring the total load size that I missed?
>> >
>> > Thanks for opinions!
>> >
>> > Best,
>> > Łukasz
>> >
>> > [1] https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java
>> > [2] https://issues.apache.org/jira/browse/BEAM-7431
>> > [3] https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java
>> >
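To make the discussion concrete, here is a minimal sketch of a ByteMonitor-like monitoring step. The class and counter names are hypothetical, and it sizes elements with their coder (giving the encoded, "net" size) rather than with ObjectSizeCalculator; like the step described above, it still pays the sizing cost for every single element:

    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.util.CoderUtils;

    // Hypothetical ByteMonitor-like step: counts the encoded size of every element
    // and passes the element through unchanged.
    class CoderByteMonitor<T> extends DoFn<T, T> {
      private final Counter totalBytes;
      private final Coder<T> coder;

      CoderByteMonitor(String namespace, String name, Coder<T> coder) {
        this.totalBytes = Metrics.counter(namespace, name);
        this.coder = coder;
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        // Encoding each element separately is exactly what makes this expensive.
        totalBytes.inc(CoderUtils.encodeToByteArray(coder, c.element()).length);
        c.output(c.element());
      }
    }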
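The sampled counters Robert mentions are published by the SDK/runners themselves; the following is not that implementation, only an illustration of why sampling helps: encoding roughly 1 in N elements and scaling the result divides the sizing overhead by about N, at the cost of some accuracy.

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.coders.Coder;
    import org.apache.beam.sdk.metrics.Counter;
    import org.apache.beam.sdk.metrics.Metrics;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.util.CoderUtils;

    // Illustrative sampling variant: encode ~1 in SAMPLE_RATE elements and scale.
    class SampledByteMonitor<T> extends DoFn<T, T> {
      private static final int SAMPLE_RATE = 100; // encode ~1% of elements
      private final Counter estimatedBytes = Metrics.counter("monitor", "estimatedBytes");
      private final Coder<T> coder;

      SampledByteMonitor(Coder<T> coder) {
        this.coder = coder;
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws Exception {
        if (ThreadLocalRandom.current().nextInt(SAMPLE_RATE) == 0) {
          long sampled = CoderUtils.encodeToByteArray(coder, c.element()).length;
          estimatedBytes.inc(sampled * SAMPLE_RATE); // scale the sample up
        }
        c.output(c.element());
      }
    }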
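Finally, the Nexmark-style "net" size approach from [3] boils down to elements that know their own size, so reading it is a single method call rather than reflection or encoding. SyntheticRecord below is a hypothetical element type:

    // The Nexmark model types implement an interface along these lines [3]:
    interface KnownSize {
      long sizeInBytes();
    }

    // Hypothetical synthetic element: its "net" size is just key + value bytes,
    // fixed when the record is generated, with no per-element measurement cost.
    class SyntheticRecord implements KnownSize {
      private final byte[] key;
      private final byte[] value;

      SyntheticRecord(byte[] key, byte[] value) {
        this.key = key;
        this.value = value;
      }

      @Override
      public long sizeInBytes() {
        return key.length + value.length;
      }
    }

A monitoring step can then sum sizeInBytes() into a counter, which makes the "net" total essentially free compared to sizing each record with ObjectSizeCalculator.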
