On 28 May 2019, at 17:31, Łukasz Gajowy <[email protected]> wrote:
> 
> I'm not quite following what these sizes are needed for--aren't the
> benchmarks already tuned to be specific, known sizes?
> 
> Maybe I wasn't clear enough. Such a metric is useful mostly in IO tests - 
> different IOs generate records of different sizes. It would be ideal for us to 
> have a universal way to get the total size so that we could provide some 
> throughput measurement (we can easily get the time). In load tests we indeed 
> have known sizes, but as I said above in point 2 - maybe it's worth looking at 
> the other size as well (to compare)?

Łukasz, I'm sorry but it's still not clear to me - what is the point of comparing 
these sizes? My point is that if we already have the size of the generated load 
(i.e. the expected data size) and the processing time after the test run, then we 
can calculate throughput. In addition, we compute a hash of all processed data 
and compare it with the expected hash to make sure that there is no data loss 
or corruption. Am I missing something?
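To make that concrete, here is a minimal Python sketch (my own illustration, not Beam code) of deriving throughput from a known load size and measured runtime, plus an order-independent content hash for the loss/corruption check:

```python
import hashlib

def throughput_mbps(total_bytes: int, runtime_s: float) -> float:
    # With a known generated load size and a measured runtime,
    # throughput is a plain division -- no per-element sizing needed.
    return total_bytes / runtime_s / 1_000_000

def content_hash(records) -> str:
    # Order-independent digest: hash each record, sort the per-record
    # digests, then hash the concatenation, so the result does not depend
    # on the order in which a distributed runner emits elements.
    digests = sorted(hashlib.sha256(r).hexdigest() for r in records)
    return hashlib.sha256("".join(digests).encode()).hexdigest()
```

The sorting step is the important part: a runner processes elements in no particular order, so a naive running hash would be nondeterministic.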


> 
> especially for benchmarking purposes a 5x
> overhead means you're benchmarking the sizing code, not the pipeline
> itself.
> 
> Exactly. We don't want to do this.
> 
> Beam computes estimates for PCollection sizes by using coders and
> sampling, and publishes these as counters. It'd be best IMHO to reuse
> this. Are these counters not sufficient?
> 
> I didn't know that and this should do the trick! Is such a counter available 
> for all SDKs (or at least Python and Java)? Is it supported by all runners 
> (or at least Flink and Dataflow)? Where can I find it to see if it fits? 
> 
> Thanks!
> 
> 
> On Tue, May 28, 2019 at 4:46 PM Robert Bradshaw <[email protected] 
> <mailto:[email protected]>> wrote:
> I'm not quite following what these sizes are needed for--aren't the
> benchmarks already tuned to be specific, known sizes? I agree that
> this can be expensive; especially for benchmarking purposes a 5x
> overhead means you're benchmarking the sizing code, not the pipeline
> itself.
> 
> Beam computes estimates for PCollection sizes by using coders and
> sampling, and publishes these as counters. It'd be best IMHO to reuse
> this. Are these counters not sufficient?
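As a rough illustration of that sampling approach (a simplified, self-contained sketch of the technique, not Beam's actual counter implementation - pickle stands in for a Beam coder here), one encodes only a random fraction of the elements and extrapolates:

```python
import pickle
import random

def estimate_total_bytes(elements, sample_rate=0.01, seed=42):
    # Encode only a random sample of elements and extrapolate the total,
    # instead of sizing every single record (the per-record approach is
    # what made the tests 5x slower).
    rng = random.Random(seed)
    sampled_bytes = 0
    sampled_count = 0
    total_count = 0
    for element in elements:
        total_count += 1
        if rng.random() < sample_rate:
            sampled_bytes += len(pickle.dumps(element))
            sampled_count += 1
    if sampled_count == 0:
        return 0
    # Mean sampled element size, scaled to the full element count.
    return int(sampled_bytes / sampled_count * total_count)
```

The estimate is exact for homogeneous records and approximate otherwise, which is usually fine for a throughput denominator.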
> 
> On Tue, May 28, 2019 at 12:55 PM Łukasz Gajowy <[email protected] 
> <mailto:[email protected]>> wrote:
> >
> > Hi all,
> >
> > part of our work while creating benchmarks for Beam is to collect the total 
> > data size (in bytes) that was pushed through the testing pipeline. We need 
> > that in load tests of core Beam operations (to see how big the load really 
> > was) and in IO tests (to calculate throughput). The "not so good" way we're 
> > doing it right now is that we add a DoFn step called "ByteMonitor" to the 
> > pipeline to get the size of every element using a utility called 
> > "ObjectSizeCalculator" [1].
> >
> > Problems with this approach:
> > 1. It's computationally expensive. After introducing this change, tests are 
> > 5x slower than before, because the size of each record is now calculated 
> > separately.
> > 2. Naturally, the size of a particular record measured this way is greater 
> > than the size of the generated key+value itself. E.g. if a synthetic source 
> > generates a key + value that has 10 bytes total, the collected total 
> > bytes metric is 8x greater (due to wrapping the value in richer objects, 
> > allocating more memory than needed, etc.).
> >
> > The main question here is: which record size is more interesting in 
> > benchmarks? The, let's call it, "net" size (key + value size and nothing 
> > else), or the "gross" size (including all memory allocated for a particular 
> > element in the PCollection and all the overhead of wrapping it in richer 
> > objects)? Maybe both sizes are worth measuring?
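(Editorial aside: the "net" vs "gross" distinction can be illustrated crudely in Python, with sys.getsizeof standing in for the JVM-side ObjectSizeCalculator - the absolute numbers differ across runtimes, but the effect is the same:)

```python
import sys

def net_size(key: bytes, value: bytes) -> int:
    # "Net" size: only the payload bytes the synthetic source generated.
    return len(key) + len(value)

def gross_size(key: bytes, value: bytes) -> int:
    # "Gross" size: in-memory footprint, including object headers and
    # allocation overhead, which is why it comes out several times larger.
    return sys.getsizeof(key) + sys.getsizeof(value)
```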
> >
> > For the "net" size we probably could (should?) do something similar to what 
> > Nexmark suites have: pre-define size per each element type and read it once 
> > the element is spotted in the pipeline [3].
> >
> > What do you think? Is there any other (efficient + reliable) way of 
> > measuring the total load size that I missed?
> >
> > Thanks for opinions!
> >
> > Best,
> > Łukasz
> >
> > [1] https://github.com/apache/beam/blob/a16a5b71cf8d399070a72b0f062693180d56b5ed/sdks/java/testing/test-utils/src/main/java/org/apache/beam/sdk/testutils/metrics/ByteMonitor.java
> > [2] https://issues.apache.org/jira/browse/BEAM-7431
> > [3] https://github.com/apache/beam/blob/eb3b57554d9dc4057ad79bdd56c4239bd4204656/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/KnownSize.java
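To make the Nexmark-style approach in [3] concrete: each element type declares its byte size up front, so a monitoring step just reads a field instead of walking the object graph. A rough Python sketch (KVRecord is a hypothetical element type for illustration, not the actual Nexmark model):

```python
from dataclasses import dataclass

class KnownSize:
    """Element types declare their own pre-computed "net" byte size."""

    def known_size_bytes(self) -> int:
        raise NotImplementedError

@dataclass
class KVRecord(KnownSize):
    # Hypothetical element type for illustration.
    key: bytes
    value: bytes

    def known_size_bytes(self) -> int:
        # Cheap field arithmetic -- no reflection, no encoding step.
        return len(self.key) + len(self.value)

def total_known_size(elements) -> int:
    # A monitor can sum these per-element sizes at negligible cost.
    return sum(e.known_size_bytes() for e in elements)
```

This gives exactly the "net" size and avoids both the 5x slowdown and the 8x inflation discussed above, at the cost of maintaining a size definition per element type.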
