On Mon, 17 May 2021 22:08:01 GMT, Paul Sandoz <psan...@openjdk.org> wrote:

>> I see your concern. I wrote some additional benchmarks and added them here. 
>> First, CountSized, which just gets the stream size without actual traversal. 
>> We can see how the performance changes depending on the number of stream 
>> operations. I also added an optional type-profile pollution step that makes 
>> the `exactOutputSize` virtual method polymorphic. Here are the results:
>> 
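>> Roughly, each countN benchmark builds a SIZED stream, applies N trivial
>> intermediate ops, and calls `count()`. A minimal sketch of the shape of such
>> a benchmark (the class layout and the pollution mechanism here are
>> illustrative assumptions, not the actual benchmark source):
>> 
>> import java.util.concurrent.TimeUnit;
>> import java.util.stream.IntStream;
>> import org.openjdk.jmh.annotations.*;
>> 
>> @BenchmarkMode(Mode.AverageTime)
>> @OutputTimeUnit(TimeUnit.NANOSECONDS)
>> @State(Scope.Benchmark)
>> public class CountSized {
>>     @Param({"10000"})
>>     public int size;
>> 
>>     @Param({"false", "true"})
>>     public boolean pollute;
>> 
>>     @Setup
>>     public void setup() {
>>         if (pollute) {
>>             // Count over differently-shaped pipelines so the size-query
>>             // call site observes several receiver types.
>>             for (int i = 0; i < 10_000; i++) {
>>                 IntStream.range(0, 10).boxed().map(x -> x).count();
>>                 IntStream.range(0, 10).boxed().sorted().count();
>>             }
>>         }
>>     }
>> 
>>     @Benchmark
>>     public long count2() {
>>         // Two intermediate ops; count0..count10 differ only in op count.
>>         return IntStream.range(0, size).boxed().map(x -> x).map(x -> x).count();
>>     }
>> }
>> 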
>> Baseline (The `count10Skip` test added just to ensure that patch works)
>> 
>> Benchmark  (pollute)  (size)  Mode  Cnt       Score      Error  Units
>> count0         false   10000  avgt  100      15,648 ±    0,182  ns/op
>> count2         false   10000  avgt  100      31,252 ±    0,113  ns/op
>> count4         false   10000  avgt  100      47,683 ±    0,165  ns/op
>> count6         false   10000  avgt  100      64,417 ±    0,203  ns/op
>> count8         false   10000  avgt  100      80,813 ±    0,265  ns/op
>> count10        false   10000  avgt  100     101,057 ±    0,295  ns/op
>> count10Skip    false   10000  avgt  100  497967,375 ± 5946,108  ns/op
>> count0          true   10000  avgt  100      18,843 ±    0,103  ns/op
>> count2          true   10000  avgt  100      33,716 ±    0,152  ns/op
>> count4          true   10000  avgt  100      49,062 ±    0,208  ns/op
>> count6          true   10000  avgt  100      66,773 ±    0,237  ns/op
>> count8          true   10000  avgt  100      82,727 ±    0,354  ns/op
>> count10         true   10000  avgt  100     104,499 ±    0,299  ns/op
>> count10Skip     true   10000  avgt  100  501723,220 ± 6361,932  ns/op
>> 
>> Type pollution adds a near-constant ~2 ns overhead to the non-patched 
>> version as well.
>> 
>> Patched:
>> 
>> Benchmark  (pollute)  (size)  Mode  Cnt    Score   Error  Units
>> count0         false   10000  avgt  100   15,363 ± 0,086  ns/op
>> count2         false   10000  avgt  100   33,736 ± 0,138  ns/op
>> count4         false   10000  avgt  100   51,470 ± 0,205  ns/op
>> count6         false   10000  avgt  100   70,407 ± 0,262  ns/op
>> count8         false   10000  avgt  100   89,865 ± 0,262  ns/op
>> count10        false   10000  avgt  100  114,423 ± 0,363  ns/op
>> count10Skip    false   10000  avgt  100  139,963 ± 0,550  ns/op
>> count0          true   10000  avgt  100   26,538 ± 0,084  ns/op
>> count2          true   10000  avgt  100   46,089 ± 0,191  ns/op
>> count4          true   10000  avgt  100   66,560 ± 0,315  ns/op
>> count6          true   10000  avgt  100   87,852 ± 0,288  ns/op
>> count8          true   10000  avgt  100  109,037 ± 0,391  ns/op
>> count10         true   10000  avgt  100  139,759 ± 0,382  ns/op
>> count10Skip     true   10000  avgt  100  156,963 ± 1,862  ns/op
>> 
>> So indeed, we have some performance drawback in the patched version. Here's 
>> a chart:
>> 
>> ![image](https://user-images.githubusercontent.com/5114450/115098724-c754dd00-9f5b-11eb-86a0-b614a7d36fad.png)
>> I've calculated a linear regression on the (patched − baseline) times as a 
>> function of the number of ops: `y = 1.288x - 0.7078` for a clean type profile 
>> and `y = 2.6174x + 6.9489` for a polluted one. So, in the worst case, we have 
>> circa 2.6 ns per operation plus a constant ~7 ns overhead.
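>> As a sanity check on the fit: at x = 10 ops the polluted-profile model
>> predicts 2.6174 · 10 + 6.9489 ≈ 33 ns, close to the measured difference of
>> 139.759 − 104.499 ≈ 35 ns/op.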
>> 
>> However, using the Stream API without actually iterating the stream is a 
>> very rare case, and if we do iterate, the performance will be dominated by 
>> the number of iterations. I tried to model this case with the SumSized 
>> benchmark (replacing count with sum, for 5 and 10 stream elements), but got 
>> very confusing results.
>> 
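>> A hypothetical sum variant, added to the CountSized-style class sketched
>> above (again, the exact pipeline shapes are assumptions):
>> 
>> @Benchmark
>> public long sum2() {
>>     // Same pipeline shape as count2, but the terminal sum() traverses
>>     // every element, so per-element work dominates the cost.
>>     return IntStream.range(0, size).boxed().map(x -> x).map(x -> x)
>>             .mapToInt(Integer::intValue).sum();
>> }
>> 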
>> Baseline:
>> 
>> Benchmark  (pollute)  (size)  Mode  Cnt    Score    Error  Units
>> sum0            true       5  avgt  100  126,425 ±  0,793  ns/op
>> sum2            true       5  avgt  100  195,113 ±  1,359  ns/op
>> sum4            true       5  avgt  100  304,111 ±  8,302  ns/op
>> sum6            true       5  avgt  100  414,841 ±  3,215  ns/op
>> sum8            true       5  avgt  100  507,421 ±  4,781  ns/op
>> sum10           true       5  avgt  100  633,635 ±  7,105  ns/op
>> sum0           false       5  avgt  100   45,781 ±  0,258  ns/op
>> sum2           false       5  avgt  100   86,720 ±  0,573  ns/op
>> sum4           false       5  avgt  100  195,777 ±  1,145  ns/op
>> sum6           false       5  avgt  100  291,261 ±  2,091  ns/op
>> sum8           false       5  avgt  100  376,094 ±  3,283  ns/op
>> sum10          false       5  avgt  100  492,082 ±  7,914  ns/op
>> sum0            true      10  avgt  100  127,989 ±  0,758  ns/op
>> sum2            true      10  avgt  100  219,991 ±  3,081  ns/op
>> sum4            true      10  avgt  100  374,148 ±  7,426  ns/op
>> sum6            true      10  avgt  100  557,829 ±  3,959  ns/op
>> sum8            true      10  avgt  100  698,135 ±  4,915  ns/op
>> sum10           true      10  avgt  100  904,851 ± 14,458  ns/op
>> sum0           false      10  avgt  100   43,861 ±  0,107  ns/op
>> sum2           false      10  avgt  100  105,049 ±  0,276  ns/op
>> sum4           false      10  avgt  100  294,639 ±  1,499  ns/op
>> sum6           false      10  avgt  100  439,340 ±  4,223  ns/op
>> sum8           false      10  avgt  100  577,025 ±  5,760  ns/op
>> sum10          false      10  avgt  100  729,391 ±  6,045  ns/op
>> 
>> 
>> Patched:
>> 
>> Benchmark  (pollute)  (size)  Mode  Cnt    Score   Error  Units
>> sum0            true       5  avgt  100   68,466 ± 0,167  ns/op
>> sum2            true       5  avgt  100  107,240 ± 0,261  ns/op
>> sum4            true       5  avgt  100  209,469 ± 1,098  ns/op
>> sum6            true       5  avgt  100  300,873 ± 2,020  ns/op
>> sum8            true       5  avgt  100  378,654 ± 2,620  ns/op
>> sum10           true       5  avgt  100  473,769 ± 3,665  ns/op
>> sum0           false       5  avgt  100   49,435 ± 2,702  ns/op
>> sum2           false       5  avgt  100   96,237 ± 2,906  ns/op
>> sum4           false       5  avgt  100  195,196 ± 0,961  ns/op
>> sum6           false       5  avgt  100  286,542 ± 1,874  ns/op
>> sum8           false       5  avgt  100  371,664 ± 3,416  ns/op
>> sum10          false       5  avgt  100  457,178 ± 3,776  ns/op
>> sum0            true      10  avgt  100   69,223 ± 0,195  ns/op
>> sum2            true      10  avgt  100  120,507 ± 0,752  ns/op
>> sum4            true      10  avgt  100  291,328 ± 5,581  ns/op
>> sum6            true      10  avgt  100  439,136 ± 3,787  ns/op
>> sum8            true      10  avgt  100  569,188 ± 6,440  ns/op
>> sum10           true      10  avgt  100  709,916 ± 5,022  ns/op
>> sum0           false      10  avgt  100   46,347 ± 0,174  ns/op
>> sum2           false      10  avgt  100  109,131 ± 2,381  ns/op
>> sum4           false      10  avgt  100  296,566 ± 2,079  ns/op
>> sum6           false      10  avgt  100  430,852 ± 2,629  ns/op
>> sum8           false      10  avgt  100  562,795 ± 4,442  ns/op
>> sum10          false      10  avgt  100  716,229 ± 5,659  ns/op
>> 
>> 
>> Or, in graphical form:
>> ![image](https://user-images.githubusercontent.com/5114450/115098879-c2dcf400-9f5c-11eb-9fd3-f72ac7c1d59a.png)
>> ![image](https://user-images.githubusercontent.com/5114450/115098884-cc665c00-9f5c-11eb-9e9d-d7c9a7aa10ee.png)
>> 
>> For some reason, the non-patched polluted version is the slowest, and I 
>> cannot see any stable overhead in the patched version. For 4+ intermediate 
>> ops, the patched numbers are always better than the corresponding 
>> non-patched ones. I would be glad if somebody could explain these numbers or 
>> point out a flaw in this benchmark.
>> 
>> What do you think, @PaulSandoz? Is this an acceptable overhead, or should I 
>> experiment with the new Stream flag?
>
> @amaembo this dropped off my radar, but the Skara reminder made it visible 
> again!
> 
> Thank you for the detailed analysis. I cannot explain the results when 
> triggering profile pollution over the different kinds of stream.
> I think we have a good sense that the performance is not unduly perturbed for 
> small streams (there is more work done elsewhere than in determining the 
> actual size of the sized stream).
> 
> Implementation-wise, I still find it a little awkward that the operation is 
> responsible for calling the prior op in the pipeline, and we see the 
> consequences of that in the Slice op implementation, which branches on 
> `isParallel`.
> 
> It might be cleaner if the op were presented with some exact size and 
> adjusted it, something more pure, e.g. for SliceOp:
> 
> long exactOutputSize(long sourceSize) {
>   // Clamp the upstream size by this op's skip/limit parameters.
>   return calcSize(sourceSize, skip, normalizedLimit);
> }
> 
> The default impl would be:
> 
> long exactOutputSize(long sourceSize) {
>   // Most ops neither add nor remove elements, so the size passes through.
>   return sourceSize;
> }
> 
> 
> Then the pipeline helper is responsible for traversing the pipeline (be it 
> sequentially or in parallel; in the latter case the above method would not 
> get called, since the slice op becomes the source spliterator for the next 
> stages).
> 
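> Roughly sketching what the helper-side traversal could look like (names such 
> as `sourceStage`/`nextStage` are just for illustration):
> 
> long exactOutputSizeIfKnown(Spliterator<?> spliterator) {
>     long size = spliterator.getExactSizeIfKnown();
>     if (size < 0)
>         return size;
>     // Each stage maps the exact size it receives into the exact size it emits.
>     for (AbstractPipeline<?, ?, ?> stage = sourceStage; stage != null;
>          stage = stage.nextStage) {
>         size = stage.exactOutputSize(size);
>     }
>     return size;
> }
> 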
> To do that efficiently would, I think, require a new flag, set by an op and 
> only meaningful when SIZED is set (and cleared when SIZED is cleared, 
> although perhaps we only need to do that when splitting stages for parallel 
> execution; see `AbstractPipeline.sourceSpliterator`).

@PaulSandoz I added a new flag and updated `exactOutputSize` per your 
suggestion, implementing the explicit iteration inside 
`AbstractPipeline#exactOutputSizeIfKnown`, along with an `isParallel()` check. 
I'm not sure how to clear the new flag automatically, but that is not strictly 
necessary, as the flag is not consulted when `SIZED` is not set. Also, I 
skipped the `sourceStage` (starting from `sourceStage.nextStage`) to save one 
virtual call, as the current implementation never has a source stage that 
adjusts the size.
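
For reference, the user-visible effect is that sized pipelines using skip/limit
can now report their count without traversal, as the count10Skip benchmark
shows. A tiny demonstration of the behavior (not actual test code from the PR):

import java.util.stream.IntStream;

public class SizedCountDemo {
    public static void main(String[] args) {
        // With the patch, skip() adjusts the known size instead of discarding
        // it, so count() can return in O(1) instead of traversing ~10000
        // elements.
        long n = IntStream.range(0, 10_000).boxed().skip(5).count();
        System.out.println(n); // prints 9995
    }
}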

-------------

PR: https://git.openjdk.java.net/jdk/pull/3427
