leerho commented on code in PR #164:
URL:
https://github.com/apache/datasketches-website/pull/164#discussion_r1511616805
##########
docs/Theta/ThetaUpdateSpeed.md:
##########
@@ -48,16 +48,16 @@ To illustrate how the the optional <i>Resize Factor</i>
affects performance refe
<img class="doc-img-full"
src="{{site.docs_img_dir}}/theta/UpdateSpeedWithRF.png" alt="UpdateSpeedWithRF"
/>
* The blue curve is the same as the blue curve in the graph at the top of this
page.
-It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache
was initially created at full size, thus there is no resizing of the cache
during the life of the sketch. (There will be <i>rebuilding</i> but not
resizing.)
-* The red curve is the same sketch but configured with <i>ResizeFactor =
X2</i>, which means the sketch cache is initially created at a small size
(enough for about 16 entries), and then when the sketch needs more space for
the cache, it is resized by a factor of 2. Each of these resize events are
clearly seen in the plot as sawtooth spikes in the speed performance where the
sketch must pause, allocate a larger cache and then resize and reconstruct the
hash table. These spikes fall at factors of 2 along the X-axis (with the
exception of the last interval, which is a factor of 4 for reasons too
complicated to explain here.)
+It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache
was initially created at full size, thus there is no resizing of the cache
during the life of the sketch. (There will be <i>rebuilding</i> but no
resizing.)
+* The red curve is the same sketch but configured with <i>ResizeFactor =
X2</i>, which means the sketch cache is initially created at a small size
(enough for about 16 entries), and then when the sketch needs more space for
the cache, it is resized by a factor of 2. Each of these resize events are
clearly seen in the curve as sawtooth spikes where the sketch must pause,
allocate a larger cache and reconstruct the hash-table. These spikes fall at
factors of 2 along the X-axis (with the exception of the last interval, which
is a factor of 4 for reasons too complicated to explain here.)
* The green curve is the same sketch but configured with <i>ResizeFactor =
X8</i>. An RF = X4 is also available but not shown.
As one would expect the overall speed of the RF = X2 sketch is slower than the
RF = X1 sketch and the RF = X8 sketch is inbetween due to the amount of time
the sketch spends in resizing the cache.
The tradeoff here is the classic memory size versus speed. Suppose you have
millions of sketches that need to be allocated and your input data is highly
skewed (as is often the case). Most of the sketches will only have a few
entries and only a small fraction of all the sketches will actually go into
estimation mode and require a full-sized cache. The Resize Factor option
allows a memory allocation that would be orders of magnitude smaller than would
be required if all the sketches had to be allocated at full size. The default
Resize Factor is X8, which is a nice compromise for many environments.
### How these graphs were generated
-The goal of these measurements was to measure the limits of how fast these
sketches could update data from a continuous data stream not limited by system
overhead, string or array processing. In order to remove random noise from the
plots, each point on the graph represents an average of many trials. For the
low end of the graph the number of trials per point is 2^23 or 8M trials per
point. At the high end of 8 million uniques per trial the number of trials per
point is 2^4 or 16.
+The goal of these measurements was to measure the limits of how fast these
sketches could update data from a continuous data stream not limited by system
overhead, string or array processing. In order to remove random noise from the
plots, each point on the graph represents an average of many trials. For the
low end of the graph the number of trials per point is 2^23 or 8M trials per
point. At the high end, of 8 million uniques per trial, the number of trials
per point is 2^4 or 16.
Review Comment:
fixed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]