jmalkin commented on code in PR #164:
URL: https://github.com/apache/datasketches-website/pull/164#discussion_r1511551492
##########
docs/Theta/ThetaUpdateSpeed.md:
##########
@@ -22,23 +22,23 @@ layout: doc_page
## Theta Family Update Speed
### Resize Factor = X1
-The following graph illustrates the update speed of 3 different sketches from
the library: the Heap QuickSelect Sketch, the Off-Heap QuickSelect Sketch, and
the Heap Alpha Sketch.
-The X-axis is the number of unique values presented to a sketch. The Y-axis is
the average time to perform an update. It is computed as the total time to
update X-uniques divided by X-uniques.
+The following graph illustrates the update speed of 3 different sketches from
the library: the Heap QuickSelect (QS) Sketch, the Off-Heap QuickSelect (QS)
Sketch, and the Heap Alpha Sketch.
Review Comment:
Do we need to define QS twice?
##########
docs/Theta/ThetaUpdateSpeed.md:
##########
@@ -22,23 +22,23 @@ layout: doc_page
## Theta Family Update Speed
### Resize Factor = X1
-The following graph illustrates the update speed of 3 different sketches from
the library: the Heap QuickSelect Sketch, the Off-Heap QuickSelect Sketch, and
the Heap Alpha Sketch.
-The X-axis is the number of unique values presented to a sketch. The Y-axis is
the average time to perform an update. It is computed as the total time to
update X-uniques divided by X-uniques.
+The following graph illustrates the update speed of 3 different sketches from
the library: the Heap QuickSelect (QS) Sketch, the Off-Heap QuickSelect (QS)
Sketch, and the Heap Alpha Sketch.
+The X-axis is the number of unique values presented to a sketch. The Y-axis is
the average time to perform an update. It is computed as the total time to
update X-uniques, divided by X-uniques.
-The high values on the left are due to Java overhead and JVM warmup. The
humps in the middle of the graph are due to the internal hash table filling up
and forcing an internal rebuild and reducing theta. For this plot the sketches
were configured with <i>k</i> = 4096.
-The sawtooth peaks on the QS plots represent successive reqbuilds. The
downward slope on the right side of the hump is the sketch speeding up because
it is rejecting more and more incoming hash values due to the continued
reduction in the value of theta.
-The Alpha sketch (in red) uses a more advanced hash table update algorithm
that defers the first rebuild until after theta has started decreasing. This
is the little spike just to the right of the hump.
+The high values on the left are due to Java overhead and JVM warmup. The
spikes starting at about 4K uniques are due to the internal hash-table filling
up and forcing an internal hash-table rebuild, which also reduces theta. For
this plot the sketches were configured with <i>k</i> = 4096.
+The sawtooth peaks on the QuickSelect curves represent successive rebuilds.
The downward slope on the right side of the largest spike is the sketch
speeding up because it is rejecting more and more incoming hash values due to
the continued reduction in the value of theta.
+The Alpha sketch (in red) uses a more advanced hash-table update algorithm
that defers the first rebuild until after theta has started decreasing. This
is the little spike just to the right of the maximum of the curve.
Review Comment:
The maximum is way down at 1. Maybe give the approximate x value?
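The fill/rebuild/theta behavior the revised text describes can be sketched with a toy model. This is an illustration under assumed parameters (a cache that holds 2k entries before rebuilding, hashes modeled as uniform values in [0, 1)) and not the library's actual QuickSelect implementation:

```python
import random

def simulate_rebuilds(k=4096, n=100_000, seed=42):
    """Toy model of a QuickSelect-style theta sketch: accept a hash only
    if it is below theta; when the cache fills (assumed capacity 2k),
    rebuild by keeping the k smallest hashes and lowering theta."""
    rng = random.Random(seed)
    theta = 1.0
    cache = []
    rebuilds = []                # unique counts at which rebuilds occurred
    for i in range(n):
        h = rng.random()         # stand-in for a 64-bit hash mapped to [0, 1)
        if h >= theta:
            continue             # rejected: the cheap path that speeds the sketch up
        cache.append(h)
        if len(cache) == 2 * k:  # cache full: rebuild and reduce theta
            cache.sort()
            theta = cache[k]     # new theta = the (k+1)-th smallest hash
            cache = cache[:k]    # keep only the k smallest entries
            rebuilds.append(i + 1)
    return rebuilds, theta

rebuilds, theta = simulate_rebuilds()
print(rebuilds)          # successive rebuilds, roughly doubling in spacing
print(round(theta, 4))   # theta roughly halves at each rebuild
```

In this model the first rebuild lands at exactly 2k uniques, each later rebuild lands at roughly twice the unique count of the previous one, and theta roughly halves each time, which is why the sawtooth spikes thin out and the average update time falls on the right side of the curve.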
##########
docs/Theta/ThetaUpdateSpeed.md:
##########
@@ -22,23 +22,23 @@ layout: doc_page
## Theta Family Update Speed
### Resize Factor = X1
-The following graph illustrates the update speed of 3 different sketches from
the library: the Heap QuickSelect Sketch, the Off-Heap QuickSelect Sketch, and
the Heap Alpha Sketch.
-The X-axis is the number of unique values presented to a sketch. The Y-axis is
the average time to perform an update. It is computed as the total time to
update X-uniques divided by X-uniques.
+The following graph illustrates the update speed of 3 different sketches from
the library: the Heap QuickSelect (QS) Sketch, the Off-Heap QuickSelect (QS)
Sketch, and the Heap Alpha Sketch.
+The X-axis is the number of unique values presented to a sketch. The Y-axis is
the average time to perform an update. It is computed as the total time to
update X-uniques, divided by X-uniques.
-The high values on the left are due to Java overhead and JVM warmup. The
humps in the middle of the graph are due to the internal hash table filling up
and forcing an internal rebuild and reducing theta. For this plot the sketches
were configured with <i>k</i> = 4096.
-The sawtooth peaks on the QS plots represent successive reqbuilds. The
downward slope on the right side of the hump is the sketch speeding up because
it is rejecting more and more incoming hash values due to the continued
reduction in the value of theta.
-The Alpha sketch (in red) uses a more advanced hash table update algorithm
that defers the first rebuild until after theta has started decreasing. This
is the little spike just to the right of the hump.
+The high values on the left are due to Java overhead and JVM warmup. The
spikes starting at about 4K uniques are due to the internal hash-table filling
up and forcing an internal hash-table rebuild, which also reduces theta. For
this plot the sketches were configured with <i>k</i> = 4096.
Review Comment:
I don't think I've seen hash table hyphenated before. Why are we changing to
that usage here?
##########
docs/Theta/ThetaUpdateSpeed.md:
##########
@@ -48,16 +48,16 @@ To illustrate how the the optional <i>Resize Factor</i>
affects performance refe
<img class="doc-img-full"
src="{{site.docs_img_dir}}/theta/UpdateSpeedWithRF.png" alt="UpdateSpeedWithRF"
/>
* The blue curve is the same as the blue curve in the graph at the top of this
page.
-It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache
was initially created at full size, thus there is no resizing of the cache
during the life of the sketch. (There will be <i>rebuilding</i> but not
resizing.)
-* The red curve is the same sketch but configured with <i>ResizeFactor =
X2</i>, which means the sketch cache is initially created at a small size
(enough for about 16 entries), and then when the sketch needs more space for
the cache, it is resized by a factor of 2. Each of these resize events are
clearly seen in the plot as sawtooth spikes in the speed performance where the
sketch must pause, allocate a larger cache and then resize and reconstruct the
hash table. These spikes fall at factors of 2 along the X-axis (with the
exception of the last interval, which is a factor of 4 for reasons too
complicated to explain here.)
+It was generated with <i>ResizeFactor = X1</i>, which means the sketch cache
was initially created at full size, thus there is no resizing of the cache
during the life of the sketch. (There will be <i>rebuilding</i> but no
resizing.)
+* The red curve is the same sketch but configured with <i>ResizeFactor =
X2</i>, which means the sketch cache is initially created at a small size
(enough for about 16 entries), and then when the sketch needs more space for
the cache, it is resized by a factor of 2. Each of these resize events are
clearly seen in the curve as sawtooth spikes where the sketch must pause,
allocate a larger cache and reconstruct the hash-table. These spikes fall at
factors of 2 along the X-axis (with the exception of the last interval, which
is a factor of 4 for reasons too complicated to explain here.)
* The green curve is the same sketch but configured with <i>ResizeFactor =
X8</i>. An RF = X4 is also available but not shown.
As one would expect the overall speed of the RF = X2 sketch is slower than the
RF = X1 sketch and the RF = X8 sketch is inbetween due to the amount of time
the sketch spends in resizing the cache.
The tradeoff here is the classic memory size versus speed. Suppose you have
millions of sketches that need to be allocated and your input data is highly
skewed (as is often the case). Most of the sketches will only have a few
entries and only a small fraction of all the sketches will actually go into
estimation mode and require a full-sized cache. The Resize Factor option
allows a memory allocation that would be orders of magnitude smaller than would
be required if all the sketches had to be allocated at full size. The default
Resize Factor is X8, which is a nice compromise for many environments.
### How these graphs were generated
-The goal of these measurements was to measure the limits of how fast these
sketches could update data from a continuous data stream not limited by system
overhead, string or array processing. In order to remove random noise from the
plots, each point on the graph represents an average of many trials. For the
low end of the graph the number of trials per point is 2^23 or 8M trials per
point. At the high end of 8 million uniques per trial the number of trials per
point is 2^4 or 16.
+The goal of these measurements was to measure the limits of how fast these
sketches could update data from a continuous data stream not limited by system
overhead, string or array processing. In order to remove random noise from the
plots, each point on the graph represents an average of many trials. For the
low end of the graph the number of trials per point is 2^23 or 8M trials per
point. At the high end, of 8 million uniques per trial, the number of trials
per point is 2^4 or 16.
Review Comment:
I think I'd use "at" or "with" instead of "of" with this change:
At the high end, at 8 million uniques per trial, ...
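The ResizeFactor growth pattern covered in this hunk can be illustrated with a small sketch. The initial size of about 16 entries comes from the quoted text; the full-size cache of 2k entries is an assumption for illustration, not the library's exact sizing rule:

```python
def resize_events(k=4096, rf=2, initial=16):
    """List the cache sizes at which a resize occurs: the cache starts
    small and is grown by the resize factor each time it fills, until
    it reaches full size (assumed here to be 2 * k entries)."""
    full_size = 2 * k
    sizes = []
    size = initial
    while size < full_size:
        sizes.append(size)   # cache fills at this size, then grows by rf
        size *= rf
    return sizes

print(resize_events())       # [16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
print(resize_events(rf=8))   # [16, 128, 1024]
```

With RF = X2 the resize events land at successive powers of 2 along the X-axis, matching the sawtooth spikes described for the red curve; a larger factor such as X8 incurs far fewer resize events, which is consistent with the X8 curve sitting between X1 and X2 in overall speed.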
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]