Re: [PR] Update LargeScale.md [datasketches-website]

via GitHub Sat, 02 Mar 2024 14:33:30 -0800


leerho commented on code in PR #163:
URL: 
https://github.com/apache/datasketches-website/pull/163#discussion_r1510086714



##########
docs/Architecture/LargeScale.md:
##########
@@ -21,20 +21,47 @@ layout: doc_page
 -->
 ## Designed for Large-scale Computing Systems
 
+#### Multiple Languages
+
+* The DataSketches library is now available in three languages: Java, C++, and 
Python. A fourth language, GoLang, is in development.
+ 
+
+### Compatibility Across Languages, Software Versions And Binary Serialization 
Versions
+Large-scale computing environments may have a mix of platforms that use 
different programming languages, each possibly running a different 
*Software Version* of our DataSketches library. Cross-version compatibility of 
software is a challenge that all platforms face, and it is up to the platform 
maintainers to keep their software up to date. This is neither new nor 
different with the DataSketches library.
+
+Nonetheless, it is our goal to make it as easy as practically possible to 
serialize our sketches in one of our supported languages on one platform and 
to deserialize them in a different supported language, potentially on a 
different, even remote platform, and perhaps much later in time.
+
+With this goal in mind, here are some of the key strategic decisions we have 
made in the development of the DataSketches library. 
+
+#### Two levels of versioning.
+
+* **Software Version:** This is the release version, published via Apache.org 
and specified in the POM file or equivalent. This can change relatively 
frequently based on bug fixes and introduction of new capabilities. We follow 
the principles of *Semantic Versioning* as specified by 
[semver.org](https://semver.org).
+
+* **Serialization Version:** (*SerVer*) This is a small integer placed in the 
preamble of the serialized byte array that indicates the version of the 
serialized structure for the sketch. This is very similar to Java's [*Class 
File Format Version*](https://en.wikipedia.org/wiki/Java_class_file). A single 
*SerVer* may represent multiple structures, all based on the same sketch when 
stored in different states (e.g., *Single Item*, *Compact*, *Updatable*, etc.). 
This *SerVer* changes very rarely, if at all. Of all of our sketches, only a 
few, e.g., Theta, KLL and Sampling, have had more than one *SerVer* over time. 
There are and will be many *Software Versions* of the same sketch that still 
use the same *SerVer*. When we have to update the *SerVer*, the *Software 
Version* of the code that introduces the new *SerVer* is able to read the old 
*SerVer* and convert it to the new *SerVer*. This is why our newest *Software 
Versions* can still read and interpret older *SerVer* serialized sketches that 
go back to when our project was started at Yahoo (2012), and before we went 
open-source (2015). Technically speaking, this can be described as 
*Backward-Transitive* compatibility; see [Schema Evolution and 
Compatibility](https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html)
 and [Schema Evolution](https://en.wikipedia.org/wiki/Schema_evolution).
+
+From the user's perspective, as long as the *SerVer* is the same, older 
*Software Versions* should be able to read sketch images created by newer 
*Software Versions*, although the APIs may differ. An older 
*Software Version* will not be able to take advantage of new features 
introduced in new *Software Versions*, but it should be able to do what it did 
before. In other words, there will be no loss of access to the serialized 
sketch or to the older *Software Version* capabilities. A user should not need 
to access the *SerVer*; nonetheless, it is always stored in index one of the 
serialized image. If a sketch is presented with a *SerVer* that it is not 
compatible with, the sketch should throw an exception that states the problem, 
just like Java does with its *Class File Format Versions*.
+
+#### The Serialized Image of a Sketch
+* The structure (or image) of a serialized sketch is independent of the 
language from which it was created. 
+* The sketch image only contains little-endian primitives, such as int64, 
int32, int16, int8, double-64, float-32, UTF-8 strings, and simple array 
structures of those, which can be easily interpreted in many languages on 
modern CPUs. We do not support big-endian serialization.
+* The sketch image is unique for each type of sketch.
+* Simply speaking, a sketch image can be viewed as a blob of bytes, which is 
easily stored and easily transported using many different protocols and 
encodings, including Protobuf, Avro, Thrift, Base64, etc.
+

Review Comment:
   Excellent point.
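
   For readers of this thread, here is a minimal sketch of what the quoted text implies for a consumer of a sketch image. It is not the DataSketches API. The helper names (`read_ser_ver`, `check_ser_ver`, `read_le_int64`) and the toy image layout are hypothetical; the only facts taken from the text are that the *SerVer* sits at index one of the image (read here as a byte index, which is an assumption) and that multi-byte primitives are little-endian.

```python
import struct

# From the quoted text: the SerVer is stored at index one of the image.
# Treating that as a byte index is an assumption of this sketch.
SER_VER_INDEX = 1


def read_ser_ver(image: bytes) -> int:
    """Return the SerVer byte of a serialized image (hypothetical helper)."""
    if len(image) <= SER_VER_INDEX:
        raise ValueError("image too short to contain a SerVer byte")
    return image[SER_VER_INDEX]


def check_ser_ver(image: bytes, supported: set) -> int:
    """Raise a descriptive error if the image's SerVer is not supported."""
    ser_ver = read_ser_ver(image)
    if ser_ver not in supported:
        raise ValueError(
            f"unsupported SerVer {ser_ver}; supported: {sorted(supported)}"
        )
    return ser_ver


def read_le_int64(image: bytes, offset: int) -> int:
    """Read a little-endian signed 64-bit integer at the given byte offset."""
    return struct.unpack_from("<q", image, offset)[0]


# Toy image with a made-up layout: an 8-byte preamble whose byte 1 is the
# SerVer (3 here), followed by one little-endian int64 payload value.
toy_image = bytes([2, 3, 0, 0, 0, 0, 0, 0]) + struct.pack("<q", 12345)

ser_ver = check_ser_ver(toy_image, supported={1, 2, 3})
payload = read_le_int64(toy_image, 8)
```

   Because the image is just bytes with a fixed endianness, the same two reads can be written in any of the supported languages without knowing which language produced the image.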



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

