leerho commented on code in PR #163: URL: https://github.com/apache/datasketches-website/pull/163#discussion_r1510086714
##########
docs/Architecture/LargeScale.md:
##########
@@ -21,20 +21,47 @@ layout: doc_page
 -->
 ## Designed for Large-scale Computing Systems
+#### Multiple Languages
+
+* The DataSketches library is now available in three languages: Java, C++, and Python. A fourth language, GoLang, is in development.
+
+
+### Compatibility Across Languages, Software Versions And Binary Serialization Versions
+Large-scale computing environments may have a mix of platforms that use different programming languages, each possibly using a different Software Version of our DataSketches library. Cross-version compatibility of software is a challenge that all platforms face, and it is up to the platform maintainers to keep their software up-to-date. This is not new, and it is no different with the DataSketches library.
+
+Nonetheless, it is our goal to make it as easy as practically possible to serialize our sketches in one of our supported languages on one platform and to deserialize them in a different supported language, potentially on a different, even remote platform, and perhaps much later in time.
+
+With this goal in mind, here are some of the key strategic decisions we have made in the development of the DataSketches library.
+
+#### Two levels of versioning.
+
+* **Software Version:** This is the release version, published via Apache.org and specified in the POM file or equivalent. This can change relatively frequently based on bug fixes and introduction of new capabilities. We follow the principles of *Semantic Versioning* as specified by [semver.org](https://semver.org).
+
+* **Serialization Version:** (*SerVer*) This is a small integer placed in the preamble of the serialized byte array that indicates the version of the serialized structure for the sketch. This is very similar to Java's [*Class File Format Version*](https://en.wikipedia.org/wiki/Java_class_file).
A single *SerVer* may represent multiple structures all based on the same sketch when stored in different states (e.g., *Single Item*, *Compact*, *Updatable*, etc.). This *SerVer* changes very rarely, if at all. Of all of our sketches, only a few, e.g., Theta, KLL and Sampling, have had more than one *SerVer* over time. There are and will be many *Software Versions* of the same sketch that still use the same *SerVer*. When we have to update the *SerVer*, the *Software Version* that introduces the new *SerVer* also provides the ability to read the old *SerVer* and convert it to the new one. This is why our newest *Software Versions* can still read and interpret older *SerVer* serialized sketches that go back to when our project was started at Yahoo (2012), and before we went open-source (2015). Technically speaking, this can be described as *Backward-Transitive* compatibility (see [Schema Evolution and Compatibility](https://docs.confluent.io/platform/current/schema-registry/fundamentals/schema-evolution.html) and [Schema Evolution](https://en.wikipedia.org/wiki/Schema_evolution)).
+
+From the user's perspective, as long as the *SerVer* is the same, older *Software Versions* should be able to read sketch images created by newer *Software Versions*. The APIs may differ, of course. An older *Software Version* will not be able to take advantage of new features introduced in new *Software Versions*, but it should be able to do what it did before. In other words, there will be no loss of access to the serialized sketch and the older *Software Version* capabilities. A user should not need to access the *SerVer*; nonetheless, it is always stored at index one of the serialized image. If a sketch is presented with a *SerVer* that it is not compatible with, the sketch should throw an exception and say what the problem is, just like Java does with its *Class File Format Versions*.
+
+#### The Serialized Image of a Sketch
+* The structure (or image) of a serialized sketch is independent of the language from which it was created.
+* The sketch image only contains little-endian primitives, such as int64, int32, int16, int8, double-64, float-32, UTF-8 strings, and simple array structures of those, which can be easily interpreted in many languages on modern CPUs. We do not support big-endian serialization.
+* The sketch image is unique for each type of sketch.
+* Simply speaking, a sketch image can be viewed as a blob of bytes, which is easily stored and easily transported using many different protocols, including Protobuf, Avro, Thrift, Byte64, etc.
+

Review Comment:
Excellent point.
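
For illustration only, the SerVer check described in the diff could be sketched roughly as follows. This is not the DataSketches API: the function names, the supported-version tuple, and the 8-byte demo layout are hypothetical. The only details taken from the text are that the *SerVer* sits at index one of the image and that the image holds little-endian primitives.

```python
import struct

SER_VER_INDEX = 1  # from the text: SerVer is stored at index one of the image


def read_ser_ver(image):
    """Return the SerVer byte of a serialized image (hypothetical helper)."""
    if len(image) <= SER_VER_INDEX:
        raise ValueError("image too short to contain a SerVer byte")
    return image[SER_VER_INDEX]


def check_ser_ver(image, supported=(3,)):
    """Raise a descriptive error if the image's SerVer is not supported.

    The default supported tuple is an invented value for this sketch.
    """
    ser_ver = read_ser_ver(image)
    if ser_ver not in supported:
        raise ValueError(
            f"unsupported SerVer {ser_ver}; this reader supports {sorted(supported)}"
        )
    return ser_ver


# Invented 8-byte image, NOT a real sketch layout:
#   byte 0     placeholder preamble field
#   byte 1     SerVer
#   bytes 2-3  unused uint16
#   bytes 4-7  little-endian int32 payload
demo_image = struct.pack("<BBHi", 1, 3, 0, 42)

ser_ver = check_ser_ver(demo_image)
(payload,) = struct.unpack_from("<i", demo_image, 4)  # "<" forces little-endian
```

Because the image is just bytes with a fixed endianness, the same two reads would look essentially the same in Java or C++, which is the cross-language point the diff makes.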
