Re: [PR] feat(spec): standardizing fury cross-language serialization specification [incubator-fury]

via GitHub Tue, 26 Mar 2024 04:54:14 -0700


LiangliangSui commented on code in PR #1413:
URL: https://github.com/apache/incubator-fury/pull/1413#discussion_r1539074199



##########
docs/protocols/xlang_object_graph_spec.md:
##########
@@ -0,0 +1,671 @@
+# Cross language object graph serialization
+
+Fury xlang serialization is an automatic object serialization framework that 
supports reference and polymorphism.
+Fury will convert an object from/to fury xlang serialization binary format.
+Fury has two core concepts for xlang serialization:
+
+- **Fury xlang binary format**
+- **Framework implemented in different languages to convert object to/from 
Fury xlang binary format**
+
+The serialization format is a dynamic binary format. The dynamics and 
reference/polymorphism support make Fury flexible,
+much more easy to use, but
+also introduce more complexities compared to static serialization frameworks. 
So the format will be more complex.
+
+## Type Systems
+
+### Data Types
+
+- bool: a boolean value (true or false).
+- int8: a 8-bit signed integer.
+- int16: a 16-bit signed integer.
+- int32: a 32-bit signed integer.
+- var_int32: a 32-bit signed integer which use fury var_int32 encoding.
+- fixed_int32: a 32-bit signed integer which use two's complement encoding.
+- int64: a 64-bit signed integer.
+- var_int64: a 64-bit signed integer which use fury PVL encoding.
+- sli_int64: a 64-bit signed integer which use fury SLI encoding.
+- fixed_int64: a 64-bit signed integer which use two's complement encoding.
+- float16: a 16-bit floating point number.
+- float32: a 32-bit floating point number.
+- float64: a 64-bit floating point number including NaN and Infinity.
+- string: a text string encoded using Latin1/UTF16/UTF-8 encoding.
+- enum: a data type consisting of a set of named values. Rust enum with 
non-predefined field values are not supported as
+  an enum.
+- list: a sequence of objects.
+- set: an unordered set of unique elements.
+- map: a map of key-value pairs.
+- time types:
+    - duration: an absolute length of time, independent of any 
calendar/timezone, as a count of nanoseconds.
+    - timestamp: a point in time, independent of any calendar/timezone, as a 
count of nanoseconds. The count is relative
+      to an epoch at UTC midnight on January 1, 1970.
+- decimal: exact decimal value represented as an integer value in two's 
complement.
+- binary: an variable-length array of bytes.
+- array type: only allow numeric components. Other arrays will be taken as 
List. The implementation should support the
+  interoperability between array and list.
+    - array: multidimensional array which every sub-array can have different 
sizes but all have same type.
+    - bool_array: one dimensional int16 array.
+    - int16_array: one dimensional int16 array.
+    - int32_array: one dimensional int32 array.
+    - int64_array: one dimensional int64 array.
+    - float16_array: one dimensional half_float_16 array.
+    - float32_array: one dimensional float32 array.
+    - float64_array: one dimensional float64 array.
+- tensor: a multidimensional dense array of fixed-size values such as a NumPy 
ndarray.
+- sparse tensor: a multidimensional array whose elements are almost all zeros.
+- arrow record batch: an arrow [record 
batch](https://arrow.apache.org/docs/cpp/tables.html#record-batches) object.
+- arrow table: an arrow 
[table](https://arrow.apache.org/docs/cpp/tables.html#tables) object.
+
+Note:
+
+- Unsigned int/long are not added here, since not every language support those 
types.
+
+### Type disambiguation
+
+Due to differences between type systems of languages, those types can't be 
mapped one-to-one between languages. When
+deserializing, Fury use the target data structure type and the data type in 
the data jointly to determine how to
+deserialize and populate the target data structure. For example:
+
+```java
+class Foo {
+  int[] intArray;
+  Object[] objects;
+  List<Object> objectList;
+}
+
+class Foo2 {
+  int[] intArray;
+  List<Object> objects;
+  List<Object> objectList;
+}
+```
+
+`intArray` has an `int32_array` type. But both `objects` and `objectList` 
fields in the serialize data have `list` data
+type. When deserializing, the implementation will create an `Object` array for 
`objects`, but create a `ArrayList`
+for `objectList` to populate its elements. And the serialized data of `Foo` 
can be deserialized into `Foo2` too.
+
+Users can also provide meta hints for fields of a type, or the type whole. 
Here is an example in java which use
+annotation to provide such information.
+
+```java
+
+@TypeInfo(fieldsNullable = false, trackingRef = false, polymorphic = false)
+class Foo {
+  @FieldInfo(trackingRef = false)
+  int[] intArray;
+  @FieldInfo(polymorphic = true)
+  Object object;
+  @FieldInfo(tagId = 1, nullable = true)
+  List<Object> objectList;
+}
+```
+
+Such information can be provided in other languages too:
+
+- cpp: use macro and template.
+- golang: use struct tag.
+- python: use typehint.
+- rust: use macro.
+
+### Type ID
+
+All internal data types are expressed using an ID in range `-64~-1`. Users can 
use `0~32703` for representing their
+types. At runtime, all type ids are added by `64`, and then encoded as an 
unsigned varint.
+
+### Type mapping
+
+See [Type mapping](../guide/xlang_type_mapping.md)
+
+## Spec overview
+
+Here is the overall format:
+
+```
+| fury header | object ref meta | object type meta | object value data |
+```
+
+The data are serialized using little endian byte order overall. If bytes swap 
is costly for some object,
+Fury will write the byte order for that object into the data instead of 
converting it to little endian.
+
+## Fury header
+
+Fury header consists starts one byte:
+
+```
+|     4 bits    | 1 bit | 1 bit | 1 bit  | 1 bit |          optional 4 bytes   
       |
++---------------+-------+-------+--------+-------+------------------------------------+
+| reserved bits |  oob  | xlang | endian | null  | unsigned int for meta start 
offset |
+```
+
+- null flag: 1 when object is null, 0 otherwise. If an object is null, other 
bits won't be set.
+- endian flag: 1 when data is encoded by little endian, 0 for big endian.
+- xlang flag: 1 when serialization uses xlang format, 0 when serialization 
uses Fury java format.
+- oob flag: 1 when passed `BufferCallback` is not null, 0 otherwise.
+
+If meta share mode is enabled, an uncompressed unsigned int is appended to 
indicate the start offset of metadata.
+
+## Reference Meta
+
+Reference tracking handles whether the object is null, and whether to track 
reference for the object by writing
+corresponding flags and maintaining internal state.
+
+Reference flags:
+
+| Flag                | Byte Value | Description                               
                                                                                
                              |
+|---------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+| NULL FLAG           | `-3`       | This flag indicates the object is a null 
value. We don't use another byte to indicate REF, so that we can save one byte. 
                               |
+| REF FLAG            | `-2`       | This flag indicates the object is already 
serialized previously, and fury will write a ref id with unsigned varint format 
instead of serialize it again |
+| NOT_NULL VALUE FLAG | `-1`       | This flag indicates the object is a 
non-null value and fury doesn't track ref for this type of object.              
                                    |
+| REF VALUE FLAG      | `0`        | This flag indicates the object is 
referencable and the first time to serialize.                                   
                                      |
+
+When reference tracking is disabled globally or for specific types, or for 
certain types within a particular
+context(e.g., a field of a type), only the `NULL` and `NOT_NULL VALUE` flags 
will be used for reference meta.
+
+For languages which doesn't support reference such as rust, reference tracking 
must be disabled for correct
+deserialization by fury rust implementation.
+
+For languages whose object values are not null by default:
+
+- In rust, Fury takes `Option:None` as a null value
+- In c++, Fury takes `std::nullopt` as a null value
+- In golang, Fury takes `null interface/pointer` as a null value
+
+If one want to deserialize in languages like `Java/Python/JavaScript`, he 
should mark the type with all fields
+not-null by default, or using schema-evolution mode to carry the not-null 
fields info in the data.
+
+## Type Meta
+
+For every type to be serialized, it must be registered with an optional ID 
first. The registered type will have a
+user-provided or an auto-growing unsigned int i.e. `type_id`. The registration 
can be used for security check and type
+identification. The id of user registered type will be added by `64` to make 
space for Fury internal data types.
+
+Depending on whether meta share mode and registration is enabled for current 
type, Fury will write type meta
+differently.
+
+### Schema consistent
+
+If schema consistent mode is enabled globally or enabled for current type, 
type meta will be written as a fury unsigned
+varint of `type_id`.
+
+### Schema evolution
+
+If schema evolution mode is enabled globally or enabled for current type, type 
meta will be written as follows:
+
+- If meta share mode is not enabled:
+    - If type meta hasn't been written before, the data will be written as:
+      ```
+      | unsigned varint: 0b11111111 | type def |
+      ```
+    - If type meta has been written before, the data will be written as:
+      ```
+      | unsigned varint: written index << 1 | type def |
+      ```
+- If meta share mode is enabled:
+    - If type meta hasn't been written before, add `type def` to 
`MetaContext`, then get
+      index of the meta in `MetaContext`, write that index as an unsigned 
varint.
+    - The type meta in `MetaContext` will be carried into the serialized data 
when the type is first serialized in a
+      batch. That what's `optional 4 bytes` used for. It records the start 
offset of serialized type defs which not in
+      provided `MetaContext` in current object graph serialization. Type defs
+      ```python
+      buffer.write_var_uint32(len(writting_type_metas))
+      for type_meta in writting_type_metas:
+          type_meta.write_type_def(buffer)
+      ```
+
+#### Type Def
+
+> This mode will forbid streaming writing since it needs to look back for 
update the start offset after the whole object
+> graph
+> writing and meta collecting is finished. Only in this way we can ensure 
deserialization failure doesn't lost shared
+> meta.
+> Meta streamline will be supported in the future for enclosed meta sharing 
which doesn't cross multiple serializations
+> of different objects.
+
+Here we mainly describe the meta layout for schema evolution mode:
+
+```
+|      8 bytes meta header      |   variable bytes   |  variable bytes   | 
variable bytes |
++-------------------------------+--------------------+-------------------+----------------+
+| 7 bytes hash + 1 bytes header |  current type meta |  parent type meta |     
 ...       |
+```
+
+Type meta are encoded from parent type to leaf type, only type with 
serializable fields will be encoded.
+
+##### Meta header
+
+Meta header is a 64 bits number value encoded in little endian order.
+
+- Lowest 4 digits `0b0000~0b1110` are used to record num classes. `0b1111` is 
preserved to indicate that Fury need to
+  read more bytes for length using Fury unsigned int encoding. If current type 
doesn't has parent type, or parent
+  type doesn't have fields to serialize, or we're in a context which serialize 
fields of current type
+  only, num classes will be 1.
+- The 5th bit is used to indicate whether this type needs schema evolution.
+- Other 56 bits are used to store the unique hash of `flags + all layers type 
meta`.
+
+##### Single layer type meta
+
+```
+| unsigned varint | var uint |  field info: variable bytes   | variable bytes  
| ... |
++-----------------+----------+-------------------------------+-----------------+-----+
+|   num_fields    | type id  | header + type id + field name | next field info 
| ... |
+```
+
+- num fields: encode `num fields` as unsigned varint.
+    - If the current type is schema consistent, then num_fields will be `0` to 
flag it.
+    - If the current type isn't schema consistent, then num_fields will be the 
number of compatible fields. For example,
+      users can use tag id to mark some fields as compatible fields in schema 
consistent context. In such cases, schema
+      consistent fields will be serialized first, then compatible fields will 
be serialized next. At deserialization,
+      Fury will use fields info of those fields which aren't annotated by tag 
id for deserializing schema consistent
+      fields, then use fields info in meta for deserializing compatible fields.
+- type id: the registered id for the current type, which will be written as an 
unsigned varint.
+- field info:
+    - Header(8 bits):
+        - Format:
+            - `reserved 1 bit + 3 bits field name encoding + polymorphism flag 
+ nullability flag + ref tracking flag + tag id flag`.
+        - Users can use annotation to provide that info.
+            - tag id: when set to 1, the field name will be written by an 
unsigned varint tag id.
+            - ref tracking: when set to 0, ref tracking will be disabled for 
this field.
+            - nullability: when set to 0, this field won't be null.
+            - polymorphism: when set to 1, the actual type of field will be 
the declared field type even the type if
+              not `final`.
+            - 3 bits field name encoding will be set to meta string encoding 
flags when tag id is not set.
+    - Type id:
+        - For registered type-consistent classes, it will be the registered 
type id.
+        - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and 
`FINAL_OBJECT_ID` if it's `final`. The
+          meta for such types is written separately instead of inlining here 
is to reduce meta space cost if object of
+          this type is serialized in the current object graph multiple times, 
and the field value may be null too.
+    - List Type Info: this type will have an extra byte for elements info.
+      Users can use annotation to provide that info.
+        - elements type same
+        - elements tracking ref
+        - elements nullability
+        - elements declared type
+    - Map Type Info: this type will have an extra byte for kv items info.
+      Users can use annotation to provide that info.
+        - keys type same
+        - keys tracking ref
+        - keys nullability
+        - keys declared type
+        - values type same
+        - values tracking ref
+        - values nullability
+        - values declared type
+    - Field name: If tag id is set, tag id will be used instead. Otherwise 
meta string encoding length and data will
+      be written instead.
+
+Field order are left as implementation details, which is not exposed to 
specification, the deserialization need to
+resort fields based on Fury field comparator. In this way, fury can compute 
statistics for field names or types and
+using a more compact encoding.
+
+##### Other layers type meta
+
+Same encoding algorithm as the previous layer.
+
+## Meta String
+
+Meta string is mainly used to encode meta strings such as field names.
+
+### Encoding Algorithms
+
+String binary encoding algorithm:
+
+| Algorithm                 | Pattern        | Description                     
                                                                                
                                 |
+|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 
bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                      
                                   |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 
bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: 
`0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
+| UTF-8                     | any chars      | UTF-8 encoding                  
                                                                                
                                 |
+
+Encoding flags:
+
+| Encoding Flag             | Pattern                                          
         | Encoding Algorithm                                                   
                                                               |
+|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | every char is in `a-z._$\|`                      
         | `LOWER_SPECIAL`                                                      
                                                               |
+| REP_FIRST_LOWER_SPECIAL   | every char is in `a-z._$` except first char is 
upper case | replace first upper case char to lower case, then use 
`LOWER_SPECIAL`                                                               |
+| REP_MUL_LOWER_SPECIAL     | every char is in `a-zA-Z._$`                     
         | replace every upper case char by `\|` + `lower case`, then use 
`LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
+| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$`                     
         | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than 
Encoding `2`                                                          |
+| UTF8                      | any utf-8 char                                   
         | use `UTF-8` encoding                                                 
                                                               |
+| Compression               | any utf-8 char                                   
         | lossless compression                                                 
                                                               |
+
+Depending on cases, one can choose encoding `flags + data` jointly, uses 3 
bits of first byte for flags and other bytes
+for data.
+
+## Value Format
+
+### Basic types
+
+#### bool
+
+- size: 1 byte
+- format: 0 for `false`, 1 for `true`
+
+#### int8
+
+- size: 1 byte
+- format: write as pure byte.
+
+#### int16
+
+- size: 2 byte
+- byte order: little endian order
+
+#### unsigned int32
+
+- size: 1~5 byte
+- Format: The most significant bit (MSB) in every byte indicates whether to 
have the next byte. If first bit is set
+  i.e. `b & 0x80 == 0x80`, then
+  the next byte should be read until the first bit of the next byte is unset.
+
+#### signed int32
+
+- size: 1~5 byte
+- Format: First convert the number into positive unsigned int by `(v << 1) ^ 
(v >> 31)` ZigZag algorithm, then encode
+  it as an unsigned varint.
+
+#### unsigned int64
+
+- size: 1~9 byte
+- Fury SLI(Small long as int) Encoding:
+    - If long is in `[0, 2147483647]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+- Fury PVL(Progressive Variable-length Long) Encoding:
+    - positive long format: first bit in every byte indicates whether to have 
the next byte. If first bit is set
+      i.e. `b & 0x80 == 0x80`, then the next byte should be read until the 
first bit is unset.
+
+#### signed int64
+
+- size: 1~9 byte
+- Fury SLI(Small long as int) Encoding:
+    - If long is in `[-1073741824, 1073741823]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+- Fury PVL(Progressive Variable-length Long) Encoding:
+    - First convert the number into positive unsigned long by `(v << 1) ^ (v 
>> 63)` ZigZag algorithm to reduce cost of
+      small negative numbers, then encoding it as an unsigned long.
+
+#### float32
+
+- size: 4 byte
+- format: encode the specified floating-point value according to the IEEE 754 
floating-point "single format" bit layout,
+  preserving Not-a-Number (NaN) values, then write as binary by little endian 
order.
+
+#### float64
+
+- size: 8 byte
+- format: encode the specified floating-point value according to the IEEE 754 
floating-point "double format" bit layout,
+  preserving Not-a-Number (NaN) values. then write as binary by little endian 
order.
+
+### string
+
+Format:
+
+```
+| header: size << 2 | 2 bits encoding flags | binary data |
+```
+
+- `size + encoding` will be concat as a long and encoded as an unsigned var 
long. The little 2 bits is used for
+  encoding:
+  0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
+- encoded string binary data based on encoding: `latin/utf-16/utf-8`.
+
+Which encoding to choose:
+
+- For JDK8: fury detect `latin` at runtime, if string is `latin` string, then 
use `latin` encoding, otherwise
+  use `utf-16`.
+- For JDK9+: fury use `coder` in `String` object for encoding, 
`latin`/`utf-16` will be used for encoding.
+- If the string is encoded by `utf-8`, then fury will use `utf-8` to decode 
the data. But currently fury doesn't enable
+  utf-8 encoding by default for java. Cross-language string serialization of 
fury uses `utf-8` by default.
+
+### list
+
+Format:
+
+```
+length(unsigned varint) | elements header | elements data
+```
+
+#### elements header
+
+In most cases, all elements are same type and not null, elements header will 
encode those homogeneous
+information to avoid the cost of writing it for every element. Specifically, 
there are four kinds of information
+which will be encoded by elements header, each use one bit:
+
+- If track elements ref, use the first bit `0b1` of the header to flag it.
+- If the elements have null, use the second bit `0b10` of the header to flag 
it. If ref tracking is enabled for this
+  element type, this flag is invalid.
+- If the element types are not the declared type, use the 3rd bit `0b100` of 
the header to flag it.
+- If the element types are different, use the 4rd bit `0b1000` header to flag 
it.
+
+By default, all bits are unset, which means all elements won't track ref, all 
elements are same type, not null and
+the actual element is the declared type in the custom type field.
+
+#### elements data
+
+Based on the elements header, the serialization of elements data may skip `ref 
flag`/`null flag`/`element type info`.
+
+`CollectionSerializer#write/read` can be taken as an example.

Review Comment:
   Could we add a hyperlink here?



##########
docs/protocols/xlang_object_graph_spec.md:
##########
@@ -0,0 +1,671 @@
+# Cross language object graph serialization
+
+Fury xlang serialization is an automatic object serialization framework that 
supports reference and polymorphism.
+Fury will convert an object from/to fury xlang serialization binary format.
+Fury has two core concepts for xlang serialization:
+
+- **Fury xlang binary format**
+- **Framework implemented in different languages to convert object to/from 
Fury xlang binary format**
+
+The serialization format is a dynamic binary format. The dynamics and 
reference/polymorphism support make Fury flexible,
+much more easy to use, but
+also introduce more complexities compared to static serialization frameworks. 
So the format will be more complex.
+
+## Type Systems
+
+### Data Types
+
+- bool: a boolean value (true or false).
+- int8: a 8-bit signed integer.
+- int16: a 16-bit signed integer.
+- int32: a 32-bit signed integer.
+- var_int32: a 32-bit signed integer which use fury var_int32 encoding.
+- fixed_int32: a 32-bit signed integer which use two's complement encoding.
+- int64: a 64-bit signed integer.
+- var_int64: a 64-bit signed integer which use fury PVL encoding.
+- sli_int64: a 64-bit signed integer which use fury SLI encoding.
+- fixed_int64: a 64-bit signed integer which use two's complement encoding.
+- float16: a 16-bit floating point number.
+- float32: a 32-bit floating point number.
+- float64: a 64-bit floating point number including NaN and Infinity.
+- string: a text string encoded using Latin1/UTF16/UTF-8 encoding.
+- enum: a data type consisting of a set of named values. Rust enum with 
non-predefined field values are not supported as
+  an enum.
+- list: a sequence of objects.
+- set: an unordered set of unique elements.
+- map: a map of key-value pairs.
+- time types:
+    - duration: an absolute length of time, independent of any 
calendar/timezone, as a count of nanoseconds.
+    - timestamp: a point in time, independent of any calendar/timezone, as a 
count of nanoseconds. The count is relative
+      to an epoch at UTC midnight on January 1, 1970.
+- decimal: exact decimal value represented as an integer value in two's 
complement.
+- binary: an variable-length array of bytes.
+- array type: only allow numeric components. Other arrays will be taken as 
List. The implementation should support the
+  interoperability between array and list.
+    - array: multidimensional array which every sub-array can have different 
sizes but all have same type.
+    - bool_array: one dimensional int16 array.
+    - int16_array: one dimensional int16 array.
+    - int32_array: one dimensional int32 array.
+    - int64_array: one dimensional int64 array.
+    - float16_array: one dimensional half_float_16 array.
+    - float32_array: one dimensional float32 array.
+    - float64_array: one dimensional float64 array.
+- tensor: a multidimensional dense array of fixed-size values such as a NumPy 
ndarray.
+- sparse tensor: a multidimensional array whose elements are almost all zeros.
+- arrow record batch: an arrow [record 
batch](https://arrow.apache.org/docs/cpp/tables.html#record-batches) object.
+- arrow table: an arrow 
[table](https://arrow.apache.org/docs/cpp/tables.html#tables) object.
+
+Note:
+
+- Unsigned int/long are not added here, since not every language support those 
types.
+
+### Type disambiguation
+
+Due to differences between type systems of languages, those types can't be 
mapped one-to-one between languages. When
+deserializing, Fury use the target data structure type and the data type in 
the data jointly to determine how to
+deserialize and populate the target data structure. For example:
+
+```java
+class Foo {
+  int[] intArray;
+  Object[] objects;
+  List<Object> objectList;
+}
+
+class Foo2 {
+  int[] intArray;
+  List<Object> objects;
+  List<Object> objectList;
+}
+```
+
+`intArray` has an `int32_array` type. But both `objects` and `objectList` 
fields in the serialize data have `list` data
+type. When deserializing, the implementation will create an `Object` array for 
`objects`, but create a `ArrayList`
+for `objectList` to populate its elements. And the serialized data of `Foo` 
can be deserialized into `Foo2` too.
+
+Users can also provide meta hints for fields of a type, or the type whole. 
Here is an example in java which use
+annotation to provide such information.
+
+```java
+
+@TypeInfo(fieldsNullable = false, trackingRef = false, polymorphic = false)
+class Foo {
+  @FieldInfo(trackingRef = false)
+  int[] intArray;
+  @FieldInfo(polymorphic = true)
+  Object object;
+  @FieldInfo(tagId = 1, nullable = true)
+  List<Object> objectList;
+}
+```
+
+Such information can be provided in other languages too:
+
+- cpp: use macro and template.
+- golang: use struct tag.
+- python: use typehint.
+- rust: use macro.
+
+### Type ID
+
+All internal data types are expressed using an ID in range `-64~-1`. Users can 
use `0~32703` for representing their
+types. At runtime, all type ids are added by `64`, and then encoded as an 
unsigned varint.
+
+### Type mapping
+
+See [Type mapping](../guide/xlang_type_mapping.md)
+
+## Spec overview
+
+Here is the overall format:
+
+```
+| fury header | object ref meta | object type meta | object value data |
+```
+
+The data are serialized using little endian byte order overall. If bytes swap 
is costly for some object,
+Fury will write the byte order for that object into the data instead of 
converting it to little endian.
+
+## Fury header
+
+Fury header consists starts one byte:
+
+```
+|     4 bits    | 1 bit | 1 bit | 1 bit  | 1 bit |          optional 4 bytes   
       |
++---------------+-------+-------+--------+-------+------------------------------------+
+| reserved bits |  oob  | xlang | endian | null  | unsigned int for meta start 
offset |
+```
+
+- null flag: 1 when object is null, 0 otherwise. If an object is null, other 
bits won't be set.
+- endian flag: 1 when data is encoded by little endian, 0 for big endian.
+- xlang flag: 1 when serialization uses xlang format, 0 when serialization 
uses Fury java format.
+- oob flag: 1 when passed `BufferCallback` is not null, 0 otherwise.
+
+If meta share mode is enabled, an uncompressed unsigned int is appended to 
indicate the start offset of metadata.
+
+## Reference Meta
+
+Reference tracking handles whether the object is null, and whether to track 
reference for the object by writing
+corresponding flags and maintaining internal state.
+
+Reference flags:
+
+| Flag                | Byte Value | Description                               
                                                                                
                              |
+|---------------------|------------|---------------------------------------------------------------------------------------------------------------------------------------------------------|
+| NULL FLAG           | `-3`       | This flag indicates the object is a null 
value. We don't use another byte to indicate REF, so that we can save one byte. 
                               |
+| REF FLAG            | `-2`       | This flag indicates the object is already 
serialized previously, and fury will write a ref id with unsigned varint format 
instead of serialize it again |
+| NOT_NULL VALUE FLAG | `-1`       | This flag indicates the object is a 
non-null value and fury doesn't track ref for this type of object.              
                                    |
+| REF VALUE FLAG      | `0`        | This flag indicates the object is 
referencable and the first time to serialize.                                   
                                      |
+
+When reference tracking is disabled globally or for specific types, or for 
certain types within a particular
+context(e.g., a field of a type), only the `NULL` and `NOT_NULL VALUE` flags 
will be used for reference meta.
+
+For languages which doesn't support reference such as rust, reference tracking 
must be disabled for correct
+deserialization by fury rust implementation.
+
+For languages whose object values are not null by default:
+
+- In rust, Fury takes `Option:None` as a null value
+- In c++, Fury takes `std::nullopt` as a null value
+- In golang, Fury takes `null interface/pointer` as a null value
+
+If one want to deserialize in languages like `Java/Python/JavaScript`, he 
should mark the type with all fields
+not-null by default, or using schema-evolution mode to carry the not-null 
fields info in the data.
+
+## Type Meta
+
+For every type to be serialized, it must be registered with an optional ID 
first. The registered type will have a
+user-provided or an auto-growing unsigned int i.e. `type_id`. The registration 
can be used for security check and type
+identification. The id of user registered type will be added by `64` to make 
space for Fury internal data types.
+
+Depending on whether meta share mode and registration is enabled for current 
type, Fury will write type meta
+differently.
+
+### Schema consistent
+
+If schema consistent mode is enabled globally or enabled for current type, 
type meta will be written as a fury unsigned
+varint of `type_id`.
+
+### Schema evolution
+
+If schema evolution mode is enabled globally or enabled for current type, type 
meta will be written as follows:
+
+- If meta share mode is not enabled:
+    - If type meta hasn't been written before, the data will be written as:
+      ```
+      | unsigned varint: 0b11111111 | type def |
+      ```
+    - If type meta has been written before, the data will be written as:
+      ```
+      | unsigned varint: written index << 1 | type def |
+      ```
+- If meta share mode is enabled:
+    - If type meta hasn't been written before, add `type def` to 
`MetaContext`, then get
+      index of the meta in `MetaContext`, write that index as an unsigned 
varint.
+    - The type meta in `MetaContext` will be carried into the serialized data 
when the type is first serialized in a
+      batch. That what's `optional 4 bytes` used for. It records the start 
offset of serialized type defs which not in
+      provided `MetaContext` in current object graph serialization. Type defs
+      ```python
+      buffer.write_var_uint32(len(writting_type_metas))
+      for type_meta in writting_type_metas:
+          type_meta.write_type_def(buffer)
+      ```
+
+#### Type Def
+
+> This mode will forbid streaming writing since it needs to look back for 
update the start offset after the whole object
+> graph
+> writing and meta collecting is finished. Only in this way we can ensure 
deserialization failure doesn't lost shared
+> meta.
+> Meta streamline will be supported in the future for enclosed meta sharing 
which doesn't cross multiple serializations
+> of different objects.
+
+Here we mainly describe the meta layout for schema evolution mode:
+
+```
+|      8 bytes meta header      |   variable bytes   |  variable bytes   | 
variable bytes |
++-------------------------------+--------------------+-------------------+----------------+
+| 7 bytes hash + 1 bytes header |  current type meta |  parent type meta |     
 ...       |
+```
+
+Type meta are encoded from parent type to leaf type, only type with 
serializable fields will be encoded.
+
+##### Meta header
+
+Meta header is a 64 bits number value encoded in little endian order.
+
+- Lowest 4 digits `0b0000~0b1110` are used to record num classes. `0b1111` is 
preserved to indicate that Fury need to
+  read more bytes for length using Fury unsigned int encoding. If current type 
doesn't has parent type, or parent
+  type doesn't have fields to serialize, or we're in a context which serialize 
fields of current type
+  only, num classes will be 1.
+- The 5th bit is used to indicate whether this type needs schema evolution.
+- Other 56 bits are used to store the unique hash of `flags + all layers type 
meta`.
+
+##### Single layer type meta
+
+```
+| unsigned varint | var uint |  field info: variable bytes   | variable bytes  
| ... |
++-----------------+----------+-------------------------------+-----------------+-----+
+|   num_fields    | type id  | header + type id + field name | next field info 
| ... |
+```
+
+- num fields: encode `num fields` as unsigned varint.
+    - If the current type is schema consistent, then num_fields will be `0` to 
flag it.
+    - If the current type isn't schema consistent, then num_fields will be the 
number of compatible fields. For example,
+      users can use tag id to mark some fields as compatible fields in schema 
consistent context. In such cases, schema
+      consistent fields will be serialized first, then compatible fields will 
be serialized next. At deserialization,
+      Fury will use fields info of those fields which aren't annotated by tag 
id for deserializing schema consistent
+      fields, then use fields info in meta for deserializing compatible fields.
+- type id: the registered id for the current type, which will be written as an 
unsigned varint.
+- field info:
+    - Header(8 bits):
+        - Format:
+            - `reserved 1 bit + 3 bits field name encoding + polymorphism flag 
+ nullability flag + ref tracking flag + tag id flag`.
+        - Users can use annotation to provide that info.
+            - tag id: when set to 1, the field name will be written by an 
unsigned varint tag id.
+            - ref tracking: when set to 0, ref tracking will be disabled for 
this field.
+            - nullability: when set to 0, this field won't be null.
+            - polymorphism: when set to 1, the actual type of field will be 
the declared field type even the type if
+              not `final`.
+            - 3 bits field name encoding will be set to meta string encoding 
flags when tag id is not set.
+    - Type id:
+        - For registered type-consistent classes, it will be the registered 
type id.
+        - Otherwise it will be encoded as `OBJECT_ID` if it isn't `final` and 
`FINAL_OBJECT_ID` if it's `final`. The
+          meta for such types is written separately instead of inlining here 
is to reduce meta space cost if object of
+          this type is serialized in the current object graph multiple times, 
and the field value may be null too.
+    - List Type Info: this type will have an extra byte for elements info.
+      Users can use annotation to provide that info.
+        - elements type same
+        - elements tracking ref
+        - elements nullability
+        - elements declared type
+    - Map Type Info: this type will have an extra byte for kv items info.
+      Users can use annotation to provide that info.
+        - keys type same
+        - keys tracking ref
+        - keys nullability
+        - keys declared type
+        - values type same
+        - values tracking ref
+        - values nullability
+        - values declared type
+    - Field name: If tag id is set, tag id will be used instead. Otherwise 
meta string encoding length and data will
+      be written instead.
+
+Field order are left as implementation details, which is not exposed to 
specification, the deserialization need to
+resort fields based on Fury field comparator. In this way, fury can compute 
statistics for field names or types and
+using a more compact encoding.
+
+##### Other layers type meta
+
+Same encoding algorithm as the previous layer.
+
+## Meta String
+
+Meta string is mainly used to encode meta strings such as field names.
+
+### Encoding Algorithms
+
+String binary encoding algorithm:
+
+| Algorithm                 | Pattern        | Description                     
                                                                                
                                 |
+|---------------------------|----------------|--------------------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | `a-z._$\|`     | every char is written using 5 
bits, `a-z`: `0b00000~0b11001`, `._$\|`: `0b11010~0b11101`                      
                                   |
+| LOWER_UPPER_DIGIT_SPECIAL | `a-zA-Z0~9._$` | every char is written using 6 
bits, `a-z`: `0b00000~0b11110`, `A-Z`: `0b11010~0b110011`, `0~9`: 
`0b110100~0b111101`, `._$`: `0b111110~0b1000000` |
+| UTF-8                     | any chars      | UTF-8 encoding                  
                                                                                
                                 |
+
+Encoding flags:
+
+| Encoding Flag             | Pattern                                          
         | Encoding Algorithm                                                   
                                                               |
+|---------------------------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| LOWER_SPECIAL             | every char is in `a-z._$\|`                      
         | `LOWER_SPECIAL`                                                      
                                                               |
+| REP_FIRST_LOWER_SPECIAL   | every char is in `a-z._$` except first char is 
upper case | replace first upper case char to lower case, then use 
`LOWER_SPECIAL`                                                               |
+| REP_MUL_LOWER_SPECIAL     | every char is in `a-zA-Z._$`                     
         | replace every upper case char by `\|` + `lower case`, then use 
`LOWER_SPECIAL`, use this encoding if it's smaller than Encoding `3` |
+| LOWER_UPPER_DIGIT_SPECIAL | every char is in `a-zA-Z._$`                     
         | use `LOWER_UPPER_DIGIT_SPECIAL` encoding if it's smaller than 
Encoding `2`                                                          |
+| UTF8                      | any utf-8 char                                   
         | use `UTF-8` encoding                                                 
                                                               |
+| Compression               | any utf-8 char                                   
         | lossless compression                                                 
                                                               |
+
+Depending on cases, one can choose encoding `flags + data` jointly, uses 3 
bits of first byte for flags and other bytes
+for data.
+
+## Value Format
+
+### Basic types
+
+#### bool
+
+- size: 1 byte
+- format: 0 for `false`, 1 for `true`
+
+#### int8
+
+- size: 1 byte
+- format: write as pure byte.
+
+#### int16
+
+- size: 2 byte
+- byte order: little endian order
+
+#### unsigned int32
+
+- size: 1~5 byte
+- Format: The most significant bit (MSB) in every byte indicates whether to 
have the next byte. If first bit is set
+  i.e. `b & 0x80 == 0x80`, then
+  the next byte should be read until the first bit of the next byte is unset.
+
+#### signed int32
+
+- size: 1~5 byte
+- Format: First convert the number into positive unsigned int by `(v << 1) ^ 
(v >> 31)` ZigZag algorithm, then encode
+  it as an unsigned varint.
+
+#### unsigned int64
+
+- size: 1~9 byte
+- Fury SLI(Small long as int) Encoding:
+    - If long is in `[0, 2147483647]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+- Fury PVL(Progressive Variable-length Long) Encoding:
+    - positive long format: first bit in every byte indicates whether to have 
the next byte. If first bit is set
+      i.e. `b & 0x80 == 0x80`, then the next byte should be read until the 
first bit is unset.
+
+#### signed int64
+
+- size: 1~9 byte
+- Fury SLI(Small long as int) Encoding:
+    - If long is in `[-1073741824, 1073741823]`, encode as 4 bytes int: `| 
little-endian: ((int) value) << 1 |`
+    - Otherwise write as 9 bytes: `| 0b1 | little-endian 8 bytes long |`
+- Fury PVL(Progressive Variable-length Long) Encoding:
+    - First convert the number into positive unsigned long by `(v << 1) ^ (v 
>> 63)` ZigZag algorithm to reduce cost of
+      small negative numbers, then encoding it as an unsigned long.
+
+#### float32
+
+- size: 4 byte
+- format: encode the specified floating-point value according to the IEEE 754 
floating-point "single format" bit layout,
+  preserving Not-a-Number (NaN) values, then write as binary by little endian 
order.
+
+#### float64
+
+- size: 8 byte
+- format: encode the specified floating-point value according to the IEEE 754 
floating-point "double format" bit layout,
+  preserving Not-a-Number (NaN) values. then write as binary by little endian 
order.
+
+### string
+
+Format:
+
+```
+| header: size << 2 | 2 bits encoding flags | binary data |
+```
+
+- `size + encoding` will be concat as a long and encoded as an unsigned var 
long. The little 2 bits is used for
+  encoding:
+  0 for `latin`, 1 for `utf-16`, 2 for `utf-8`.
+- encoded string binary data based on encoding: `latin/utf-16/utf-8`.
+
+Which encoding to choose:
+
+- For JDK8: fury detect `latin` at runtime, if string is `latin` string, then 
use `latin` encoding, otherwise
+  use `utf-16`.
+- For JDK9+: fury use `coder` in `String` object for encoding, 
`latin`/`utf-16` will be used for encoding.
+- If the string is encoded by `utf-8`, then fury will use `utf-8` to decode 
the data. But currently fury doesn't enable
+  utf-8 encoding by default for java. Cross-language string serialization of 
fury uses `utf-8` by default.
+
+### list
+
+Format:
+
+```
+length(unsigned varint) | elements header | elements data
+```
+
+#### elements header
+
+In most cases, all elements are same type and not null, elements header will 
encode those homogeneous
+information to avoid the cost of writing it for every element. Specifically, 
there are four kinds of information
+which will be encoded by elements header, each use one bit:

Review Comment:
   Can `length+elements` also be concat as a long and encoded as an unsigned 
var long like string type?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@fury.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@fury.apache.org
For additional commands, e-mail: commits-h...@fury.apache.org

Re: [PR] feat(spec): standardizing fury cross-language serialization specification [incubator-fury]

Reply via email to