Re: [PR] feat(spec): standardizing fury cross-language serialization specification [incubator-fury]

via GitHub Mon, 25 Mar 2024 03:10:20 -0700


chaokunyang commented on code in PR #1413:
URL: https://github.com/apache/incubator-fury/pull/1413#discussion_r1537344486



##########
docs/protocols/xlang_object_graph_spec.md:
##########
@@ -0,0 +1,635 @@
+# Cross language object graph serialization
+
+Fury xlang serialization is an automatic object serialization framework that 
supports reference and polymorphism.
+Fury will convert an object from/to fury xlang serialization binary format.
+Fury has two core concepts for xlang serialization:
+
+- **Fury xlang binary format**
+- **Framework implemented in different languages to convert object to/from 
Fury xlang binary format**
+
+The serialization format is a dynamic binary format. The dynamics and 
reference/polymorphism support make Fury flexible
+and much easier to use, but
+also introduce more complexity compared to static serialization frameworks. 
So the format will be more complex.
+
+## Type Systems
+
+### Data Types
+
+- bool: a boolean value (true or false).
+- int8: an 8-bit signed integer.
+- int16: a 16-bit signed integer.
+- int32: a 32-bit signed integer.
+- int64: a 64-bit signed integer.
+- float16: a 16-bit floating point number.
+- float32: a 32-bit floating point number.
+- float64: a 64-bit floating point number including NaN and Infinity.
+- string: a text string encoded using Latin1/UTF16/UTF-8 encoding.
+- enum: a data type consisting of a set of named values. Rust enums with 
non-predefined field values are not supported as
+  an enum.
+- list: a sequence of objects.
+- set: an unordered set of unique elements.
+- map: a map of key-value pairs.
+- time types:
+    - duration: an absolute length of time, independent of any 
calendar/timezone, as a count of nanoseconds.
+    - timestamp: a point in time, independent of any calendar/timezone, as a 
count of nanoseconds. The count is relative
+      to an epoch at UTC midnight on January 1, 1970.
+- decimal: exact decimal value represented as an integer value in two's 
complement.
+- binary: a variable-length array of bytes.
+- array type: only allows numeric components. Other arrays will be taken as 
List. The implementation should support the
+  interoperability between array and list.
+    - array: multidimensional array in which every sub-array can have a different 
size but all have the same type.
+    - bool_array: one dimension bool array.
+    - int16_array: one dimension int16 array.
+    - int32_array: one dimension int32 array.
+    - int64_array: one dimension int64 array.
+    - float16_array: one dimension half_float_16 array.
+    - float32_array: one dimension float32 array.
+    - float64_array: one dimension float64 array.
+- tensor: a multidimensional dense array of fixed-size values such as a NumPy 
ndarray.
+- sparse tensor: a multidimensional array whose elements are almost all zeros.
+- arrow record batch: an arrow [record 
batch](https://arrow.apache.org/docs/cpp/tables.html#record-batches) object.
+- arrow table: an arrow 
[table](https://arrow.apache.org/docs/cpp/tables.html#tables) object.
+
+### Type disambiguation
+
+Due to differences between type systems of languages, those types can't be 
mapped one-to-one between languages. When
+deserializing, Fury uses the target data structure type and the data type in 
the data jointly to determine how to
+deserialize and populate the target data structure. For example:
+
+```java
+class Foo {
+  int[] intArray;
+  Object[] objects;
+  List<Object> objectList;
+}
+
+class Foo2 {
+  int[] intArray;
+  List<Object> objects;
+  List<Object> objectList;
+}
+```
+
+`intArray` has `int32_array` type. But both the `objects` and `objectList` fields 
in the serialized data have `list` data
+type. When deserializing, the implementation will create an `Object` array for 
`objects`, but create an `ArrayList`
+for `objectList` to populate its elements. And the serialized data of `Foo` 
can be deserialized into `Foo2` too.
+
+Users can also provide meta hints for fields of a type, or for the type as a whole. Here 
is an example in Java which uses
+annotations to provide such information.
+
+```java
+
+@TypeInfo(fieldsNullable = false, trackingRef = false, polymorphic = false)
+class Foo {
+  @FieldInfo(trackingRef = false)
+  int[] intArray;
+  @FieldInfo(polymorphic = true)
+  Object object;
+  @FieldInfo(tagId = 1, nullable = true)
+  List<Object> objectList;
+}
+```
+
+Such information can be provided in other languages too:
+
+- cpp: use macro and template.
+- golang: use struct tag.
+- python: use typehint.
+- rust: use macro.
+
+### Type ID
+
+All internal data types are expressed using an ID in range `-64~-1`. Users can 
use `0~32703` for representing their

Review Comment:
   If we extend it to 128, user registered types will be encoded in 2 bytes, 
which will bloat the data.
   
   I have two solutions:
   - Reserve type id range `32000~32703` for extension
   - Use `-96~-1` for internal IDs, but this only gives 32 ids for users to encode 
a type in one byte
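
   The trade-off can be sketched as follows. This is only an illustration under an assumption that the PR does not confirm: that a type id is shifted by the number of reserved internal ids and then written as an unsigned varint with 7 payload bits per byte. `TypeIdSize`, `varUintSize` and `encodedSize` are hypothetical names, not Fury APIs.

```java
class TypeIdSize {
  // Bytes needed by an unsigned varint that carries 7 payload bits per byte.
  static int varUintSize(int value) {
    int size = 1;
    while ((value >>>= 7) != 0) {
      size++;
    }
    return size;
  }

  // ASSUMPTION: ids are shifted by the count of reserved internal ids, so the
  // smallest internal id maps to 0 and user ids start right after them.
  static int encodedSize(int typeId, int internalIdCount) {
    int shifted = typeId + internalIdCount;
    if (shifted < 0) {
      throw new IllegalArgumentException("type id below the internal range");
    }
    return varUintSize(shifted);
  }

  public static void main(String[] args) {
    // 64 internal ids (-64~-1): user ids 0~63 still fit in one byte.
    System.out.println(encodedSize(63, 64));  // 1
    System.out.println(encodedSize(64, 64));  // 2
    // 96 internal ids (-96~-1): only user ids 0~31 fit in one byte.
    System.out.println(encodedSize(31, 96));  // 1
    System.out.println(encodedSize(32, 96));  // 2
    // 128 internal ids: every user id needs at least two bytes.
    System.out.println(encodedSize(0, 128));  // 2
  }
}
```

   Under this assumed layout, one byte covers 128 ids in total, so every id reserved for internal types is taken from the one-byte budget of user types.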




