[GitHub] [arrow] lidavidm commented on a change in pull request #11646: ARROW-14634: [Flatbuffers] introduction of ColumnBag

GitBox Mon, 08 Nov 2021 14:11:04 -0800


lidavidm commented on a change in pull request #11646:
URL: https://github.com/apache/arrow/pull/11646#discussion_r745128442




##########
File path: format/Message.fbs
##########
@@ -117,6 +117,40 @@ table DictionaryBatch {
   isDelta: bool = false;
 }
 
+/// A range of field nodes, identified by their offset in the schema.
+/// The offsets are zero-indexed.
+struct FieldNodeRange {
+  /// The starting offset (inclusive)
+  start: long;
+
+  /// The ending offset (exclusive)
+  end: long;
+}
+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.
+table ColumnBag {
+  /// If not provided, all field nodes are included and this payload is
+  /// identical to a RecordBatch. Otherwise the reader needs to skip
+  /// top level FieldNodes that were not included.
+  includedNodes: [FieldNodeRange];

Review comment:
       So to be clear, we can't do something like provide only a nested array - and implementations will need to validate that this only skips entire top level fields?
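
       A minimal sketch of the validation this would imply, under assumptions that are not in the PR: field nodes are flattened depth-first in pre-order (a parent followed by its children), and each `FieldNodeRange` is a `(start, end)` pair with an exclusive end. `Field`, `top_level_spans` and `validate_included` are hypothetical names for illustration, not Arrow APIs.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Field:
    """Hypothetical stand-in for a schema field; not an Arrow API."""
    name: str
    children: List["Field"] = field(default_factory=list)


def node_count(f: Field) -> int:
    # One field node for the field itself plus one per descendant.
    return 1 + sum(node_count(c) for c in f.children)


def top_level_spans(schema: List[Field]) -> List[Tuple[int, int]]:
    """Return the [start, end) field-node span of each top-level field."""
    spans, offset = [], 0
    for f in schema:
        n = node_count(f)
        spans.append((offset, offset + n))
        offset += n
    return spans


def validate_included(schema: List[Field],
                      ranges: List[Tuple[int, int]]) -> List[str]:
    """Return the names of the included top-level fields.

    Raises ValueError if a range is empty, out of order, overlapping,
    or does not start and end on a top-level field boundary.
    """
    spans = top_level_spans(schema)
    start_to_index = {s: i for i, (s, _) in enumerate(spans)}
    valid_ends = {e for _, e in spans}
    included, prev_end = [], 0
    for start, end in ranges:
        if start < prev_end or start >= end:
            raise ValueError(f"range ({start}, {end}) is empty or out of order")
        if start not in start_to_index or end not in valid_ends:
            raise ValueError(f"range ({start}, {end}) splits a top-level field")
        i = start_to_index[start]
        while i < len(spans) and spans[i][1] <= end:
            included.append(schema[i].name)
            i += 1
        prev_end = end
    return included


# Example schema: "a" is primitive, "b" is a struct with two children,
# "c" is a list with one child. Node spans are a=[0,1), b=[1,4), c=[4,6).
schema = [
    Field("a"),
    Field("b", [Field("x"), Field("y")]),
    Field("c", [Field("item")]),
]
print(validate_included(schema, [(0, 1), (4, 6)]))  # ['a', 'c']
```

       With this schema, a range such as `(2, 3)` selects only the nested node `b.x` and is rejected, which is the case the question above is about.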

##########
File path: format/Schema.fbs
##########
@@ -37,14 +37,20 @@ enum MetadataVersion:short {
   /// >= 0.8.0 (December 2017). Non-backwards compatible with V3.
   V4,
 
-  /// >= 1.0.0 (July 2020. Backwards compatible with V4 (V5 readers can read V4
+  /// 1.0.0 -> 6.0.0 (July 2020). Backwards compatible with V4 (V5 readers can read V4
   /// metadata and IPC messages). Implementations are recommended to provide a
   /// V4 compatibility mode with V5 format changes disabled.
   ///
   /// Incompatible changes between V4 and V5:
   /// - Union buffer layout has changed. In V5, Unions don't have a validity
   ///   bitmap buffer.
   V5,
+
+  /// >= 7.0.0 (Jan 2022). Backwards compatible with V4 and V5.
+  ///
+  /// Adds ColumnBag to wire format. It has looser restrictions than RecordBatch but is
+  /// otherwise similar in intent.
+  V6

Review comment:
       It doesn't look like we incremented the metadata version when we added Tensor - is this necessary? (I'm not so familiar with Flatbuffers.) I suppose if we fold the binary metadata change into this we would need this.

##########
File path: format/Message.fbs
##########
@@ -117,6 +117,40 @@ table DictionaryBatch {
   isDelta: bool = false;
 }
 
+/// A range of field nodes, identified by their offset in the schema.
+/// The offsets are zero-indexed.
+struct FieldNodeRange {
+  /// The starting offset (inclusive)
+  start: long;
+
+  /// The ending offset (exclusive)
+  end: long;
+}
+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.

Review comment:
       Thinking ahead - what do the APIs for this look like? In Java, this would be a "ragged" VectorSchemaRoot?

##########
File path: format/Message.fbs
##########
@@ -117,6 +117,40 @@ table DictionaryBatch {
   isDelta: bool = false;
 }
 
+/// A range of field nodes, identified by their offset in the schema.
+/// The offsets are zero-indexed.
+struct FieldNodeRange {
+  /// The starting offset (inclusive)
+  start: long;
+
+  /// The ending offset (exclusive)
+  end: long;
+}
+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.
+table ColumnBag {
+  /// If not provided, all field nodes are included and this payload is
+  /// identical to a RecordBatch. Otherwise the reader needs to skip
+  /// top level FieldNodes that were not included.

Review comment:
       The corresponding Buffers will also presumably be skipped?
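
       A rough sketch of what skipping the buffers could look like, assuming (this is not stated in the PR) that buffers are flattened in the same order as field nodes, so a range of nodes maps to a contiguous range of buffers. `buffer_ranges` is a hypothetical helper, and the per-node buffer counts in the example follow the usual columnar layouts (validity + data for int32, validity only for struct, validity + offsets + data for utf8, validity + offsets for list).

```python
from itertools import accumulate
from typing import List, Tuple


def buffer_ranges(buffers_per_node: List[int],
                  node_ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Map [start, end) field-node ranges to [start, end) buffer ranges.

    buffers_per_node[i] is the number of buffers owned by flattened node i.
    """
    # prefix[i] is the index of the first buffer belonging to node i.
    prefix = [0, *accumulate(buffers_per_node)]
    return [(prefix[start], prefix[end]) for start, end in node_ranges]


# Flattened nodes: a:int32, b:struct, b.x:int32, b.y:utf8, c:list, c.item:int32
counts = [2, 1, 2, 3, 2, 2]
# Including only "a" (node 0) and "c" (nodes 4-5) keeps buffers 0-1 and 8-11,
# so the six buffers belonging to "b" are the ones a reader would skip.
print(buffer_ranges(counts, [(0, 1), (4, 6)]))  # [(0, 2), (8, 12)]
```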

##########
File path: format/Message.fbs
##########
@@ -117,6 +117,40 @@ table DictionaryBatch {
   isDelta: bool = false;
 }
 
+/// A range of field nodes, identified by their offset in the schema.
+/// The offsets are zero-indexed.
+struct FieldNodeRange {
+  /// The starting offset (inclusive)
+  start: long;
+
+  /// The ending offset (exclusive)
+  end: long;
+}
+
+/// A data header describing the shared memory layout of a "bag" of "columns".
+/// It is similar to a RecordBatch but not every top level FieldNode is required
+/// to be included in the wire payload.
+table ColumnBag {
+  /// If not provided, all field nodes are included and this payload is
+  /// identical to a RecordBatch. Otherwise the reader needs to skip
+  /// top level FieldNodes that were not included.
+  includedNodes: [FieldNodeRange];

Review comment:
       Can we mark this as experimental, as Tensor does?
https://github.com/apache/arrow/blob/939db7f513a56d0cab12a8479d8153e5aa2ae1df/format/Tensor.fbs#L18-L20




