> Unless I'm misunderstanding your proposal, that doesn't deal with the data
> that has already been produced that may have been written in a way that
> this change finds non-consumable but works today.

I think what we have determined is that the changes being discussed in
this thread would not render any existing serialized Flatbuffers
unreadable, unless they are already malformed / unable to be read with
the current libraries.

Taking Antoine's example, the change would go

from

table Field {
  ...
  /// This is the type of the decoded value if the field is dictionary encoded.
  type: Type;
  ...

to

  ...
  /// This is the type of the decoded value if the field is dictionary encoded.
  type: Type (required);
  ...

Our understanding is that this causes two changes in the generated
Flatbuffers code:

* The "type" field will be null checked on write
* The "type" field will be also null checked on read

The binary encoding itself is unchanged, and it is not possible for "type"
to be null with any of the current implementations.
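
To make the write-side check concrete, here is a rough, hand-built sketch
(not actual flatc output: Arrow's "type" field is a union, so the real
generated code also carries a type tag, and exact names vary by flatc
version). FlatBufferBuilder::Required() is the public helper that generated
builders call for a required field; on the read side the generated Verify()
method uses VerifyOffsetRequired() instead of VerifyOffset(), so the
verifier rejects buffers where the field is absent. Assuming a reasonably
recent FlatBuffers C++ library:

// Illustrative only: builds a table by hand and never fills vtable slot 4
// (the slot number is made up for this example; generated builders do the
// equivalent via fbb_.Required(o, Field::VT_TYPE)).
#include "flatbuffers/flatbuffers.h"

int main() {
  flatbuffers::FlatBufferBuilder fbb;

  // Start and finish a table without writing any field.
  auto start = fbb.StartTable();
  flatbuffers::Offset<void> table(fbb.EndTable(start));

  // With "required", generated code adds this call after EndTable().
  // Because the slot was never filled, the assertion inside Required()
  // fires (in debug builds), i.e. the null check happens on write.
  fbb.Required(table, /*vtable slot=*/4);
  return 0;
}

Either way, the bytes that end up in the buffer are the same; the attribute
only changes what the generated builder and verifier will accept.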


On Mon, Jan 20, 2020 at 10:47 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> On 20/01/2020 at 17:17, Jacques Nadeau wrote:
> >>
> >> To be clear, I agree that we need to check that our various validation
> >> and integration suites pass properly.  But once that is done and
> >> assuming all the metadata variations are properly tested, data
> >> variations should not pose any problem.
> >
> > Unless I'm misunderstanding your proposal, that doesn't deal with the data
> > that has already been produced that may have been written in a way that
> > this change finds non-consumable but works today.
>
> Yes, so the question is whether such data can be valid Arrow data at
> all, or whether it is semantically invalid even though it has a valid
> encoding.  For example, what does a Field mean with a missing datatype?
>
> It can probably help to review which fields would be made required.
> For example, here are the potential candidates in Schema.fbs:
> - Union.typeIds
> - KeyValue.key
> - KeyValue.value
> - Field.type
> - Schema.fields
>
> In Message.fbs:
> - RecordBatch.nodes
> - RecordBatch.buffers
> - DictionaryBatch.data
> - Message.header
>
> In File.fbs:
> - Footer.recordBatches
>
> In Tensor.fbs:
> - Tensor.type
> - Tensor.shape
> - Tensor.strides
> - Tensor.data
>
> In SparseTensor.fbs:
> - SparseTensorIndexCOO.indicesStrides
> - SparseTensorIndexCOO.indicesBuffer
> - SparseMatrixIndexCSX.indptrBuffer
> - SparseMatrixIndexCSX.indicesBuffer
> - SparseTensor.type
> - SparseTensor.shape
> - SparseTensor.sparseIndex
> - SparseTensor.data
>
> >> Of course, we can hand-write all the NULL checks on the read side.  My
> >> concern is not the one-time cost of doing so, but the long-term
> >> fragility of such a strategy.
> >
> > I agree with you in principle about using tools rather than humans to
> > minimize mistakes. On the flipside, we chose to use optional for the same
> > reason that flatbuf defaults to optional, protobuf2 recommended use of
> > optional over required and protobuf3 removed the ability to express things
> > as required [1].
>
> Right, flatbuffers recommends optional fields for the ease of evolving
> the format.  One does not have to agree with that opinion.  Evolving the
> encoding flexibly is one thing; keeping all implementations compatible
> with each other (and with past encoded data) is another.  By having
> multiple optional fields, some of which become deprecated along the way,
> a combinatorial explosion is created.  It's probably less of a concern
> for an in-house protocol than for an open standard like Arrow, where
> there may be multiple third-party implementations around at some point.
>
> Regards
>
> Antoine.
