Also I ask about this in the context of building an optimized encoder.
For this implementation, the resolution will be much simpler if we limit
union to not support two records, similar to the spec do not allow two
array or two map types. I wonder if this limit breaks any significant
use case.
Wai Yip
Wai Yip Tung <mailto:w...@tungwaiyip.info>
Wednesday, June 04, 2014 4:40 PM
For encoding data of union type, the Avro specification do not say a
lot which one of the type in the union is used. So far I am mostly
using union so that I can write null or another simple type. In these
cases, it is fairly obvious for the encoding to distinguish null from
other types.
However a union can also be any named types. So they can be two
records. Let say a Manger record and a NonManager record. I think with
strongly typed languages, the suitable type in the union can be
selected by introspection. But for dynamic languages, these might just
be a represented as maps without any notion of type. In some case, we
may find that the object has all the attributes of a NonManager but
not the Manager. So we can conclude NonManager is the proper schema to
use. But this can get complicated with nested data structure where the
attribute that can disambiguate thing appear in a deeper level. Or you
can think of valid scenario where inspecting the content of the obj
cannot unambiguously resolve the union branch.
I notice that the Python implementation use two pass recursive
validation possible for the reason of for resolving the union choice.
I am wonder if there are much consideration about are potentially
complex, indirectly nested union types that might be difficult to
resolve? Thus adding complexity to the implementation of the encoders?
Are there use case in practice that involve complex union decision?
Wai Yip