Jens,

I think I agree with you on all points.  Let me restate, to be sure I got
it right:

(1) if t_const_value uses a union like a sort-of-struct, then that's a
bug.  So let's fix it.

(2) It's possible that the wireline has "unions" with more than one field.
Generated code should be able to read that and produce valid unions
in-memory.  And of course, generated code, used correctly, should produce
wirelines with exactly one field.

(3) I thought about what you propose: replacing the "isset" bits with an
index field (initialized to -1, I guess, for "no field yet").  I think that
eventually we should do that, but for the short-term,we can just do what I
suggested, wihch is to zero out the bits on each field-write before setting
the bit for that field.  Let me try to argue for this change:

(a) it's efficient: the zeroing (of for example the isset for
t_const_value) can be done with (for example):

 __isset  = _t_const_value_isset() ;

The optimization of bit-field accesses is a pretty old thing for compilers,
and so I think it's fair to assume that the initializer will be a
single-word assignment (up to 32/64 fields).  So it should be a wash,
perf-wise.

[Just out of curiousity, I wrote a program with 64 bit fields, and looked
at the assembler.  Then I reduced it to 20 fields, and compared.  Just to
see if there was a difference in size (which there would have to be, if
there was anything related to # of fields, affecting code.  No change.
Sadly, the last assembler I knew was VAX, so I couldn't actually understand
the assembler ... <sigh>]

(b) it's compatible with people using unions today.  We can ensure that
read()/write() only produce/consume valid unions, but leave the in-memory
interface the same.  I would argue that this is important b/c today,
there's no "abstract" interface to the isset() fields.  People go at them
directly.  Heck, I bet they -set- them directly, and don't just read 'em.
Before we change thae implementation of isset, I'd think we should put an
absttraction barrier there.

At least, we need such an abstraction barrier for the actual fields, don't
we?

What I'm saying is, we can maintain the current in-memory API, fix the
rules for what constitutes a valid union (in code) and THEN design a proper
in-memory API for unions, which can replace the current one?

How about that?

--chet--

On Sat, Dec 2, 2017 at 9:09 AM, Jens Geyer <jensge...@hotmail.com> wrote:

> Hi,
>
> > (2) this means that if code sets two fields of a union, both fields get
> > sent on-the-wire, and the receiving end deserializes both fields.  So it
> > really isn't a union, and it certianly doesn't "provide a means to
> > transport exactly one field of a possible set of fieldsprovide a means to
> > transport exactly one field of a possible set of fields"
>
> and
>
> > (7) More generally, anybody who uses unions, and relies on them really
> > being "structs with all optional fields" will find that they can't rely
> on
> > that anymore.
>
> I think we have to differentiate between what is on the wire and how the
> library and generated code handles the situation.
>
> Technically it could be possible to have two union fields set, which would
> indeed be incorrect. Technically I also can craft a serialized struct that
> misses a "required" field or has two values for the same field ID, data for
> an undefined field ID, or whatever else comes to mind. All of these are
> transport-wise corect but *semnatically* invalid wire data w/respect to the
> protocol level. Therefore the code must be able to handle this. And it
> does:
> it skips unknown field IDs, it uses only the last field value and it
> (usually) complains about missing "required" fields.
>
> But all of this is code. The wire data are not affected in any way by this.
> The wire data do not even contain a flag for "required" because we don't
> need it, because the knowledge is in the generated code. And the same
> should - in my opinion - be true for the union handling. And since we only
> have to change code, not wire formats, there should be no compat issue.
>
>
> > (3) It's easy to fix this -- each set_field clears all the other ones.  I
> > won't worry for the moment about making that efficient (though it
> probably
> > doesn't need to be) but it can be made efficient using C unions (yes?)
>
> Mmmh. But that's a O(N) operation, even if we only set and clear bits. And
> not all laguages do actually have unions.
>
> Proposal: If we change it in a way where we not have a isset field flags
> set
> for unions, but only store the currently active ID, that would be an O(1)
> operation then. The flag set could still be provided as calculated property
> for (backward) compatibility. The ID would also have the benefit of being
> able to write things like
>
>     switch( myUnion.issetFieldID) {
>         case 1:  DoThis(myUnion.one); break;
>         case 2:  DoThat(myUnion.two); break;
>         case 3:  DoSomethingElse(myUnion.three); break;
>         default: Console.Write("You loose"); break;
>     }
>
> instead of
>
>
>     if( myUnion.isset_one) {
>         DoThis(myUnion.one);
>     } else if( myUnion.isset_one) {
>         case 2:  DoThat(myUnion.two);
>     } else if( myUnion.isset_one) {
>         case 3:  DoSomethingElse(myUnion.three);
>     } else {
>         Console.Write("You loose"); break;
>     }
>
> which would then also be an O(1) op instead of O(N).
>
>
> Re (4), (5), (6):
>
> union t_const_value {
>   1: optional map<t_const_value, t_const_value> map_val
>   2: optional list<t_const_value> list_val
>   3: optional string string_val
>   4: optional i64 integer_val
>   5: optional double double_val
>   6: optional string identifier_val
>   7: optional t_type_id enum_val
>
> > and those fields [6 and 7] are -not- alternatives: they are BOTH SET, or
> > NEITHER ARE SET.
>
> If that is true, the union choice here is simply incorrect. If it works
> today, it relies on undefined behavuiour and that certainly can't be a good
> thing as any bug fix can potentially break it. That relates exactly to what
> I wrote earlier:
>
> > If unions are used *as they are supposed to be*, this
> > should be a nonbreaker. If they arent, it is wrong anyway.
>
> So let's fix it.
>
> Have fun,
> JensG
>
>
> -----Ursprüngliche Nachricht-----
> From: Chet Murthy
> Sent: Saturday, December 2, 2017 12:28 AM
> To: dev@thrift.apache.org
> Subject: Re: Thrift unions don't work as documented
>
> Jens,
>
> YES!  That's why I sent the email, rather than diving into code!  *grin*
>
> (1) the current implementation in C++ and Ocaml treats a union as a struct
> with all-optional fields.
>
> (2) this means that if code sets two fields of a union, both fields get
> sent on-the-wire, and the receiving end deserializes both fields.  So it
> really isn't a union, and it certianly doesn't "provide a means to
> transport exactly one field of a possible set of fieldsprovide a means to
> transport exactly one field of a possible set of fields"
>
> (3) It's easy to fix this -- each set_field clears all the other ones.  I
> won't worry for the moment about making that efficient (though it probably
> doesn't need to be) but it can be made efficient using C unions (yes?)
>
> (4) But this *will* be a breaking change.  B/c for instance, in
> plugin.thrift, t_const_value has fields
>
>   6: optional string identifier_val
>   7: optional t_type_id enum_val
>
> and those fields are -not- alternatives: they are BOTH SET, or NEITHER ARE
> SET.
>
> (5) So at a minimum, we'd need to replace that wiith something like
>
> struct t_const_identifier_value {
>   1: required string identifier_val
>   2: required t_type_id enum_val
> }
>
> and then replace those two fields with
>
>   8: optional t_const_identifier_value const_identifier_val
>
> (6) This will break any code people have written that uses this plugin
> interface (b/c types change in non-compatible ways)
>
> (7) More generally, anybody who uses unions, and relies on them really
> being "structs with all optional fields" will find that they can't rely on
> that anymore.
>
> I would note that neither of these is a big problem.  But a breaking change
> is a breaking change, right?
>
> Anyway, that's the story.  I think you can see, that writing the code is
> the easy part.
>
> If you think this is OK to do, I can probably prepare a PR quickly.
>
> --chet--
>
> P.S. Why did I look into this subject?  B/c I want to write a new ocaml
> library, that will use "idiomatic" Ocaml types, and hopefully be much more
> efficient.  But certainly be easier-to-use for Ocaml programmers.  Thrift
> unions should be mapped to Ocaml union types ("constructor datatypes").
> And those .... well, they really *are* exactly one field can be set at any
> time.  So I figure, yes I can write my new compiler and library.  But it'll
> be incompatible with the current Thrift contract.  So I figured, I should
> look into whether that contract can be fixed, while I hack my code.
>
>
>
> On Fri, Dec 1, 2017 at 2:48 PM, Jens Geyer <jensge...@hotmail.com> wrote:
>
> >
> >
> > If you're not sure, you can discuss it. If you are sure, provide a PR
> > (ideally one per language) and we look at it.
> >
> > From my feelings, I'm not really sure if I fully understand the problem
> > and
> > how you plan to fix it. Maybe it is a good idea to sketch the idea behind
> > and get some comments before you deep-dive in the code.
> >
>
>

Reply via email to