Re: Review Request 22457: Improve aurora "job diff" command.

Mark Chu-Carroll Wed, 25 Jun 2014 10:05:26 -0700


> On June 16, 2014, 5:45 p.m., Maxim Khutornenko wrote:
> > src/main/python/apache/aurora/client/cli/jobs.py, lines 181-182
> > <https://reviews.apache.org/r/22457/diff/3/?file=609592#file609592line181>
> >
> >     I don't think it's enough to json-serialize a thrift task. This is 
> > bound to set/dict serialization sorting problem we have battled with in 
> > updater.py. Specifically this block: 
> > https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/updater.py#L200-L223
> >     
> >     Without sorting, the diff will only work 99% of the time. This has been 
> > demonstrated in a few real-life job updates.
> 
> Mark Chu-Carroll wrote:
>     This isn't a new thing in this updated diff system - this is the current 
> behavior.
>     
>     If you use the new json-tree-diff, you won't have this problem; but for 
> people who rely on the current diff
>     behavior, I'd rather not change that.
>
> 
> Maxim Khutornenko wrote:
>     | If you use the new json-tree-diff, you won't have this problem; 
>     
>     Perhaps I am missing something but how would the new approach ensure 
> equivalence of these two structs?
>     { 'set1': { ['k1':'v1', 'k2':'v2']} }
>     { 'set2': { ['k2':'v2', 'k1':'v1']} }
>     
>     It's "normal" to see element re-ordering like this during thrift struct 
> serialization.
> 
> Maxim Khutornenko wrote:
>     Here is a real TaskConfig json fragment and its reordered twin:
>     
>     1: u'constraints': [{u'name': u'value', u'constraint': {u'values': [u'1', 
> u'2']}}, {u'name': u'limit', u'constraint': {u'limit': {u'limit': 10}}}]
>     2: u'constraints': [{u'constraint': {u'values': [u'2', u'1']}, u'name': 
> u'value'}, {u'name': u'limit', u'constraint': {u'limit': {u'limit': 10}}}]
> 
> Mark Chu-Carroll wrote:
>     I don't think that the reordering that you show in your first example 
> isn't valid json. Key-value pairs belong in a dictionary, not a list.  In a 
> dictionary, the comparison routine does the right thing: it compares key by 
> key, without regard to ordering.
>     
>     In the second example: I'd argue that that's actually a bug in our 
> serializer: it's using an ordered list for an unordered constraint. Diff 
> can't tell that the list structure isn't really supposed to be ordered; the 
> serializer should be canonicalizing it.
>
> 
> Maxim Khutornenko wrote:
>     Not sure how else you could represent a python set in JSON without 
> completely changing the underlying data structure (e.g. emitting fake keys 
> for every element and converting set into a dictionary with sorted keys). 
> Agree, the TJSONProtocol could be a bit smarter when dealing with python sets 
> but the reality is that python sets are inherently "unhashable" structs. I 
> have seen cases when completely repr-equivalent python sets were not 
> comparing as equal.
> 
> Mark Chu-Carroll wrote:
>     All the serializer would need to do is just canonicalize when it converts 
> to a list. It's not an issue of being unhashable. But it should at least 
> bother to be *uniform*. If you're converting a set to a list, chose a 
> convention for ordering elements. Put 'em in lexicographic order, problem 
> solved.
>     
>     As it stands, that problem is unsolvable for this diff tool: there's no 
> way to tell that a given field was supposed to be a set, and should be 
> compared without ordering. 
>     
>     I'm not sure what you'd like me to do here, short of giving up and 
> throwing away the new diff tool. I don't see any way to solve this, short of 
> creating something that uses thrift introspection.
>
> 
> Maxim Khutornenko wrote:
>     We solved this problem in updater diff before (see the link above) and I 
> am sure it can be solved here. Converting to sorted tuples does not produce a 
> pretty diff but if we are targeting value-comparison diff output rather than 
> serialized string compare the problem shifts into pretty-fying the sorted 
> tuple output rather than handling diff mechanics.
> 
> Mark Chu-Carroll wrote:
>     Not sure what you mean by "converting to sorted tuples". Is there any 
> documentation of what the "sorted tuple" format looks like? Or do I just need 
> to reverse-engineer? (I hate doing that, because if I make any wrong 
> assumptions, things get ugly.)
>
> 
> Maxim Khutornenko wrote:
>     I have modified the _diff_configs() in updater.py to print out the output 
> and ran "./pants src/test/python/apache/aurora/client/api:updater -s -k 
> test_diff_unordered_configs":
>     
>     ((u'constraints', (((u'constraint', ((u'limit', ((u'limit', 10),)),)), 
> (u'name', u'limit')), ((u'constraint', ((u'values', (u'1', u'2')),)), 
> (u'name', u'value')))), (u'contactEmail', u'f...@bar.com'), (u'diskMb', 1), 
> (u'environment', u'test'), (u'executorConfig', ((u'data', u'test data'), 
> (u'name', u'test'))), (u'jobName', u'jimbob'), (u'maxTaskFailures', 1), 
> (u'metadata', (((u'key', u'k1'), (u'value', u'v1')), ((u'key', u'k2'), 
> (u'value', u'v2')))), (u'numCpus', 1.0), (u'owner', ((u'role', u'mesos'),)), 
> (u'priority', 0), (u'production', 1), (u'ramMb', 1), (u'requestedPorts', 
> (u'142', u'3424', u'45235')), (u'taskLinks', ((u'task1', u'link1'), 
> (u'task2', u'link2')))) 
>     
>     The idea here is to eliminate the ordering unpredictability by replacing 
> all thrift structs with tuples, which are sortable. It sure isn't pretty but 
> it works.
> 
> Mark Chu-Carroll wrote:
>     I understand the reason for doing it; what I still don't understand is 
> just what the format here *is*. Is there any documentation on it? I can make 
> a reasonable guess at just what the "tuplization" format is and what it 
> means, but it's just a reasonable guess, based on looking at sample outputs 
> and reverse engineering.
>     
>     I don't understand how this fixes anything. As far as I can tell, what it 
> does is just replace the sets, dicts, and structs with tuples. But that 
> doesn't change anything about consistency, as far as I can understand.
>     
>     Let me try to ask in a slightly different way, to get at what I'm not 
> understanding. How does this solve the problem? At some point, you're still 
> taking the set of values whose order is unpredictable, and stuffing them into 
> a tuple, which is ordered. What about this format guarantees that the order 
> in which the elements of a set are converted into a tuple will be consistent?
>     
>     In the example you gave before, you've got a constraints field, which is 
> a set of strings. In normal JSON serialization, that turns into:
>        {u'values': [u'2', u'1']}
>     
>     In the serialization process, the json serializer took a field which was 
> actually a set([u'1', u'2']), and serialized it, producing [u'2', u'1']. The 
> serialization into a list imposed an order on the set. The problem is that we 
> can't count on consistency in how that order is chosen.
>     
>     Now, in this new serialization format, we're doing something very similar 
> - except that instead of serializing set([u'1', u'2']) into [u'1', u'2'], it 
> serializes it as (u'1', u'2'). To serialize it into a tuple, the serialized 
> needed to chose an order for the elements of the set. If I understand you 
> correctly, you're saying that this tuplized format solves the 
> ordering-consistency problem, somehow. But I don't see how. What about tuples 
> guarantees order consistency where lists don't?
>     
>     
>     
>     
>
> 
> Maxim Khutornenko wrote:
>     The key point here is that tuples are sorted at every step during 
> construction and specifically this line [1] ensures the consistency of sets 
> on compare. So, the {u'values': [u'2', u'1']} becomes ((u'values', (u'1', 
> u'2')),)) and so on when unwinding the recursion. 
>     
>     [1] - 
> https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/client/api/updater.py#L202
>     
>     There is no documentation besides what is already there in the 
> _diff_configs(). The concept is quite simple, so I am not sure what else 
> needs to be documented there. Perhaps add an output example?

An output example would definitely help.

Also, more on the rationale - the explanation in that code doesn't explain why 
it makes to go so far - particularly in the case of the new diff algorithm that 
we're discussing.

For example: why convert lists to tuples? Lists have the same ordering 
guarantees and comparison guarantees as tuples. So you're not increasing the 
comparability, but you are making the resulting diffs harder to read?

(In fact, what's going on in that _hashable and _diff_configs seems to not just 
be over-aggressive, but incorrect. A list [b, a] in a constraint expression 
can't be converted to [a, b]. As it stands, that will claim no diffs in cases 
where there's a significant diff, because it erases ordering even when it 
shouldn't.)

Why convert dictionaries to tuples? Dictionaries don't have an ordering 
constraint; key-by-key comparison is fine. If you're dumping the config out to 
a file for comparison by the unix diff tool, that sort-of makes sense, because 
you need the keys to be in the same order for a text diff. But why convert to a 
tuple of tuples? Why not just do a list of tuples? Again, comparison wise it's 
equivalent, but readability-wise, a list of pairs is provides an extra bit of 
visual cue about the structure, which is helpful for the person reading the 
diff.

- Mark

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22457/#review45835
-----------------------------------------------------------

On June 18, 2014, 10:59 a.m., Mark Chu-Carroll wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22457/
> -----------------------------------------------------------
> 
> (Updated June 18, 2014, 10:59 a.m.)
> 
> 
> Review request for Aurora, David McLaughlin and Brian Wickman.
> 
> 
> Bugs: aurora-520
>     https://issues.apache.org/jira/browse/aurora-520
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> Add a new diff method, which uses field-by-field comparison of JSON trees for 
> comparing running job configurations to potentially updated configs.
> 
> - Allow exclusion of semantically irrelevant fields.
> - Provide a clearer list of the differences between configs.
> - Provide a scripting-friendly alternative JSON syntax for diffs.
> 
> The old diff behavior is still available under the "--use-shell-diff" option.
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/client/cli/BUILD 
> ebe681a0d1735b7cc695dc3b7a14c4292d87ae32 
>   src/main/python/apache/aurora/client/cli/jobs.py 
> 4fa03a6c9919651551238b0dc211ed69a8dfe565 
>   src/main/python/apache/aurora/client/cli/json_tree_diff.py PRE-CREATION 
>   src/test/python/apache/aurora/client/cli/BUILD 
> 3c88ed7cf9f654bbbd80d1d44aa1dd1c8655e378 
>   src/test/python/apache/aurora/client/cli/test_diff.py 
> 38629b63c082cf81cb891dace2a70d9e8f418e18 
>   src/test/python/apache/aurora/client/cli/test_json_diff.py PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22457/diff/
> 
> 
> Testing
> -------
> 
> New unit tests of the JSON tree diff code, plus a bunch of new "job diff" 
> tests of the new functionality.
> All tests pass.
> 
> 
> Thanks,
> 
> Mark Chu-Carroll
> 
>

Re: Review Request 22457: Improve aurora "job diff" command.

Reply via email to