Re: [PATCH 0/2] routines to generate JSON data
On 3/20/2018 1:42 AM, Jeff King wrote:
> On Mon, Mar 19, 2018 at 06:19:26AM -0400, Jeff Hostetler wrote:
> > > To make the above work, I think you'd have to store a little more
> > > state. E.g., the "array_append" functions check "out->len" to see
> > > if they need to add a separating comma. That wouldn't work if we
> > > might be part of a nested array. So I think you'd need a context
> > > struct like:
> > >
> > >    struct json_writer {
> > >            int first_item;
> > >            struct strbuf out;
> > >    };
> > >    #define JSON_WRITER_INIT { 1, STRBUF_INIT }
> > >
> > > to store the state and the output. As a bonus, you could also use
> > > it to store some other sanity checks (e.g., keep a "depth" counter
> > > and BUG() when somebody tries to access the finished strbuf with a
> > > hanging-open object or array).
> >
> > Yeah, I thought about that, but I think it gets more complex than
> > that. I'd need a stack of "first_item" values. Or maybe the _begin()
> > needs to increment a depth and set first_item and the _end() needs
> > to always unset first_item. I'll look at this again.
>
> I think you may be able to get by with just unsetting first_item for
> any "end". Because as you "pop" to whatever data structure is holding
> whatever has ended, you know it's no longer the first item (whatever
> just ended was there before it).
>
> I admit I haven't thought too hard on it, though, so maybe I'm missing
> something.

I'll take a look.  Thanks.

> > The thing I liked about the bottom-up construction is that it is
> > easier to collect multiple sets in parallel and combine them during
> > the final roll-up. With the in-line nesting, you're tempted to try
> > to construct the resulting JSON in a single series and that may not
> > fit what the code is trying to do. For example, if I wanted to
> > collect an array of error messages as they are generated and an
> > array of argv arrays and any alias expansions, then put together a
> > final JSON string containing them and the final exit code, I'd need
> > to build it in parts. I can build these parts in pieces of JSON and
> > combine them at the end -- or build up other similar data structures
> > (string arrays, lists, or whatever) and then have a JSON conversion
> > step. But we can make it work both ways, I just wanted to keep it
> > simpler.
>
> Yeah, I agree that kind of bottom-up construction would be nice for
> some cases. I'm mostly worried about inefficiency copying the strings
> over and over as we build up the final output. Maybe that's premature
> worrying, though.
>
> If the first_item thing isn't too painful, then it might be nice to
> have both approaches available.

True.

> > > In general I'd really prefer to keep the shell script as the
> > > driver for the tests, and have t/helper programs just be conduits.
> > > E.g., something like:
> > >
> > >    cat >expect <<-\EOF &&
> > >    {"key": "value", "foo": 42}
> > >    EOF
> > >    test-json-writer >actual \
> > >            object_begin \
> > >            object_append_string key value \
> > >            object_append_int foo 42 \
> > >            object_end &&
> > >    test_cmp expect actual
> > >
> > > It's a bit tedious (though fairly mechanical) to expose the API in
> > > this way, but it makes it much easier to debug, modify, or add
> > > tests later on (for example, I had to modify the C program to find
> > > out that my append example above wouldn't work).
> >
> > Yeah, I wasn't sure if such a simple api required exposing all that
> > machinery to the shell or not. And the api is fairly self-contained
> > and not depending on a lot of disk/repo setup or anything, so my
> > tests would be essentially static WRT everything else.
> >
> > With my t0019 script you should have been able to use -x -v to see
> > what was failing.
>
> I was able to run the test-helper directly. The tricky thing is that
> I had to write new C code to test my theory about how the API worked.
> Admittedly that's not something most people would do regularly, but I
> often seem to end up doing that kind of probing and debugging.
>
> Many times I've found the more generic t/helper programs useful. I
> also wonder if various parts of the system embrace JSON, if we'd want
> to have a tool for generating it as part of other tests (e.g., to
> create "expect" files).

Ok, let me see what I can come up with.

Thanks
Jeff
Re: [PATCH 0/2] routines to generate JSON data
On Sat, Mar 17, 2018 at 12:00:26AM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Mar 16 2018, Jeff King jotted:
>
> > I really like the idea of being able to send our machine-readable
> > output in some "standard" syntax for which people may already have
> > parsers. But one big hangup with JSON is that it assumes all strings
> > are UTF-8.
>
> FWIW It's not UTF-8 but "Unicode characters", i.e. any Unicode
> encoding is valid, not that it changes anything you're pointing out,
> but people on Win32 could use UTF-16 as-is if their filenames were in
> that format.

But AIUI, non-UTF8 has to come as "\u" escapes, right? That at least
gives us an "out" for exotic characters, but I don't think we can just
blindly dump pathnames into quoted strings, can we?

> > Some possible solutions I can think of:
> >
> >   1. Ignore the UTF-8 requirement, making a JSON-like output (which
> >      I think is what your patches do). I'm not sure what problems
> >      this might cause on the parsing side.
>
> Maybe some JSON parsers are more permissive, but they'll commonly just
> die on non-Unicode (usually UTF-8) input, e.g.:
>
>     $ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') |
>       perl -0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
>     malformed UTF-8 character in JSON string, at character offset 10
>     (before "\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.

OK, that's about what I expected.

> >   2. Specially encode non-UTF-8 bits. I'm not familiar enough with
> >      JSON to know the options here, but my understanding is that
> >      numeric escapes are just for inserting unicode code points.
> >      _Can_ you actually transport arbitrary binary data across JSON
> >      without base64-encoding it (yech)?
>
> There's no way to transfer binary data in JSON without it being shoved
> into a UTF-8 encoding, so you'd need to know on the other side that
> such-and-such a field has binary in it, i.e. you'll need to invent
> your own schema.

Yuck. That's what I was afraid of. Is there any kind of standard scheme
here? It seems like we lose all of the benefits of JSON if the receiver
has to know whether and when to de-base64 (or whatever) our data.

> I think for git's use-case we're probably best off with JSON. It's
> going to work almost all of the time, and when it doesn't it's going
> to be on someone's weird non-UTF-8 repo, and those people are probably
> used to dealing with crap because of that anyway and can just manually
> decode their thing after it gets double-encoded.

That sounds a bit hand-wavy. While I agree that anybody using non-utf8
at this point is slightly insane, Git _does_ actually work with
arbitrary encodings in things like pathnames. It just seems kind of
lame to settle on a new universal encoding format for output that's
actually less capable than the current output.

> That sucks, but given that we'll be using this either for just ASCII
> (telemetry) or UTF-8 most of the time, and that realistically other
> formats either suck more or aren't nearly as ubiquitous...

I'd hoped to be able to output something like "git status" in JSON,
which is inherently going to deal with user paths.

-Peff
Re: [PATCH 0/2] routines to generate JSON data
On Mon, Mar 19, 2018 at 06:19:26AM -0400, Jeff Hostetler wrote:

> > To make the above work, I think you'd have to store a little more
> > state. E.g., the "array_append" functions check "out->len" to see if
> > they need to add a separating comma. That wouldn't work if we might
> > be part of a nested array. So I think you'd need a context struct
> > like:
> >
> >    struct json_writer {
> >            int first_item;
> >            struct strbuf out;
> >    };
> >    #define JSON_WRITER_INIT { 1, STRBUF_INIT }
> >
> > to store the state and the output. As a bonus, you could also use it
> > to store some other sanity checks (e.g., keep a "depth" counter and
> > BUG() when somebody tries to access the finished strbuf with a
> > hanging-open object or array).
>
> Yeah, I thought about that, but I think it gets more complex than
> that. I'd need a stack of "first_item" values. Or maybe the _begin()
> needs to increment a depth and set first_item and the _end() needs to
> always unset first_item. I'll look at this again.

I think you may be able to get by with just unsetting first_item for
any "end". Because as you "pop" to whatever data structure is holding
whatever has ended, you know it's no longer the first item (whatever
just ended was there before it).

I admit I haven't thought too hard on it, though, so maybe I'm missing
something.

> The thing I liked about the bottom-up construction is that it is
> easier to collect multiple sets in parallel and combine them during
> the final roll-up. With the in-line nesting, you're tempted to try to
> construct the resulting JSON in a single series and that may not fit
> what the code is trying to do. For example, if I wanted to collect an
> array of error messages as they are generated and an array of argv
> arrays and any alias expansions, then put together a final JSON string
> containing them and the final exit code, I'd need to build it in
> parts. I can build these parts in pieces of JSON and combine them at
> the end -- or build up other similar data structures (string arrays,
> lists, or whatever) and then have a JSON conversion step. But we can
> make it work both ways, I just wanted to keep it simpler.

Yeah, I agree that kind of bottom-up construction would be nice for
some cases. I'm mostly worried about inefficiency copying the strings
over and over as we build up the final output. Maybe that's premature
worrying, though.

If the first_item thing isn't too painful, then it might be nice to
have both approaches available.

> > In general I'd really prefer to keep the shell script as the driver
> > for the tests, and have t/helper programs just be conduits. E.g.,
> > something like:
> >
> >    cat >expect <<-\EOF &&
> >    {"key": "value", "foo": 42}
> >    EOF
> >    test-json-writer >actual \
> >            object_begin \
> >            object_append_string key value \
> >            object_append_int foo 42 \
> >            object_end &&
> >    test_cmp expect actual
> >
> > It's a bit tedious (though fairly mechanical) to expose the API in
> > this way, but it makes it much easier to debug, modify, or add tests
> > later on (for example, I had to modify the C program to find out
> > that my append example above wouldn't work).
>
> Yeah, I wasn't sure if such a simple api required exposing all that
> machinery to the shell or not. And the api is fairly self-contained
> and not depending on a lot of disk/repo setup or anything, so my tests
> would be essentially static WRT everything else.
>
> With my t0019 script you should have been able to use -x -v to see
> what was failing.

I was able to run the test-helper directly. The tricky thing is that I
had to write new C code to test my theory about how the API worked.
Admittedly that's not something most people would do regularly, but I
often seem to end up doing that kind of probing and debugging.

Many times I've found the more generic t/helper programs useful. I
also wonder if various parts of the system embrace JSON, if we'd want
to have a tool for generating it as part of other tests (e.g., to
create "expect" files).

-Peff
Re: [PATCH 0/2] routines to generate JSON data
On 3/17/2018 3:38 AM, Jacob Keller wrote:
> On Fri, Mar 16, 2018 at 2:18 PM, Jeff King wrote:
> > 3. Some other similar format. YAML comes to mind. Last time I looked
> >    (quite a while ago), it seemed insanely complex, but I think you
> >    could implement only a reasonable subset. OTOH, I think the tools
> >    ecosystem for parsing JSON (e.g., jq) is much better.
>
> I would personally avoid YAML. It's "easier" for humans to read/parse,
> but honestly JSON is already simple enough and anyone who writes C or
> javascript can likely parse and hand-write JSON anyways. YAML lacks
> built-in parsers for most languages, whereas many scripting languages
> already have JSON parsing built in, or have more easily attainable
> libraries available. In contrast, the YAML libraries are much more
> complex and less likely to be available. That's just my own experience
> at $dayjob though.

Agreed. I just looked at the spec for it and I think it would be harder
for us to be assured we are generating valid output with leading
whitespace being significant (without a lot more inspection of the
strings being passed down to us).

Jeff
Re: [PATCH 0/2] routines to generate JSON data
On 3/16/2018 5:18 PM, Jeff King wrote:
> On Fri, Mar 16, 2018 at 07:40:55PM +, g...@jeffhostetler.com wrote:
>
> [...]
>
> I really like the idea of being able to send our machine-readable
> output in some "standard" syntax for which people may already have
> parsers. But one big hangup with JSON is that it assumes all strings
> are UTF-8. That may be OK for telemetry data, but it would probably
> lead to problems for something like status porcelain, since Git's
> view of paths is just a string of bytes (not to mention possible uses
> elsewhere like author names, subject lines, etc).
>
> [...]

I'll come back to the UTF-8/YAML questions in a separate response.

> > Documentation for the new API is given in json-writer.h at the
> > bottom of the first patch.
>
> The API generally looks pleasant, but the nesting surprised me. E.g.,
> I'd have expected:
>
>    jw_array_begin(out);
>    jw_array_begin(out);
>    jw_array_append_int(out, 42);
>    jw_array_end(out);
>    jw_array_end(out);
>
> to result in an array containing an array containing an integer. But
> array_begin() actually resets the strbuf, so you can't build up
> nested items like this internally. Ditto for objects within objects.
> You have to use two separate strbufs and copy the data an extra time.
>
> To make the above work, I think you'd have to store a little more
> state. E.g., the "array_append" functions check "out->len" to see if
> they need to add a separating comma. That wouldn't work if we might
> be part of a nested array. So I think you'd need a context struct
> like:
>
>    struct json_writer {
>            int first_item;
>            struct strbuf out;
>    };
>    #define JSON_WRITER_INIT { 1, STRBUF_INIT }
>
> to store the state and the output. As a bonus, you could also use it
> to store some other sanity checks (e.g., keep a "depth" counter and
> BUG() when somebody tries to access the finished strbuf with a
> hanging-open object or array).

Yeah, I thought about that, but I think it gets more complex than that.
I'd need a stack of "first_item" values. Or maybe the _begin() needs to
increment a depth and set first_item and the _end() needs to always
unset first_item. I'll look at this again.

The thing I liked about the bottom-up construction is that it is easier
to collect multiple sets in parallel and combine them during the final
roll-up. With the in-line nesting, you're tempted to try to construct
the resulting JSON in a single series and that may not fit what the
code is trying to do. For example, if I wanted to collect an array of
error messages as they are generated and an array of argv arrays and
any alias expansions, then put together a final JSON string containing
them and the final exit code, I'd need to build it in parts. I can
build these parts in pieces of JSON and combine them at the end -- or
build up other similar data structures (string arrays, lists, or
whatever) and then have a JSON conversion step. But we can make it work
both ways, I just wanted to keep it simpler.

> > I wasn't sure how to unit test the API from a shell script, so I
> > added a helper command that does most of the work in the second
> > patch.
>
> In general I'd really prefer to keep the shell script as the driver
> for the tests, and have t/helper programs just be conduits. E.g.,
> something like:
>
>    cat >expect <<-\EOF &&
>    {"key": "value", "foo": 42}
>    EOF
>    test-json-writer >actual \
>            object_begin \
>            object_append_string key value \
>            object_append_int foo 42 \
>            object_end &&
>    test_cmp expect actual
>
> It's a bit tedious (though fairly mechanical) to expose the API in
> this way, but it makes it much easier to debug, modify, or add tests
> later on (for example, I had to modify the C program to find out that
> my append example above wouldn't work).

Yeah, I wasn't sure if such a simple api required exposing all that
machinery to the shell or not. And the api is fairly self-contained and
not depending on a lot of disk/repo setup or anything, so my tests
would be essentially static WRT everything else.

With my t0019 script you should have been able to use -x -v to see what
was failing.

> -Peff

thanks for the quick review
Jeff
Re: [PATCH 0/2] routines to generate JSON data
On Fri, Mar 16, 2018 at 2:18 PM, Jeff King wrote:
> 3. Some other similar format. YAML comes to mind. Last time I looked
>    (quite a while ago), it seemed insanely complex, but I think you
>    could implement only a reasonable subset. OTOH, I think the tools
>    ecosystem for parsing JSON (e.g., jq) is much better.

I would personally avoid YAML. It's "easier" for humans to read/parse,
but honestly JSON is already simple enough and anyone who writes C or
javascript can likely parse and hand-write JSON anyways. YAML lacks
built-in parsers for most languages, whereas many scripting languages
already have JSON parsing built in, or have more easily attainable
libraries available. In contrast, the YAML libraries are much more
complex and less likely to be available. That's just my own experience
at $dayjob though.

Thanks,
Jake
Re: [PATCH 0/2] routines to generate JSON data
On Fri, Mar 16 2018, Jeff King jotted:

> I really like the idea of being able to send our machine-readable
> output in some "standard" syntax for which people may already have
> parsers. But one big hangup with JSON is that it assumes all strings
> are UTF-8.

FWIW It's not UTF-8 but "Unicode characters", i.e. any Unicode encoding
is valid, not that it changes anything you're pointing out, but people
on Win32 could use UTF-16 as-is if their filenames were in that format.

I'm just going to use UTF-8 synonymously with "Unicode encoding" for
the rest of this mail...

> Some possible solutions I can think of:
>
>   1. Ignore the UTF-8 requirement, making a JSON-like output (which I
>      think is what your patches do). I'm not sure what problems this
>      might cause on the parsing side.

Maybe some JSON parsers are more permissive, but they'll commonly just
die on non-Unicode (usually UTF-8) input, e.g.:

    $ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') |
      perl -0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
    malformed UTF-8 character in JSON string, at character offset 10
    (before "\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.

>   2. Specially encode non-UTF-8 bits. I'm not familiar enough with
>      JSON to know the options here, but my understanding is that
>      numeric escapes are just for inserting unicode code points.
>      _Can_ you actually transport arbitrary binary data across JSON
>      without base64-encoding it (yech)?

There's no way to transfer binary data in JSON without it being shoved
into a UTF-8 encoding, so you'd need to know on the other side that
such-and-such a field has binary in it, i.e. you'll need to invent your
own schema. E.g.:

    head -c 10 /dev/urandom |
    perl -MDevel::Peek -MJSON::XS -wE '
        my $in = <STDIN>;
        my $roundtrip = decode_json(encode_json({str => $in}))->{str};
        utf8::decode($roundtrip) if $ARGV[0];
        say Dump [$in, $roundtrip]' 0

You can tweak that trailing "0" to "1" to toggle the ad-hoc schema,
i.e. after we decode the JSON we go and manually UTF-8 decode it to get
back at the same binary data, otherwise we end up with an UTF-8 escaped
version of what we put in.

> 3. Some other similar format. YAML comes to mind. Last time I looked
>    (quite a while ago), it seemed insanely complex, but I think you
>    could implement only a reasonable subset. OTOH, I think the tools
>    ecosystem for parsing JSON (e.g., jq) is much better.

The lack of fast schema-less formats that supported arrays, hashes etc.
and didn't suck when it came to mixed binary/UTF-8 led us to
implementing our own at work: https://github.com/Sereal/Sereal

I think for git's use-case we're probably best off with JSON. It's
going to work almost all of the time, and when it doesn't it's going to
be on someone's weird non-UTF-8 repo, and those people are probably
used to dealing with crap because of that anyway and can just manually
decode their thing after it gets double-encoded.

That sucks, but given that we'll be using this either for just ASCII
(telemetry) or UTF-8 most of the time, and that realistically other
formats either suck more or aren't nearly as ubiquitous...
Re: [PATCH 0/2] routines to generate JSON data
On Fri, Mar 16, 2018 at 07:40:55PM +, g...@jeffhostetler.com wrote:

> This patch series adds a set of utility routines to compose data in
> JSON format into a "struct strbuf". The resulting string can then be
> output by commands wanting to support a JSON output format.
>
> This is a stand-alone patch. Nothing currently uses these routines.
> I'm currently working on a series to log "telemetry" data (as we
> discussed briefly during Ævar's "Performance Misc" session [1] in
> Barcelona last week). And I want to emit the data in JSON rather than
> a fixed column/field format. The JSON routines here are independent
> of that, so it made sense to submit the JSON part by itself.
>
> Back when we added porcelain=v2 format to status, we talked about
> adding a JSON format. I think the routines in this patch would let us
> easily do that, if someone were interested. (Extending status is not
> on my radar right now, however.)

I really like the idea of being able to send our machine-readable
output in some "standard" syntax for which people may already have
parsers. But one big hangup with JSON is that it assumes all strings
are UTF-8. That may be OK for telemetry data, but it would probably
lead to problems for something like status porcelain, since Git's view
of paths is just a string of bytes (not to mention possible uses
elsewhere like author names, subject lines, etc).

Before we commit to a standardized format, I think we need to work out
a solution there (because I'd much rather not go down this road for
telemetry data only to find that we cannot use the same standardized
format in other parts of Git).

Some possible solutions I can think of:

  1. Ignore the UTF-8 requirement, making a JSON-like output (which I
     think is what your patches do). I'm not sure what problems this
     might cause on the parsing side.

  2. Specially encode non-UTF-8 bits. I'm not familiar enough with
     JSON to know the options here, but my understanding is that
     numeric escapes are just for inserting unicode code points. _Can_
     you actually transport arbitrary binary data across JSON without
     base64-encoding it (yech)?

  3. Some other similar format. YAML comes to mind. Last time I looked
     (quite a while ago), it seemed insanely complex, but I think you
     could implement only a reasonable subset. OTOH, I think the tools
     ecosystem for parsing JSON (e.g., jq) is much better.

> Documentation for the new API is given in json-writer.h at the bottom
> of the first patch.

The API generally looks pleasant, but the nesting surprised me. E.g.,
I'd have expected:

    jw_array_begin(out);
    jw_array_begin(out);
    jw_array_append_int(out, 42);
    jw_array_end(out);
    jw_array_end(out);

to result in an array containing an array containing an integer. But
array_begin() actually resets the strbuf, so you can't build up nested
items like this internally. Ditto for objects within objects. You have
to use two separate strbufs and copy the data an extra time.

To make the above work, I think you'd have to store a little more
state. E.g., the "array_append" functions check "out->len" to see if
they need to add a separating comma. That wouldn't work if we might be
part of a nested array. So I think you'd need a context struct like:

    struct json_writer {
            int first_item;
            struct strbuf out;
    };
    #define JSON_WRITER_INIT { 1, STRBUF_INIT }

to store the state and the output. As a bonus, you could also use it
to store some other sanity checks (e.g., keep a "depth" counter and
BUG() when somebody tries to access the finished strbuf with a
hanging-open object or array).

> I wasn't sure how to unit test the API from a shell script, so I
> added a helper command that does most of the work in the second
> patch.

In general I'd really prefer to keep the shell script as the driver
for the tests, and have t/helper programs just be conduits. E.g.,
something like:

    cat >expect <<-\EOF &&
    {"key": "value", "foo": 42}
    EOF
    test-json-writer >actual \
            object_begin \
            object_append_string key value \
            object_append_int foo 42 \
            object_end &&
    test_cmp expect actual

It's a bit tedious (though fairly mechanical) to expose the API in
this way, but it makes it much easier to debug, modify, or add tests
later on (for example, I had to modify the C program to find out that
my append example above wouldn't work).

-Peff
[PATCH 0/2] routines to generate JSON data
From: Jeff Hostetler

This patch series adds a set of utility routines to compose data in
JSON format into a "struct strbuf". The resulting string can then be
output by commands wanting to support a JSON output format.

This is a stand-alone patch. Nothing currently uses these routines. I'm
currently working on a series to log "telemetry" data (as we discussed
briefly during Ævar's "Performance Misc" session [1] in Barcelona last
week). And I want to emit the data in JSON rather than a fixed
column/field format. The JSON routines here are independent of that, so
it made sense to submit the JSON part by itself.

Back when we added porcelain=v2 format to status, we talked about
adding a JSON format. I think the routines in this patch would let us
easily do that, if someone were interested. (Extending status is not on
my radar right now, however.)

Documentation for the new API is given in json-writer.h at the bottom
of the first patch.

I wasn't sure how to unit test the API from a shell script, so I added
a helper command that does most of the work in the second patch.

[1] https://public-inbox.org/git/20180313004940.gg61...@google.com/T/

Jeff Hostetler (2):
  json_writer: new routines to create data in JSON format
  json-writer: unit test

 Makefile                    |   2 +
 json-writer.c               | 224 ++++++++++++++++++++++++++++++++++++
 json-writer.h               | 120 ++++++++++++++++++
 t/helper/test-json-writer.c | 146 ++++++++++++++++++++
 t/t0019-json-writer.sh      |  10 ++
 5 files changed, 502 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

--
2.9.3