Re: [PATCH 0/2] routines to generate JSON data

2018-03-20 Thread Jeff Hostetler



On 3/20/2018 1:42 AM, Jeff King wrote:

> On Mon, Mar 19, 2018 at 06:19:26AM -0400, Jeff Hostetler wrote:
> 
> > > To make the above work, I think you'd have to store a little more state.
> > > E.g., the "array_append" functions check "out->len" to see if they need
> > > to add a separating comma. That wouldn't work if we might be part of a
> > > nested array. So I think you'd need a context struct like:
> > > 
> > >    struct json_writer {
> > >      int first_item;
> > >      struct strbuf out;
> > >    };
> > >    #define JSON_WRITER_INIT { 1, STRBUF_INIT }
> > > 
> > > to store the state and the output. As a bonus, you could also use it to
> > > store some other sanity checks (e.g., keep a "depth" counter and BUG()
> > > when somebody tries to access the finished strbuf with a hanging-open
> > > object or array).


> > Yeah, I thought about that, but I think it gets more complex than that.
> > I'd need a stack of "first_item" values.  Or maybe the _begin() needs to
> > increment a depth and set first_item and the _end() needs to always
> > unset first_item.  I'll look at this again.


> I think you may be able to get by with just unsetting first_item for any
> "end". Because as you "pop" to whatever data structure is holding
> whatever has ended, you know it's no longer the first item (whatever
> just ended was there before it).
> 
> I admit I haven't thought too hard on it, though, so maybe I'm missing
> something.


I'll take a look.  Thanks.

> > The thing I liked about the bottom-up construction is that it is easier
> > to collect multiple sets in parallel and combine them during the final
> > roll-up.  With the in-line nesting, you're tempted to try to construct
> > the resulting JSON in a single series and that may not fit what the code
> > is trying to do.  For example, if I wanted to collect an array of error
> > messages as they are generated and an array of argv arrays and any alias
> > expansions, then put together a final JSON string containing them and
> > the final exit code, I'd need to build it in parts.  I can build these
> > parts in pieces of JSON and combine them at the end -- or build up other
> > similar data structures (string arrays, lists, or whatever) and then
> > have a JSON conversion step.  But we can make it work both ways, I just
> > wanted to keep it simpler.


> Yeah, I agree that kind of bottom-up construction would be nice for some
> cases. I'm mostly worried about inefficiency copying the strings over
> and over as we build up the final output.  Maybe that's premature
> worrying, though.
> 
> If the first_item thing isn't too painful, then it might be nice to have
> both approaches available.


True.

> > > In general I'd really prefer to keep the shell script as the driver for
> > > the tests, and have t/helper programs just be conduits. E.g., something
> > > like:
> > > 
> > >    cat >expect <<-\EOF &&
> > >    {"key": "value", "foo": 42}
> > >    EOF
> > >    test-json-writer >actual \
> > >      object_begin \
> > >      object_append_string key value \
> > >      object_append_int foo 42 \
> > >      object_end &&
> > >    test_cmp expect actual
> > > 
> > > It's a bit tedious (though fairly mechanical) to expose the API in this
> > > way, but it makes it much easier to debug, modify, or add tests later
> > > on (for example, I had to modify the C program to find out that my
> > > append example above wouldn't work).


> > Yeah, I wasn't sure if such a simple api required exposing all that
> > machinery to the shell or not.  And the api is fairly self-contained
> > and not depending on a lot of disk/repo setup or anything, so my tests
> > would be essentially static WRT everything else.
> > 
> > With my t0019 script you should have been able to use -x -v to see what
> > was failing.


> I was able to run the test-helper directly. The tricky thing is that I
> had to write new C code to test my theory about how the API worked.
> Admittedly that's not something most people would do regularly, but I
> often seem to end up doing that kind of probing and debugging. Many
> times I've found the more generic t/helper programs useful.
> 
> I also wonder, if various parts of the system embrace JSON, whether we'd
> want to have a tool for generating it as part of other tests (e.g., to
> create "expect" files).


Ok, let me see what I can come up with.

Thanks
Jeff



Re: [PATCH 0/2] routines to generate JSON data

2018-03-19 Thread Jeff King
On Sat, Mar 17, 2018 at 12:00:26AM +0100, Ævar Arnfjörð Bjarmason wrote:

> 
> On Fri, Mar 16 2018, Jeff King jotted:
> 
> > I really like the idea of being able to send our machine-readable output
> > in some "standard" syntax for which people may already have parsers. But
> > one big hangup with JSON is that it assumes all strings are UTF-8.
> 
> FWIW It's not UTF-8 but "Unicode characters", i.e. any Unicode encoding
> is valid, not that it changes anything you're pointing out, but people
> on Win32 could use UTF-16 as-is if their filenames were in that format.

But AIUI, non-UTF8 has to come as "\u" escapes, right? That at least
gives us an "out" for exotic characters, but I don't think we can just
blindly dump pathnames into quoted strings, can we?
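
For instance, one could keep the output pure ASCII with something like
this (an untested sketch; the helper name is made up, and the receiver
would still have to know to map U+0080..U+00FF back to raw bytes, which
is exactly the ad-hoc schema problem):

   /* sketch: emit an arbitrary byte string as ASCII-only JSON */
   static void append_quoted_bytes(struct strbuf *out, const char *s)
   {
       strbuf_addch(out, '"');
       for (; *s; s++) {
           unsigned char c = *s;
           if (c == '"' || c == '\\')
               strbuf_addf(out, "\\%c", c);
           else if (c < 0x20 || c >= 0x7f)
               strbuf_addf(out, "\\u%04x", c); /* byte value as code point */
           else
               strbuf_addch(out, c);
       }
       strbuf_addch(out, '"');
   }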

> > Some possible solutions I can think of:
> >
> >   1. Ignore the UTF-8 requirement, making a JSON-like output (which I
> >  think is what your patches do). I'm not sure what problems this
> >  might cause on the parsing side.
> 
> Maybe some JSON parsers are more permissive, but they'll commonly just
> die on non-Unicode (usually UTF-8) input, e.g.:
> 
> $ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') | perl -0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
> malformed UTF-8 character in JSON string, at character offset 10 (before "\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.

OK, that's about what I expected.

> >   2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
> >  to know the options here, but my understanding is that numeric
> >  escapes are just for inserting unicode code points. _Can_ you
> >  actually transport arbitrary binary data across JSON without
> >  base64-encoding it (yech)?
> 
> There's no way to transfer binary data in JSON without it being shoved
> into a UTF-8 encoding, so you'd need to know on the other side that
> such-and-such a field has binary in it, i.e. you'll need to invent your
> own schema.

Yuck. That's what I was afraid of. Is there any kind of standard scheme
here? It seems like we lose all of the benefits of JSON if the receiver
has to know whether and when to de-base64 (or whatever) our data.

> I think for git's use-case we're probably best off with JSON. It's going
> to work almost all of the time, and when it doesn't it's going to be on
> someone's weird non-UTF-8 repo, and those people are probably used to
> dealing with crap because of that anyway and can just manually decode
> their thing after it gets double-encoded.

That sounds a bit hand-wavy. While I agree that anybody using non-utf8
at this point is slightly insane, Git _does_ actually work with
arbitrary encodings in things like pathnames. It just seems kind of lame
to settle on a new universal encoding format for output that's actually
less capable than the current output.

> That sucks, but given that we'll be using this either for just ASCII
> (telemetry) or UTF-8 most of the time, and that realistically other
> formats either suck more or aren't nearly as ubiquitous...

I'd hoped to be able to output something like "git status" in JSON,
which is inherently going to deal with user paths.

-Peff


Re: [PATCH 0/2] routines to generate JSON data

2018-03-19 Thread Jeff King
On Mon, Mar 19, 2018 at 06:19:26AM -0400, Jeff Hostetler wrote:

> > To make the above work, I think you'd have to store a little more state.
> > E.g., the "array_append" functions check "out->len" to see if they need
> > to add a separating comma. That wouldn't work if we might be part of a
> > nested array. So I think you'd need a context struct like:
> > 
> >    struct json_writer {
> >      int first_item;
> >      struct strbuf out;
> >    };
> >    #define JSON_WRITER_INIT { 1, STRBUF_INIT }
> > 
> > to store the state and the output. As a bonus, you could also use it to
> > store some other sanity checks (e.g., keep a "depth" counter and BUG()
> > when somebody tries to access the finished strbuf with a hanging-open
> > object or array).
> 
> Yeah, I thought about that, but I think it gets more complex than that.
> I'd need a stack of "first_item" values.  Or maybe the _begin() needs to
> increment a depth and set first_item and the _end() needs to always
> unset first_item.  I'll look at this again.

I think you may be able to get by with just unsetting first_item for any
"end". Because as you "pop" to whatever data structure is holding
whatever has ended, you know it's no longer the first item (whatever
just ended was there before it).

I admit I haven't thought too hard on it, though, so maybe I'm missing
something.
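
Sketching it out (completely untested, and assuming the struct grows the
"depth" field I mentioned as a bonus), I mean something like:

   static void maybe_comma(struct json_writer *jw)
   {
       if (jw->first_item)
           jw->first_item = 0;
       else
           strbuf_addch(&jw->out, ',');
   }

   void jw_array_begin(struct json_writer *jw)
   {
       maybe_comma(jw);      /* this array may itself be an element */
       strbuf_addch(&jw->out, '[');
       jw->first_item = 1;   /* next append is the new array's first */
       jw->depth++;
   }

   void jw_array_append_int(struct json_writer *jw, int v)
   {
       maybe_comma(jw);
       strbuf_addf(&jw->out, "%d", v);
   }

   void jw_array_end(struct json_writer *jw)
   {
       strbuf_addch(&jw->out, ']');
       jw->first_item = 0;   /* whatever just ended was an item */
       jw->depth--;
   }

Tracing "[[42],5]" through that, each begin() starts a fresh "first"
state and each end() leaves the flag unset for the enclosing container,
so the single flag seems to be enough.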

> The thing I liked about the bottom-up construction is that it is easier
> to collect multiple sets in parallel and combine them during the final
> roll-up.  With the in-line nesting, you're tempted to try to construct
> the resulting JSON in a single series and that may not fit what the code
> is trying to do.  For example, if I wanted to collect an array of error
> messages as they are generated and an array of argv arrays and any alias
> expansions, then put together a final JSON string containing them and
> the final exit code, I'd need to build it in parts.  I can build these
> parts in pieces of JSON and combine them at the end -- or build up other
> similar data structures (string arrays, lists, or whatever) and then
> have a JSON conversion step.  But we can make it work both ways, I just
> wanted to keep it simpler.

Yeah, I agree that kind of bottom-up construction would be nice for some
cases. I'm mostly worried about inefficiency copying the strings over
and over as we build up the final output.  Maybe that's premature
worrying, though.

If the first_item thing isn't too painful, then it might be nice to have
both approaches available.

> > In general I'd really prefer to keep the shell script as the driver for
> > the tests, and have t/helper programs just be conduits. E.g., something
> > like:
> > 
> >    cat >expect <<-\EOF &&
> >    {"key": "value", "foo": 42}
> >    EOF
> >    test-json-writer >actual \
> >      object_begin \
> >      object_append_string key value \
> >      object_append_int foo 42 \
> >      object_end &&
> >    test_cmp expect actual
> > 
> > It's a bit tedious (though fairly mechanical) to expose the API in this
> > way, but it makes it much easier to debug, modify, or add tests later
> > on (for example, I had to modify the C program to find out that my
> > append example above wouldn't work).
> 
> Yeah, I wasn't sure if such a simple api required exposing all that
> machinery to the shell or not.  And the api is fairly self-contained
> and not depending on a lot of disk/repo setup or anything, so my tests
> would be essentially static WRT everything else.
> 
> With my t0019 script you should have been able to use -x -v to see what
> was failing.

I was able to run the test-helper directly. The tricky thing is that I
had to write new C code to test my theory about how the API worked.
Admittedly that's not something most people would do regularly, but I
often seem to end up doing that kind of probing and debugging. Many
times I've found the more generic t/helper programs useful.

I also wonder, if various parts of the system embrace JSON, whether we'd
want to have a tool for generating it as part of other tests (e.g., to
create "expect" files).

-Peff


Re: [PATCH 0/2] routines to generate JSON data

2018-03-19 Thread Jeff Hostetler



On 3/17/2018 3:38 AM, Jacob Keller wrote:

> On Fri, Mar 16, 2018 at 2:18 PM, Jeff King wrote:
> 
> >   3. Some other similar format. YAML comes to mind. Last time I looked
> >  (quite a while ago), it seemed insanely complex, but I think you
> >  could implement only a reasonable subset. OTOH, I think the tools
> >  ecosystem for parsing JSON (e.g., jq) is much better.
> 
> I would personally avoid YAML. It's "easier" for humans to read/parse,
> but honestly JSON is already simple enough and anyone who writes C or
> JavaScript can likely parse and hand-write JSON anyways. YAML lacks
> built-in parsers for most languages, whereas many scripting languages
> already have JSON parsing built in, or have more easily attainable
> libraries available. In contrast, the YAML libraries are much more
> complex and less likely to be available.
> 
> That's just my own experience at $dayjob though.


Agreed.  I just looked at the spec for it, and I think it would be
harder for us to be assured we are generating valid output, since
leading whitespace is significant (without a lot more inspection of
the strings being passed down to us).

Jeff



Re: [PATCH 0/2] routines to generate JSON data

2018-03-19 Thread Jeff Hostetler



On 3/16/2018 5:18 PM, Jeff King wrote:

> On Fri, Mar 16, 2018 at 07:40:55PM +0000, g...@jeffhostetler.com wrote:
> 
> [...]
> 
> I really like the idea of being able to send our machine-readable output
> in some "standard" syntax for which people may already have parsers. But
> one big hangup with JSON is that it assumes all strings are UTF-8. That
> may be OK for telemetry data, but it would probably lead to problems for
> something like status porcelain, since Git's view of paths is just a
> string of bytes (not to mention possible uses elsewhere like author
> names, subject lines, etc).
> 
> [...]

I'll come back to the UTF-8/YAML questions in a separate response.



> > Documentation for the new API is given in json-writer.h at the bottom of
> > the first patch.
> 
> The API generally looks pleasant, but the nesting surprised me.  E.g.,
> I'd have expected:
> 
>    jw_array_begin(out);
>    jw_array_begin(out);
>    jw_array_append_int(out, 42);
>    jw_array_end(out);
>    jw_array_end(out);
> 
> to result in an array containing an array containing an integer. But
> array_begin() actually resets the strbuf, so you can't build up nested
> items like this internally. Ditto for objects within objects. You have
> to use two separate strbufs and copy the data an extra time.
> 
> To make the above work, I think you'd have to store a little more state.
> E.g., the "array_append" functions check "out->len" to see if they need
> to add a separating comma. That wouldn't work if we might be part of a
> nested array. So I think you'd need a context struct like:
> 
>    struct json_writer {
>      int first_item;
>      struct strbuf out;
>    };
>    #define JSON_WRITER_INIT { 1, STRBUF_INIT }
> 
> to store the state and the output. As a bonus, you could also use it to
> store some other sanity checks (e.g., keep a "depth" counter and BUG()
> when somebody tries to access the finished strbuf with a hanging-open
> object or array).


Yeah, I thought about that, but I think it gets more complex than that.
I'd need a stack of "first_item" values.  Or maybe the _begin() needs to
increment a depth and set first_item and the _end() needs to always
unset first_item.  I'll look at this again.

The thing I liked about the bottom-up construction is that it is easier
to collect multiple sets in parallel and combine them during the final
roll-up.  With the in-line nesting, you're tempted to try to construct
the resulting JSON in a single series and that may not fit what the code
is trying to do.  For example, if I wanted to collect an array of error
messages as they are generated and an array of argv arrays and any alias
expansions, then put together a final JSON string containing them and
the final exit code, I'd need to build it in parts.  I can build these
parts in pieces of JSON and combine them at the end -- or build up other
similar data structures (string arrays, lists, or whatever) and then
have a JSON conversion step.  But we can make it work both ways, I just
wanted to keep it simpler.
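
To make that concrete, here is a rough sketch of the kind of roll-up I
have in mind (hypothetical helper; it assumes git's strbuf API and that
each collected part is already well-formed JSON):

   /* splice independently-built JSON fragments into the final report */
   static void emit_final_report(struct strbuf *out,
                                 const struct strbuf *errors, /* ["e1","e2"] */
                                 const struct strbuf *argvs,  /* [["git","st"]] */
                                 int exit_code)
   {
       strbuf_addf(out, "{\"errors\":%s,\"argv\":%s,\"exit_code\":%d}",
                   errors->buf, argvs->buf, exit_code);
   }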



> > I wasn't sure how to unit test the API from a shell script, so I added a
> > helper command that does most of the work in the second patch.
> 
> In general I'd really prefer to keep the shell script as the driver for
> the tests, and have t/helper programs just be conduits. E.g., something
> like:
> 
>    cat >expect <<-\EOF &&
>    {"key": "value", "foo": 42}
>    EOF
>    test-json-writer >actual \
>      object_begin \
>      object_append_string key value \
>      object_append_int foo 42 \
>      object_end &&
>    test_cmp expect actual
> 
> It's a bit tedious (though fairly mechanical) to expose the API in this
> way, but it makes it much easier to debug, modify, or add tests later
> on (for example, I had to modify the C program to find out that my
> append example above wouldn't work).


Yeah, I wasn't sure if such a simple api required exposing all that
machinery to the shell or not.  And the api is fairly self-contained
and not depending on a lot of disk/repo setup or anything, so my tests
would be essentially static WRT everything else.

With my t0019 script you should have been able to use -x -v to see what
was failing.



> -Peff



thanks for the quick review
Jeff


Re: [PATCH 0/2] routines to generate JSON data

2018-03-17 Thread Jacob Keller
On Fri, Mar 16, 2018 at 2:18 PM, Jeff King wrote:
>   3. Some other similar format. YAML comes to mind. Last time I looked
>  (quite a while ago), it seemed insanely complex, but I think you
>  could implement only a reasonable subset. OTOH, I think the tools
>  ecosystem for parsing JSON (e.g., jq) is much better.
>

I would personally avoid YAML. It's "easier" for humans to read/parse,
but honestly JSON is already simple enough and anyone who writes C or
JavaScript can likely parse and hand-write JSON anyways. YAML lacks
built-in parsers for most languages, whereas many scripting languages
already have JSON parsing built in, or have more easily attainable
libraries available. In contrast, the YAML libraries are much more
complex and less likely to be available.

That's just my own experience at $dayjob though.

Thanks,
Jake


Re: [PATCH 0/2] routines to generate JSON data

2018-03-16 Thread Ævar Arnfjörð Bjarmason

On Fri, Mar 16 2018, Jeff King jotted:

> I really like the idea of being able to send our machine-readable output
> in some "standard" syntax for which people may already have parsers. But
> one big hangup with JSON is that it assumes all strings are UTF-8.

FWIW It's not UTF-8 but "Unicode characters", i.e. any Unicode encoding
is valid, not that it changes anything you're pointing out, but people
on Win32 could use UTF-16 as-is if their filenames were in that format.

I'm just going to use UTF-8 synonymously with "Unicode encoding" for the
rest of this mail...

> Some possible solutions I can think of:
>
>   1. Ignore the UTF-8 requirement, making a JSON-like output (which I
>  think is what your patches do). I'm not sure what problems this
>  might cause on the parsing side.

Maybe some JSON parsers are more permissive, but they'll commonly just
die on non-Unicode (usually UTF-8) input, e.g.:

$ (echo -n '{"str ": "'; head -c 3 /dev/urandom ; echo -n '"}') | perl -0666 -MJSON::XS -wE 'say decode_json(<>)->{str}'
malformed UTF-8 character in JSON string, at character offset 10 (before "\x{fffd}e\x{fffd}"}") at -e line 1, <> chunk 1.

>   2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
>  to know the options here, but my understanding is that numeric
>  escapes are just for inserting unicode code points. _Can_ you
>  actually transport arbitrary binary data across JSON without
>  base64-encoding it (yech)?

There's no way to transfer binary data in JSON without it being shoved
into a UTF-8 encoding, so you'd need to know on the other side that
such-and-such a field has binary in it, i.e. you'll need to invent your
own schema.

E.g.:

head -c 10 /dev/urandom | perl -MDevel::Peek -MJSON::XS -wE 'my $in = <STDIN>; my $roundtrip = decode_json(encode_json({str => $in}))->{str}; utf8::decode($roundtrip) if $ARGV[0]; say Dump [$in, $roundtrip]' 0

You can tweak that trailing "0" to "1" to toggle the ad-hoc schema,
i.e. after we decode the JSON we go and manually UTF-8 decode it to get
back at the same binary data, otherwise we end up with an UTF-8 escaped
version of what we put in.

>   3. Some other similar format. YAML comes to mind. Last time I looked
>  (quite a while ago), it seemed insanely complex, but I think you
>  could implement only a reasonable subset. OTOH, I think the tools
>  ecosystem for parsing JSON (e.g., jq) is much better.

The lack of fast schema-less formats that supported arrays, hashes
etc. and didn't suck when it came to mixed binary/UTF-8 led us to
implementing our own at work: https://github.com/Sereal/Sereal

I think for git's use-case we're probably best off with JSON. It's going
to work almost all of the time, and when it doesn't it's going to be on
someone's weird non-UTF-8 repo, and those people are probably used to
dealing with crap because of that anyway and can just manually decode
their thing after it gets double-encoded.

That sucks, but given that we'll be using this either for just ASCII
(telemetry) or UTF-8 most of the time, and that realistically other
formats either suck more or aren't nearly as ubiquitous...


Re: [PATCH 0/2] routines to generate JSON data

2018-03-16 Thread Jeff King
On Fri, Mar 16, 2018 at 07:40:55PM +0000, g...@jeffhostetler.com wrote:

> This patch series adds a set of utility routines to compose data in JSON
> format into a "struct strbuf".  The resulting string can then be output
> by commands wanting to support a JSON output format.
> 
> This is a stand-alone patch.  Nothing currently uses these routines.  I'm
> currently working on a series to log "telemetry" data (as we discussed
> briefly during Ævar's "Performance Misc" session [1] in Barcelona last
> week).  And I want to emit the data in JSON rather than a fixed
> column/field format.  The JSON routines here are independent of that, so
> it made sense to submit the JSON part by itself.
> 
> Back when we added porcelain=v2 format to status, we talked about adding a
> JSON format.  I think the routines in this patch would let us easily do
> that, if someone were interested.  (Extending status is not on my radar
> right now, however.)

I really like the idea of being able to send our machine-readable output
in some "standard" syntax for which people may already have parsers. But
one big hangup with JSON is that it assumes all strings are UTF-8. That
may be OK for telemetry data, but it would probably lead to problems for
something like status porcelain, since Git's view of paths is just a
string of bytes (not to mention possible uses elsewhere like author
names, subject lines, etc).

Before we commit to a standardized format, I think we need to work out
a solution there (because I'd much rather not go down this road for
telemetry data only to find that we cannot use the same standardized
format in other parts of Git).

Some possible solutions I can think of:

  1. Ignore the UTF-8 requirement, making a JSON-like output (which I
 think is what your patches do). I'm not sure what problems this
 might cause on the parsing side.

  2. Specially encode non-UTF-8 bits. I'm not familiar enough with JSON
 to know the options here, but my understanding is that numeric
 escapes are just for inserting unicode code points. _Can_ you
 actually transport arbitrary binary data across JSON without
 base64-encoding it (yech)?

  3. Some other similar format. YAML comes to mind. Last time I looked
 (quite a while ago), it seemed insanely complex, but I think you
 could implement only a reasonable subset. OTOH, I think the tools
 ecosystem for parsing JSON (e.g., jq) is much better.

> Documentation for the new API is given in json-writer.h at the bottom of
> the first patch.

The API generally looks pleasant, but the nesting surprised me.  E.g.,
I'd have expected:

  jw_array_begin(out);
  jw_array_begin(out);
  jw_array_append_int(out, 42);
  jw_array_end(out);
  jw_array_end(out);

to result in an array containing an array containing an integer. But
array_begin() actually resets the strbuf, so you can't build up nested
items like this internally. Ditto for objects within objects. You have
to use two separate strbufs and copy the data an extra time.

To make the above work, I think you'd have to store a little more state.
E.g., the "array_append" functions check "out->len" to see if they need
to add a separating comma. That wouldn't work if we might be part of a
nested array. So I think you'd need a context struct like:

  struct json_writer {
    int first_item;
    struct strbuf out;
  };
  #define JSON_WRITER_INIT { 1, STRBUF_INIT }

to store the state and the output. As a bonus, you could also use it to
store some other sanity checks (e.g., keep a "depth" counter and BUG()
when somebody tries to access the finished strbuf with a hanging-open
object or array).
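
For example (again just a sketch -- the jw_finish() name is made up):

   const char *jw_finish(struct json_writer *jw)
   {
       if (jw->depth != 0)
           BUG("json_writer: unclosed object or array (depth %d)",
               jw->depth);
       return jw->out.buf;
   }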

> I wasn't sure how to unit test the API from a shell script, so I added a
> helper command that does most of the work in the second patch.

In general I'd really prefer to keep the shell script as the driver for
the tests, and have t/helper programs just be conduits. E.g., something
like:

  cat >expect <<-\EOF &&
  {"key": "value", "foo": 42}
  EOF
  test-json-writer >actual \
    object_begin \
    object_append_string key value \
    object_append_int foo 42 \
    object_end &&
  test_cmp expect actual

It's a bit tedious (though fairly mechanical) to expose the API in this
way, but it makes it much easier to debug, modify, or add tests later
on (for example, I had to modify the C program to find out that my
append example above wouldn't work).
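
The helper's main loop can stay almost trivial. Something like this
sketch would do (the jw_object_*() names are guesses to match the fake
test above, and argument-count checking is omitted for brevity):

   int cmd_main(int argc, const char **argv)
   {
       struct json_writer jw = JSON_WRITER_INIT;
       int i;

       for (i = 1; i < argc; i++) {
           if (!strcmp(argv[i], "object_begin"))
               jw_object_begin(&jw);
           else if (!strcmp(argv[i], "object_append_string")) {
               jw_object_append_string(&jw, argv[i + 1], argv[i + 2]);
               i += 2;
           } else if (!strcmp(argv[i], "object_append_int")) {
               jw_object_append_int(&jw, argv[i + 1], atoi(argv[i + 2]));
               i += 2;
           } else if (!strcmp(argv[i], "object_end"))
               jw_object_end(&jw);
           else
               die("unknown command: %s", argv[i]);
       }
       printf("%s\n", jw.out.buf);
       return 0;
   }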

-Peff


[PATCH 0/2] routines to generate JSON data

2018-03-16 Thread git
From: Jeff Hostetler 

This patch series adds a set of utility routines to compose data in JSON
format into a "struct strbuf".  The resulting string can then be output
by commands wanting to support a JSON output format.

This is a stand-alone patch.  Nothing currently uses these routines.  I'm
currently working on a series to log "telemetry" data (as we discussed
briefly during Ævar's "Performance Misc" session [1] in Barcelona last
week).  And I want to emit the data in JSON rather than a fixed
column/field format.  The JSON routines here are independent of that, so
it made sense to submit the JSON part by itself.

Back when we added porcelain=v2 format to status, we talked about adding a
JSON format.  I think the routines in this patch would let us easily do
that, if someone were interested.  (Extending status is not on my radar
right now, however.)

Documentation for the new API is given in json-writer.h at the bottom of
the first patch.
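
To give a flavor of the intended usage, composing {"key":"value","foo":42}
into a strbuf might look something like this (an illustrative sketch --
see json-writer.h for the exact names and signatures):

   struct strbuf out = STRBUF_INIT;

   jw_object_begin(&out);
   jw_object_append_string(&out, "key", "value");
   jw_object_append_int(&out, "foo", 42);
   jw_object_end(&out);
   printf("%s\n", out.buf);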

I wasn't sure how to unit test the API from a shell script, so I added a
helper command that does most of the work in the second patch.

[1] https://public-inbox.org/git/20180313004940.gg61...@google.com/T/


Jeff Hostetler (2):
  json_writer: new routines to create data in JSON format
  json-writer: unit test

 Makefile                    |   2 +
 json-writer.c               | 224 +++++++++++++++++++++++++++++++++++++++++
 json-writer.h               | 120 ++++++++++++++++++++
 t/helper/test-json-writer.c | 146 ++++++++++++++++++++++++
 t/t0019-json-writer.sh      |  10 ++
 5 files changed, 502 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

-- 
2.9.3