Re: [PATCH v2] routines to generate JSON data
On Wed, Mar 21 2018, g...@jeffhostetler.com wrote:

> So, I'm not sure we have a route to get UTF-8-clean data out of Git, and if
> we do it is beyond the scope of this patch series.
>
> So I think for our uses here, defining this as "JSON-like" is probably the
> best answer. We write the strings as we received them (from the file system,
> the index, or whatever). These strings are properly escaped WRT double
> quotes, backslashes, and control characters, so we shouldn't have an issue
> with decoders getting out of sync -- only with them rejecting non-UTF-8
> sequences.
>
> We could blindly \u encode each of the hi-bit characters, if that would
> help the parsers, but I don't want to do that right now.
>
> WRT binary data, I had not intended using this for binary data. And without
> knowing what kinds or quantity of binary data we might use it for, I'd like
> to ignore this for now.

I agree we should just ignore this problem for now given the immediate use-case.
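For readers following along, here is a minimal sketch of the escaping rule described in the quoted text: double quotes, backslashes, and control characters are escaped, while hi-bit bytes are copied through untouched. This is not the code from the patch. The name sketch_quote() and the fixed caller-supplied buffer are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): write `in` to `out` as a
 * quoted "JSON-like" string. Double quotes, backslashes, and control
 * characters are escaped; bytes >= 0x80 are copied as received, so the
 * result is only valid JSON if the input was already valid UTF-8.
 * Returns the number of bytes written, not counting the trailing NUL.
 */
static size_t sketch_quote(const char *in, char *out, size_t cap)
{
	const unsigned char *p;
	size_t n = 0;

	assert(cap >= 3); /* room for two quotes and a NUL */
	out[n++] = '"';
	for (p = (const unsigned char *)in; *p; p++) {
		/* worst case: 6 bytes for "\u00XX" + closing quote + NUL */
		assert(n + 8 <= cap);
		if (*p == '"' || *p == '\\') {
			out[n++] = '\\';
			out[n++] = (char)*p;
		} else if (*p == '\n') {
			out[n++] = '\\';
			out[n++] = 'n';
		} else if (*p == '\t') {
			out[n++] = '\\';
			out[n++] = 't';
		} else if (*p < 0x20) {
			/* remaining control characters use the generic form */
			n += (size_t)snprintf(out + n, cap - n, "\\u%04x",
					      (unsigned)*p);
		} else {
			/* printable ASCII and hi-bit bytes pass through as-is */
			out[n++] = (char)*p;
		}
	}
	out[n++] = '"';
	out[n] = '\0';
	return n;
}
```

Because the structural characters are always escaped, a decoder cannot lose track of where a string ends. The only failure mode left is a strict parser rejecting a byte sequence that is not valid UTF-8, which is the trade-off accepted above.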
Re: [PATCH v2] routines to generate JSON data
On Wed, Mar 21, 2018 at 07:28:26PM +, g...@jeffhostetler.com wrote:

> It includes a new "struct json_writer" which is used to guide the
> accumulation of JSON data -- knowing whether an object or array is
> currently being composed. This allows error checking during construction.
>
> It also allows construction of nested structures using an inline model (in
> addition to the original bottom-up composition).
>
> The test helper has been updated to include both the original unit tests and
> a new scripting API to allow individual tests to be written directly in our
> t/t*.sh shell scripts.

Thanks for all of this. The changes look quite sensible to me (I do
still suspect we could do the "first_item" thing without having to
allocate, but I really like the assertions you were able to put in).

> So I think for our uses here, defining this as "JSON-like" is probably the
> best answer. We write the strings as we received them (from the file system,
> the index, or whatever). These strings are properly escaped WRT double
> quotes, backslashes, and control characters, so we shouldn't have an issue
> with decoders getting out of sync -- only with them rejecting non-UTF-8
> sequences.

Yeah, I think I've come to the same conclusion. My main goal in raising
it now was to see if there was some other format we might use before we
go too far down the JSON road. But as far as I can tell there really
isn't another good option.

> WRT binary data, I had not intended using this for binary data. And without
> knowing what kinds or quantity of binary data we might use it for, I'd like
> to ignore this for now.

Yeah, I don't have any plans here either. I was thinking more about
things like author names and file paths.

-Peff
[PATCH v2] routines to generate JSON data
From: Jeff Hostetler

This is version 2 of my JSON data format routines. This version addresses
the non-utf8 questions raised on V1.

It includes a new "struct json_writer" which is used to guide the
accumulation of JSON data -- knowing whether an object or array is
currently being composed. This allows error checking during construction.

It also allows construction of nested structures using an inline model (in
addition to the original bottom-up composition).

The test helper has been updated to include both the original unit tests and
a new scripting API to allow individual tests to be written directly in our
t/t*.sh shell scripts.

TODO

I still don't know what to do about the Unicode/UTF-8 questions that were
raised WRT strings. Pathnames on Linux can be any sequence of 8bit
characters -- this is likely to be UTF-8 on modern systems. Pathnames on
Windows are UCS2/UTF-16 in the filesystem and we always convert to/from
UTF-8 when moving between git data structures and IO calls. There are a few
other fields (like author name) that we may want to log which may or may not
be UTF-8, but that is beyond our control. Even localized error messages may
be problematic if they include other fields.

So, I'm not sure we have a route to get UTF-8-clean data out of Git, and if
we do it is beyond the scope of this patch series.

So I think for our uses here, defining this as "JSON-like" is probably the
best answer. We write the strings as we received them (from the file system,
the index, or whatever). These strings are properly escaped WRT double
quotes, backslashes, and control characters, so we shouldn't have an issue
with decoders getting out of sync -- only with them rejecting non-UTF-8
sequences.

We could blindly \u encode each of the hi-bit characters, if that would
help the parsers, but I don't want to do that right now.

WRT binary data, I had not intended using this for binary data. And without
knowing what kinds or quantity of binary data we might use it for, I'd like
to ignore this for now.

Jeff Hostetler (1):
  json_writer: new routines to create data in JSON format

 Makefile                    |   2 +
 json-writer.c               | 321 +
 json-writer.h               |  86 +
 t/helper/test-json-writer.c | 420
 t/t0019-json-writer.sh      | 102 +++
 5 files changed, 931 insertions(+)
 create mode 100644 json-writer.c
 create mode 100644 json-writer.h
 create mode 100644 t/helper/test-json-writer.c
 create mode 100755 t/t0019-json-writer.sh

--
2.9.3
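As a rough illustration of the idea behind "struct json_writer" (tracking whether an object or an array is currently open so that misuse can be caught while the data is being built), here is a small self-contained sketch. It is not the code in json-writer.c. Every name in it (sk_writer, sk_begin, sk_int, sk_end), the fixed-size buffer, and the fixed nesting limit are assumptions made for the example, and keys are written without escaping to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SK_MAX_DEPTH 16
#define SK_BUF_SIZE 256

enum sk_kind { SK_OBJECT, SK_ARRAY };

/* Hypothetical writer: output buffer plus one state slot per open level. */
struct sk_writer {
	char buf[SK_BUF_SIZE];
	size_t len;
	enum sk_kind kind[SK_MAX_DEPTH]; /* what is open at each level */
	int first_item[SK_MAX_DEPTH];    /* no comma before the first item */
	int depth;
};

static void sk_init(struct sk_writer *w)
{
	memset(w, 0, sizeof(*w));
}

static void sk_append(struct sk_writer *w, const char *s)
{
	size_t n = strlen(s);

	assert(w->len + n < SK_BUF_SIZE);
	memcpy(w->buf + w->len, s, n + 1);
	w->len += n;
}

/* Check the key/no-key rule for the current level, then emit "," and key. */
static void sk_prefix(struct sk_writer *w, const char *key)
{
	if (w->depth) {
		int in_object = w->kind[w->depth - 1] == SK_OBJECT;

		/* keys are required inside objects, forbidden inside arrays */
		assert(in_object == (key != NULL));
		if (w->first_item[w->depth - 1])
			w->first_item[w->depth - 1] = 0;
		else
			sk_append(w, ",");
	} else {
		/* exactly one unnamed value is allowed at the top level */
		assert(!key && !w->len);
	}
	if (key) {
		sk_append(w, "\"");
		sk_append(w, key); /* sketch only: key is not escaped */
		sk_append(w, "\":");
	}
}

/* Open a nested object or array in place (the "inline" style). */
static void sk_begin(struct sk_writer *w, enum sk_kind kind, const char *key)
{
	sk_prefix(w, key);
	assert(w->depth < SK_MAX_DEPTH);
	sk_append(w, kind == SK_OBJECT ? "{" : "[");
	w->kind[w->depth] = kind;
	w->first_item[w->depth] = 1;
	w->depth++;
}

static void sk_int(struct sk_writer *w, const char *key, long value)
{
	char tmp[32];

	sk_prefix(w, key);
	snprintf(tmp, sizeof(tmp), "%ld", value);
	sk_append(w, tmp);
}

/* Close whatever is currently open; the writer knows which bracket to use. */
static void sk_end(struct sk_writer *w)
{
	assert(w->depth > 0);
	w->depth--;
	sk_append(w, w->kind[w->depth] == SK_OBJECT ? "}" : "]");
}
```

With this sketch, calling sk_begin(&w, SK_OBJECT, NULL), sk_int(&w, "a", 1), sk_begin(&w, SK_ARRAY, "b"), sk_int(&w, NULL, 2), sk_int(&w, NULL, 3) and then sk_end(&w) twice leaves {"a":1,"b":[2,3]} in w.buf. Passing a key while an array is open, or omitting one while an object is open, trips an assertion instead of producing malformed output. The per-level arrays are one way to keep the "first item" state without allocating, at the cost of a hard nesting limit.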