I'll have to read the whole thing, but are pure JSON parsers really the go-to for most people? I'm a big advocate of also providing something similar to XPath/XQuery, and that's IMHO JSONiq (90% XQuery). I might be biased, of course, as I'm working on Brackit[1] in my spare time (which is also a query compiler, intended to be used with proven optimizations by document stores / JSON stores, but which can also be used as an in-memory query engine).
kind regards
Johannes

[1] https://github.com/sirixdb/brackit

On Thu, Dec 15, 2022 at 23:03, Reinier Zwitserloot <rein...@zwitserloot.com> wrote:

> A recent Advent-of-Code puzzle also made me double-check the support of JSON in the java core libs, and it is indeed a curious situation that the java core libs don’t cater to it particularly well.
>
> However, I’m not seeing an easy way forward to try to close this hole in the core library offerings.
>
> If you need to stream huge swaths of JSON, generally there’s a clear unit size that you can just databind. Something like:
>
>     String jsonStr = """
>         {
>           "version": 5,
>           "data": [
>             -- 1 million relatively small records in this list --
>           ]
>         }
>         """;
>
> The usual swath of JSON parsers tend to support this (giving you a stream of java instances created by databinding those small records one by one), or if not, the best move forward is presumably to file a pull request with those projects.
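>
> For instance, a sketch of how that streaming-databind flow can look with Jackson today (SmallRecord is a made-up stand-in for the element type; recent Jackson versions can bind records):
>
>     // imports: com.fasterxml.jackson.core.JsonParser, com.fasterxml.jackson.core.JsonToken,
>     //          com.fasterxml.jackson.databind.MappingIterator, com.fasterxml.jackson.databind.ObjectMapper
>
>     record SmallRecord(String id, int value) {} // stand-in element type
>
>     static void processAll(String jsonStr) throws Exception {
>         ObjectMapper mapper = new ObjectMapper();
>         try (JsonParser p = mapper.getFactory().createParser(jsonStr)) {
>             // skip ahead to the "data" field; fine here since we know the structure
>             while (p.nextToken() != null) {
>                 if (p.currentToken() == JsonToken.FIELD_NAME
>                         && "data".equals(p.getCurrentName())) {
>                     break;
>                 }
>             }
>             p.nextToken(); // now positioned at START_ARRAY
>             MappingIterator<SmallRecord> it =
>                     mapper.readerFor(SmallRecord.class).readValues(p);
>             while (it.hasNext()) {
>                 SmallRecord rec = it.next(); // one databound record at a time
>                 // ... process rec; the full list is never materialized
>             }
>         }
>     }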
>
> The java.util.log experiment shows that trying to ‘core-librarize’ needs that the community at large already fulfills with third party deps isn’t a good move, especially if the core library variant tries to oversimplify to avoid the trap of being too opinionated (which core libs shouldn’t be). In other words, the need for ’stream this JSON for me’ style APIs is even more exotic than Ethan is suggesting.
>
> I see a fundamental problem here:
>
> - The 95%+ use case for working with JSON for your average java coder is best done with data binding.
> - core libs doesn’t want to provide it, partly because it’s got a large design space, partly because the field’s already covered by GSON and Jackson-json; java.util.log proves this doesn’t work. At least, I gather that’s what Ethan thinks, and I agree with this assessment.
> - A language that claims to be “batteries included” but doesn’t ship with a JSON parser in this era is dubious, to say the least.
>
> I’m not sure how to square this circle. Hence it feels like core-libs needs to hold some more fundamental debates first:
>
> - Maybe it’s time to state in a more or less official decree that well-established, large-design-space jobs will remain the purview of dependencies no matter how popular the need is, unless being part of the core-libs adds something more fundamental that third party deps cannot bring to the table (such as language integration), or the community standardizes on a single library (JSR310’s story, more or less). JSON parsing would qualify as ‘well-established’ (GSON and Jackson) and ‘large design space’, as Ethan pointed out.
> - Given that 99% of java projects, even really simple ones, start with maven/gradle and a list of deps, is that really a problem?
>
> I’m honestly not sure what the right answer is. On one hand, the npm ecosystem seems to be doing very well even though their ‘batteries included’ situation is an utter shambles. Then again, the notion that your average nodejs project includes 10x+ more dependencies than other languages is likely a significant part of the security clown fiesta going on over there as far as 3rd party deps is concerned, so by no means should java just blindly emulate their solutions.
>
> I don’t like the idea of shipping a non-data-binding JSON API in the core libs. The root issue with JSON is that you just can’t tell how to interpret any given JSON token, because that’s not how JSON is used in practice. What does 5 mean? Could be that I’m to take that as an int, or as a double, or perhaps even as a j.t.Instant (epoch-millis). And defaulting behaviour (similar to j.u.Map’s .getOrDefault) is *very* convenient for parsing most JSON out there in the real world - omitting k/v pairs whose value is still on default is very common. That’s what makes those databind libraries so enticing: instead of trying to pattern match my way into this behaviour:
>
> - If the element isn’t there at all or is null, give me a list-of-longs with a single 0 in it.
> - If the element is a number, make me a list-of-longs with 1 value in it, that is that number, as long.
> - If the element is a string, parse it into a long, then get me a list with this one long value (because IEEE double rules mean sometimes you have to put these things in string form or they get mangled by javascript-eval style parsers).
>
> And yet the above is quite common, and can easily be done by a databinder, which sees you want a List<Long> for a field whose default value is List.of(1L), and, armed with that knowledge, can transit the JSON into java in that way.
>
> You don’t *need* databinding to cater to this idea: you could for example have a jsonNode.asLong(123) method that would parse a string if need be, even. But this has nothing to do with pattern matching either.
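>
> Concretely, such a lenient accessor could look something like this sketch of the three rules above (JsonNode here is a hypothetical tree type with isNull/isNumber/isString accessors, not any particular library’s):
>
>     // absent or JSON null -> the default; number -> one long;
>     // string -> parse the long out (surviving javascript-eval mangling)
>     static List<Long> longsOrDefault(JsonNode node, List<Long> defaultValue) {
>         if (node == null || node.isNull()) {
>             return defaultValue;
>         }
>         if (node.isNumber()) {
>             return List.of(node.longValue());
>         }
>         if (node.isString()) {
>             return List.of(Long.parseLong(node.stringValue()));
>         }
>         throw new IllegalArgumentException("expected a number or a string");
>     }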
>
> --Reinier Zwitserloot
>
>
> On 15 Dec 2022 at 21:30:17, Ethan McCue <et...@mccue.dev> wrote:
>
>> I'm writing this to drive some forward motion and to nerd-snipe those who know better than I do into putting their thoughts into words.
>>
>> There are three ways to process JSON[1]:
>>
>> - Streaming (Push or Pull)
>> - Traversing a Tree (Realized or Lazy)
>> - Declarative Databind (N ways)
>>
>> Of these, JEP-198 explicitly ruled out providing "JAXB style type safe data binding."
>>
>> No justification is given, but if I had to insert my own: mapping the Json model to/from the Java/JVM object model is a cursed combo of
>>
>> - Huge possible design space
>> - Unpalatably large surface for backwards compatibility
>> - Serialization! Boo![2]
>>
>> So for an artifact like the JDK, it probably doesn't make sense to include. That tracks.
>> It won't make everyone happy, people like databind APIs, but it tracks.
>>
>> So for the "read flow" these are the things to figure out:
>>
>>                 | Should Provide? | Intended User(s) |
>> ----------------+-----------------+------------------+
>> Streaming Push  |                 |                  |
>> ----------------+-----------------+------------------+
>> Streaming Pull  |                 |                  |
>> ----------------+-----------------+------------------+
>> Realized Tree   |                 |                  |
>> ----------------+-----------------+------------------+
>> Lazy Tree       |                 |                  |
>> ----------------+-----------------+------------------+
>>
>> At which point, we should talk about what "meets the needs of Java developers using JSON" implies.
>>
>> JSON is ubiquitous. Most kinds of software us schmucks write could have a reason to interact with it. The full set of "user personas" therefore isn't practical for me to talk about.[3]
>>
>> JSON documents, however, are not so varied.
>>
>> - There are small ones (1-10kb)
>> - There are medium ones (10-1000kb)
>> - There are big ones (1000kb-???)
>>
>> - There are shallow ones
>> - There are deep ones
>>
>> So that feels like an easier direction to talk about it from.
>>
>> This repo[4] has some convenient toy examples of how some of those APIs look in libraries in the ecosystem. Specifically the Streaming Pull and Realized Tree models.
>>
>> User r = new User();
>> while (true) {
>>     JsonToken token = reader.peek();
>>     switch (token) {
>>         case BEGIN_OBJECT:
>>             reader.beginObject();
>>             break;
>>         case END_OBJECT:
>>             reader.endObject();
>>             return r;
>>         case NAME:
>>             String fieldname = reader.nextName();
>>             switch (fieldname) {
>>                 case "id":
>>                     r.setId(reader.nextString());
>>                     break;
>>                 case "index":
>>                     r.setIndex(reader.nextInt());
>>                     break;
>>                 ...
>>                 case "friends":
>>                     r.setFriends(new ArrayList<>());
>>                     Friend f = null;
>>                     carryOn = true;
>>                     while (carryOn) {
>>                         token = reader.peek();
>>                         switch (token) {
>>                             case BEGIN_ARRAY:
>>                                 reader.beginArray();
>>                                 break;
>>                             case END_ARRAY:
>>                                 reader.endArray();
>>                                 carryOn = false;
>>                                 break;
>>                             case BEGIN_OBJECT:
>>                                 reader.beginObject();
>>                                 f = new Friend();
>>                                 break;
>>                             case END_OBJECT:
>>                                 reader.endObject();
>>                                 r.getFriends().add(f);
>>                                 break;
>>                             case NAME:
>>                                 String fn = reader.nextName();
>>                                 switch (fn) {
>>                                     case "id":
>>                                         f.setId(reader.nextString());
>>                                         break;
>>                                     case "name":
>>                                         f.setName(reader.nextString());
>>                                         break;
>>                                 }
>>                                 break;
>>                         }
>>                     }
>>                     break;
>>             }
>>     }
>> }
>>
>> I think it's not hard to argue that the streaming APIs are brutalist. The above is Gson, but Jackson, moshi, etc. seem at least morally equivalent.
>>
>> It's hard to write, hard to write *correctly*, and there is a curious propensity towards pairing it with anemic, mutable models.
>>
>> That being said, it handles big documents and deep documents really well. It also performs pretty darn well and is good enough as a "fallback" when the intended user experience is through something like databind.
>>
>> So what could we do meaningfully better with the language we have today/will have tomorrow?
>>
>> - Sealed interfaces + Pattern matching could give a nicer model for tokens
>>
>> sealed interface JsonToken {
>>     record Field(String name) implements JsonToken {}
>>     record BeginArray() implements JsonToken {}
>>     record EndArray() implements JsonToken {}
>>     record BeginObject() implements JsonToken {}
>>     record EndObject() implements JsonToken {}
>>     // ...
>> }
>>
>> // ...
>>
>> User r = new User();
>> while (true) {
>>     JsonToken token = reader.peek();
>>     switch (token) {
>>         case BeginObject __:
>>             reader.beginObject();
>>             break;
>>         case EndObject __:
>>             reader.endObject();
>>             return r;
>>         case Field("id"):
>>             r.setId(reader.nextString());
>>             break;
>>         case Field("index"):
>>             r.setIndex(reader.nextInt());
>>             break;
>>
>>         // ...
>>
>>         case Field("friends"):
>>             r.setFriends(new ArrayList<>());
>>             Friend f = null;
>>             carryOn = true;
>>             while (carryOn) {
>>                 token = reader.peek();
>>                 switch (token) {
>>                     // ...
>>
>> - Value classes can make it all more efficient
>>
>> sealed interface JsonToken {
>>     value record Field(String name) implements JsonToken {}
>>     value record BeginArray() implements JsonToken {}
>>     value record EndArray() implements JsonToken {}
>>     value record BeginObject() implements JsonToken {}
>>     value record EndObject() implements JsonToken {}
>>     // ...
>> }
>>
>> - (Fun One) We can transform a simpler-to-write push parser into a pull parser with Coroutines
>>
>> This is just a toy we could play with while making something in the JDK. I'm pretty sure we could make a parser which feeds into something like
>>
>> interface Listener {
>>     void onObjectStart();
>>     void onObjectEnd();
>>     void onArrayStart();
>>     void onArrayEnd();
>>     void onField(String name);
>>     // ...
>> }
>>
>> and invert a loop like
>>
>> while (true) {
>>     char c = next();
>>     switch (c) {
>>         case '{':
>>             listener.onObjectStart();
>>             // ...
>>         // ...
>>     }
>> }
>>
>> by putting a Coroutine.yield in the callback.
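>>
>> For a taste, here's a toy version of that inversion, with a virtual thread (still in preview as of this writing) and a SynchronousQueue standing in for Coroutine.yield. End-of-input signalling is elided, and it assumes just the JsonToken and Listener shapes above:
>>
>> // imports: java.util.concurrent.SynchronousQueue, java.util.function.Consumer
>> final class PullAdapter {
>>     private final SynchronousQueue<JsonToken> tokens = new SynchronousQueue<>();
>>
>>     PullAdapter(Consumer<Listener> pushParser) {
>>         // run the push parser on its own (virtual) thread...
>>         Thread.startVirtualThread(() -> pushParser.accept(new Listener() {
>>             @Override public void onObjectStart()      { yield_(new JsonToken.BeginObject()); }
>>             @Override public void onObjectEnd()        { yield_(new JsonToken.EndObject()); }
>>             @Override public void onArrayStart()       { yield_(new JsonToken.BeginArray()); }
>>             @Override public void onArrayEnd()         { yield_(new JsonToken.EndArray()); }
>>             @Override public void onField(String name) { yield_(new JsonToken.Field(name)); }
>>         }));
>>     }
>>
>>     // ...and park it on each callback until the consumer takes the token
>>     private void yield_(JsonToken token) {
>>         try {
>>             tokens.put(token);
>>         } catch (InterruptedException e) {
>>             Thread.currentThread().interrupt();
>>             throw new RuntimeException(e);
>>         }
>>     }
>>
>>     JsonToken next() throws InterruptedException {
>>         return tokens.take(); // the pull side
>>     }
>> }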
>>
>> That might be a meaningful simplification in code structure, I don't know enough to say.
>>
>> But, I think there are some hard questions like
>>
>> - Is the intent[5] to make the backing parser for ecosystem databind APIs?
>> - Is the intent that users who want to handle big/deep documents fall back to this?
>> - Are those new language features / conveniences enough to offset the cost of committing to a new API?
>> - To whom exactly does a low level API provide value?
>> - What benefit is standardization in the JDK?
>>
>> and just generally - who would be the consumer(s) of this?
>>
>> The other kind of API still on the table is a Tree. There are two ways to handle this:
>>
>> 1. Load it into `Object`. Use a bunch of instanceof checks/casts to confirm what it actually is.
>>
>> Object v;
>> User u = new User();
>>
>> if ((v = jso.get("id")) != null) {
>>     u.setId((String) v);
>> }
>> if ((v = jso.get("index")) != null) {
>>     u.setIndex(((Long) v).intValue());
>> }
>> if ((v = jso.get("guid")) != null) {
>>     u.setGuid((String) v);
>> }
>> if ((v = jso.get("isActive")) != null) {
>>     u.setIsActive((Boolean) v);
>> }
>> if ((v = jso.get("balance")) != null) {
>>     u.setBalance((String) v);
>> }
>> // ...
>> if ((v = jso.get("latitude")) != null) {
>>     u.setLatitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
>> }
>> if ((v = jso.get("longitude")) != null) {
>>     u.setLongitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
>> }
>> if ((v = jso.get("greeting")) != null) {
>>     u.setGreeting((String) v);
>> }
>> if ((v = jso.get("favoriteFruit")) != null) {
>>     u.setFavoriteFruit((String) v);
>> }
>> if ((v = jso.get("tags")) != null) {
>>     List<Object> jsonarr = (List<Object>) v;
>>     u.setTags(new ArrayList<>());
>>     for (Object vi : jsonarr) {
>>         u.getTags().add((String) vi);
>>     }
>> }
>> if ((v = jso.get("friends")) != null) {
>>     List<Object> jsonarr = (List<Object>) v;
>>     u.setFriends(new ArrayList<>());
>>     for (Object vi : jsonarr) {
>>         Map<String, Object> jso0 = (Map<String, Object>) vi;
>>         Friend f = new Friend();
>>         f.setId((String) jso0.get("id"));
>>         f.setName((String) jso0.get("name"));
>>         u.getFriends().add(f);
>>     }
>> }
>>
>> 2. Have an explicit model for Json, and helper methods that do said casts[6]
>>
>> this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
>> JsonArray groups = jsonObject.getJsonArray("group");
>> if (groups != null) {
>>     int len = groups.size();
>>     for (int i = 0; i < len; i++) {
>>         JsonObject grp = groups.getJsonObject(i);
>>         SNMPSetting grpSetting = readFromJson(grp);
>>         String grpName = grp.getString("dbgroup", null);
>>         if (grpName != null && grpSetting != null)
>>             this.groupSettings.put(grpName, grpSetting);
>>     }
>> }
>> JsonArray hosts = jsonObject.getJsonArray("host");
>> if (hosts != null) {
>>     int len = hosts.size();
>>     for (int i = 0; i < len; i++) {
>>         JsonObject host = hosts.getJsonObject(i);
>>         SNMPSetting hostSetting = readFromJson(host);
>>         String hostName = host.getString("dbhost", null);
>>         if (hostName != null && hostSetting != null)
>>             this.hostSettings.put(hostName, hostSetting);
>>     }
>> }
>>
>> I think what has become easier to represent in the language nowadays is that explicit model for Json. It's the 101 lesson of sealed interfaces.[7] It feels nice and clean.
>>
>> sealed interface Json {
>>     final class Null implements Json {}
>>     final class True implements Json {}
>>     final class False implements Json {}
>>     final class Array implements Json {}
>>     final class Object implements Json {}
>>     final class String implements Json {}
>>     final class Number implements Json {}
>> }
>>
>> And the cast-and-check approach is now more viable on account of pattern matching.
>>
>> if (jso.get("id") instanceof String v) {
>>     u.setId(v);
>> }
>> if (jso.get("index") instanceof Long v) {
>>     u.setIndex(v.intValue());
>> }
>> if (jso.get("guid") instanceof String v) {
>>     u.setGuid(v);
>> }
>>
>> // or
>>
>> if (jso.get("id") instanceof String id &&
>>     jso.get("index") instanceof Long index &&
>>     jso.get("guid") instanceof String guid) {
>>     return new User(id, index, guid, ...); // look ma, no setters!
>> }
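>>
>> With switch pattern matching, the sealed model also buys exhaustiveness checking. As a sketch - assuming the skeleton above gets fleshed out into records with value()/values()/members() accessors (made-up names):
>>
>> static String describe(Json json) {
>>     return switch (json) {
>>         case Json.Null ignored  -> "null";
>>         case Json.True ignored  -> "true";
>>         case Json.False ignored -> "false";
>>         case Json.String s      -> "string: " + s.value();
>>         case Json.Number n      -> "number: " + n.value();
>>         case Json.Array a       -> "array with " + a.values().size() + " elements";
>>         case Json.Object o      -> "object with " + o.members().size() + " fields";
>>         // no default: Json is sealed, so the compiler checks exhaustiveness
>>     };
>> }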
>>
>> And on the horizon, again, is value types.
>>
>> But there are problems with this approach beyond the performance implications of loading into a tree.
>>
>> For one, all the code samples above have different behaviors around null keys and missing keys that are not obvious at first glance.
>>
>> This won't accept any null or missing fields:
>>
>> if (jso.get("id") instanceof String id &&
>>     jso.get("index") instanceof Long index &&
>>     jso.get("guid") instanceof String guid) {
>>     return new User(id, index, guid, ...);
>> }
>>
>> This will accept individual null or missing fields, but will also silently ignore fields with incorrect types:
>>
>> if (jso.get("id") instanceof String v) {
>>     u.setId(v);
>> }
>> if (jso.get("index") instanceof Long v) {
>>     u.setIndex(v.intValue());
>> }
>> if (jso.get("guid") instanceof String v) {
>>     u.setGuid(v);
>> }
>>
>> And, compared to databind - where there is information about the expected structure of the document, and it's the job of the framework to assert it - I posit that the errors encountered when writing code against this would be more like
>>
>> "something wrong with user"
>>
>> than
>>
>> "problem at users[5].name, expected string or null. got 5"
>>
>> Which feels unideal.
>>
>> One approach I find promising is something close to what Elm does with its decoders[8]. Not just combining assertion and binding like what pattern matching with records allows, but including a scheme for bubbling/nesting errors.
>>
>> static String string(Json json) throws JsonDecodingException {
>>     if (!(json instanceof Json.String jsonString)) {
>>         throw JsonDecodingException.of(
>>             "expected a string",
>>             json
>>         );
>>     } else {
>>         return jsonString.value();
>>     }
>> }
>>
>> static <T> T field(Json json, String fieldName, Decoder<? extends T> valueDecoder)
>>         throws JsonDecodingException {
>>     var jsonObject = object(json);
>>     var value = jsonObject.get(fieldName);
>>     if (value == null) {
>>         throw JsonDecodingException.atField(
>>             fieldName,
>>             JsonDecodingException.of(
>>                 "no value for field",
>>                 json
>>             )
>>         );
>>     } else {
>>         try {
>>             return valueDecoder.decode(value);
>>         } catch (JsonDecodingException e) {
>>             throw JsonDecodingException.atField(fieldName, e);
>>         } catch (Exception e) {
>>             throw JsonDecodingException.atField(
>>                 fieldName,
>>                 JsonDecodingException.of(e, value)
>>             );
>>         }
>>     }
>> }
>>
>> Which I think has some benefits over the ways I've seen of working with trees.
>>
>> - It is declarative enough that folks who prefer databind might be happy enough.
>>
>> static User fromJson(Json json) {
>>     return new User(
>>         Decoder.field(json, "id", Decoder::string),
>>         Decoder.field(json, "index", Decoder::long_),
>>         Decoder.field(json, "guid", Decoder::string)
>>     );
>> }
>>
>> // ...
>>
>> List<User> users = Decoder.array(json, User::fromJson);
>>
>> - Handling null and optional fields could be less easily conflated.
>>
>> Decoder.field(json, "id", Decoder::string);
>>
>> Decoder.nullableField(json, "id", Decoder::string);
>>
>> Decoder.optionalField(json, "id", Decoder::string);
>>
>> Decoder.optionalNullableField(json, "id", Decoder::string);
>>
>> - It composes well with user defined classes.
>>
>> record Guid(String value) {
>>     Guid {
>>         // some assertions on the structure of value
>>     }
>> }
>>
>> Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));
>>
>> // or even
>>
>> record Guid(String value) {
>>     Guid {
>>         // some assertions on the structure of value
>>     }
>>
>>     static Guid fromJson(Json json) {
>>         return new Guid(Decoder.string(json));
>>     }
>> }
>>
>> Decoder.field(json, "guid", Guid::fromJson);
>>
>> - When something goes wrong, the API can handle the fiddliness of capturing information for feedback.
>>
>> In the code I've sketched out it's just which field/index things went wrong at. Potentially capturing metadata like row/col numbers of the source would be sensible too.
>>
>> It's just not reasonable to expect devs to do extra work to get that, and it's really nice to give it.
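>>
>> For concreteness, a minimal sketch of the Decoder shape those snippets assume, and of how two of the variants could keep "missing" distinct from "null" (all names made up; object(...) is the analogous Json.Object decoder from the snippet above):
>>
>> @FunctionalInterface
>> interface Decoder<T> {
>>     T decode(Json json) throws JsonDecodingException;
>>
>>     // "optional" = key absent is fine; a present key must still decode
>>     static <T> Optional<T> optionalField(Json json, String fieldName,
>>                                          Decoder<? extends T> valueDecoder)
>>             throws JsonDecodingException {
>>         var value = object(json).get(fieldName);
>>         return value == null
>>                 ? Optional.empty()
>>                 : Optional.of(valueDecoder.decode(value));
>>     }
>>
>>     // "nullable" = key must be present, but may hold a JSON null
>>     static <T> T nullableField(Json json, String fieldName,
>>                                Decoder<? extends T> valueDecoder)
>>             throws JsonDecodingException {
>>         var value = object(json).get(fieldName);
>>         if (value == null) {
>>             throw JsonDecodingException.atField(fieldName,
>>                     JsonDecodingException.of("no value for field", json));
>>         }
>>         return value instanceof Json.Null ? null : valueDecoder.decode(value);
>>     }
>> }
>>
>> And a strawman of the exception doing the bubbling - accumulating a path as it propagates up, so the final message reads like the users[5].name example above (again made up, but matching the of/atField calls in the earlier snippet):
>>
>> final class JsonDecodingException extends Exception {
>>     private final ArrayDeque<String> path = new ArrayDeque<>(); // java.util.ArrayDeque
>>
>>     private JsonDecodingException(String message, Throwable cause) {
>>         super(message, cause);
>>     }
>>
>>     static JsonDecodingException of(String message, Json offendingValue) {
>>         return new JsonDecodingException(message + ", got: " + offendingValue, null);
>>     }
>>
>>     static JsonDecodingException of(Throwable cause, Json offendingValue) {
>>         return new JsonDecodingException("failed at: " + offendingValue, cause);
>>     }
>>
>>     static JsonDecodingException atField(String fieldName, JsonDecodingException e) {
>>         e.path.addFirst(fieldName); // prepend: outer decoders wrap inner ones
>>         return e;
>>     }
>>
>>     static JsonDecodingException atIndex(int index, JsonDecodingException e) {
>>         e.path.addFirst("[" + index + "]");
>>         return e;
>>     }
>>
>>     @Override
>>     public String getMessage() {
>>         var sb = new StringBuilder();
>>         for (var segment : path) {
>>             if (sb.length() > 0 && !segment.startsWith("[")) {
>>                 sb.append('.');
>>             }
>>             sb.append(segment);
>>         }
>>         return sb.isEmpty() ? super.getMessage() : sb + ": " + super.getMessage();
>>     }
>> }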
>>
>> There are also some downsides, like
>>
>> - I do not know how compatible it would be with lazy trees.
>>
>> Lazy trees being the only way that a tree API could handle big or deep documents. The general concept, as applied in libraries like json-tree[9], is to navigate without doing any work, and that clashes with wanting to instanceof-check the info at the current path.
>>
>> - It *almost* gives enough information to be a general schema approach.
>>
>> If one field fails, the model as written throws an exception immediately. If an API should return "errors": [...], that is inconvenient to construct.
>>
>> - None of the existing popular libraries are doing this.
>>
>> The only mechanics that are strictly required to give this sort of API is lambdas. Those have been out for a decade. Yes, sealed interfaces make the data model prettier, but in concept you can build the same thing on top of anything.
>>
>> I could argue that this is because of the "cultural momentum" of databind or some other reason, but the fact remains that it isn't a proven-out approach.
>>
>> Writing Json libraries is a todo list[10]. There are a lot of bad ideas and this might be one of them.
>>
>> - Performance impact of so many instanceof checks.
>>
>> I've gotten a 4.2% slowdown compared to the "regular" tree code without the repeated casts.
>>
>> But that was with a parser that is 5x slower than Jackson's (using the same benchmark project as for the snippets). I think there could be reason to believe that the JIT does well enough with repeated instanceof checks to consider it.
>>
>> My current thinking is that - despite not solving for large or deep documents - starting with a really "dumb" realized tree API might be the right place to start for the read side of a potential incubator module.
>>
>> But regardless - this feels like a good time to start more concrete conversations.
>>
>> I feel I should cap this email since I've reached the point of decoherence and haven't even mentioned the write side of things.
>>
>> [1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
>> [2]: https://security.snyk.io/vuln/maven?search=jackson-databind
>> [3]: I only know like 8 people
>> [4]: https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
>> [5]: When I say "intent", I do so knowing full well no one has been actively thinking of this for an entire Game of Thrones
>> [6]: https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
>> [7]: https://www.infoq.com/articles/data-oriented-programming-java/
>> [8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
>> [9]: https://github.com/jbee/json-tree
>> [10]: https://stackoverflow.com/a/14442630/2948173
>> [11]: In 30 days JEP-198 will be recognizably PI days old for the 2nd time in its history.
>> [12]: To me, the fact that it is still an open JEP is more a social convenience than anything. I could just as easily be writing this exact same email about TOML.