Re: JEP-198 - Lets start talking about JSON

Brian Goetz Tue, 28 Feb 2023 11:48:35 -0800

As you can probably imagine, I've been thinking about these topics for quite a while, ever since we started working on records and pattern matching. It sounds like a lot of your thoughts have followed a similar arc to ours.

I'll share with you some of our thoughts, but I can't be engaging in a detailed back-and-forth right now -- we have too many other things going on, and this isn't yet on the front burner. I think there's a right time for this work, and we're not quite there yet, but we'll get there soon enough and we'll pick up the ball again then.

To the existential question: yes, there should be a simpler, built-in way to parse JSON. And, as you observe, the railroad diagram in the JSON spec is a graphical description of an algebraic data type. One of the great simplifying effects of having algebraic data types (records + sealed classes) in the language is that many data modeling problems collapse down to the point where considerably less creativity is required of an API. Here's the JSON API one can write after literally only 30 seconds of thought:

    sealed interface JsonValue {
        record JsonString(String s) implements JsonValue { }
        record JsonNumber(double d) implements JsonValue { }
        record JsonNull() implements JsonValue { }
        record JsonBoolean(boolean b) implements JsonValue { }
        record JsonArray(List<JsonValue> values) implements JsonValue { }
        record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }
    }

It matches the JSON spec almost literally, and you can use pattern matching to parse a document. (OK, there's some tiny bit of creativity here in that True/False have been collapsed to a single JsonBoolean type, but you get my point.)
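To make the pattern-matching point concrete, here is a minimal sketch of consuming the model above. It assumes Java 21 or later (record patterns in `switch`). The `render` helper is a hypothetical name added purely for illustration and is not part of any proposed JDK API.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Same shape as the model above; render() is a hypothetical helper for illustration.
sealed interface JsonValue {
    record JsonString(String s) implements JsonValue { }
    record JsonNumber(double d) implements JsonValue { }
    record JsonNull() implements JsonValue { }
    record JsonBoolean(boolean b) implements JsonValue { }
    record JsonArray(List<JsonValue> values) implements JsonValue { }
    record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }

    // Exhaustive switch: no default branch is needed because the hierarchy is sealed.
    // String escaping is deliberately omitted to keep the sketch short.
    static String render(JsonValue v) {
        return switch (v) {
            case JsonString(String s) -> "\"" + s + "\"";
            case JsonNumber(double d) -> Double.toString(d);
            case JsonNull() -> "null";
            case JsonBoolean(boolean b) -> Boolean.toString(b);
            case JsonArray(List<JsonValue> values) -> values.stream()
                    .map(JsonValue::render)
                    .collect(Collectors.joining(",", "[", "]"));
            case JsonObject(Map<String, JsonValue> pairs) -> pairs.entrySet().stream()
                    .map(e -> "\"" + e.getKey() + "\":" + render(e.getValue()))
                    .collect(Collectors.joining(",", "{", "}"));
        };
    }
}
```

If a seventh case were ever added to the sealed interface, the compiler would flag this switch as non-exhaustive, which is much of the appeal of modeling JSON this way.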

But, we're not quite ready to put this API into the JDK, because the language isn't *quite* there yet. Records give you nice pattern matching, but they come at a cost; they're very specific and have rigid ideas about initialization, which ripples into a number of constraints on an implementation (i.e., much harder to parse lazily.) So we're waiting until we have deconstruction patterns (next up on the patterns parade) so that the records above can be interfaces and still support pattern matching (and more flexibility in implementation, including using value classes when they arrive.) It's not a long hop, though.

I agree with your assessment of streaming models; for documents too large to fit into memory, we'll let someone else provide a specialized solution. Streaming and fully-materialized-tree are not the only two options; there are plenty of points in the middle.

As to API idioms, these can be layered. The lazy-tree model outlined above can be a foundation for data binding, dynamic mapping to records, jsonpath, etc. But once you've made the streaming-vs-materialized choice in favor of materialized, it's hard to imagine not having something like the above at the base of the tower.

The question you raise about error handling is one that infuses pattern matching in general. Pattern matching allows us to collapse what would be a thousand questions -- "does key X exist? is it mapped to a number? is the number in the range of byte?" -- each with their own failure-handling path, into a single question. That's great for reliable and readable code, but it does make errors more opaque, because it is more like the red "check engine" light on your dashboard. (Something like JSONPath could generate better error messages since you've given it a declarative description of an assumed structural invariant.) But, imperative code that has to treat each structural assumption as a possible control-flow point is a disaster; we've seen too much code like this already.
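As an illustration of the "single question" idea, the sketch below folds the three example checks (key present, mapped to a number, in the range of byte) into one condition. It assumes Java 21 or later and uses a trimmed-down copy of the model above. The `byteField` helper and its `OptionalInt` return type are hypothetical choices made for this example only.

```java
import java.util.Map;
import java.util.OptionalInt;

// Trimmed copy of the model above; byteField() is a hypothetical helper.
sealed interface JsonValue {
    record JsonString(String s) implements JsonValue { }
    record JsonNumber(double d) implements JsonValue { }
    record JsonObject(Map<String, JsonValue> pairs) implements JsonValue { }

    // One question instead of three: is the key present, is it a number,
    // and is that number integral and within the range of byte?
    static OptionalInt byteField(JsonValue doc, String key) {
        if (doc instanceof JsonObject(Map<String, JsonValue> pairs)
                && pairs.get(key) instanceof JsonNumber(double d)
                && d == Math.rint(d)
                && d >= Byte.MIN_VALUE && d <= Byte.MAX_VALUE) {
            return OptionalInt.of((int) d);
        }
        // Every kind of failure lands here with no detail: the "check engine" light.
        return OptionalInt.empty();
    }
}
```

A missing key, a string where a number was expected, and an out-of-range number all produce the same empty result, which is exactly the opacity trade-off described above.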

The ecosystem is big enough that there will be lots of people with strong opinions that "X is the only sensible way to do it" (we've already seen X=databinding on this thread), but the reality is that there are multiple overlapping audiences here, and we have to be clear which audiences we are prioritizing. We can have that debate when the time is right.

So, we'll get there, but we're waiting for one or two more bits of language evolution to give us the substrate for the API that feels right.


Hope this helps,
-Brian


On 12/15/2022 3:30 PM, Ethan McCue wrote:

I'm writing this to drive some forward motion and to nerd-snipe those who know better than I do into putting their thoughts into words.


There are three ways to process JSON[1]
- Streaming (Push or Pull)
- Traversing a Tree (Realized or Lazy)
- Declarative Databind (N ways)

Of these, JEP-198 explicitly ruled out providing "JAXB style type safe data binding."

No justification is given, but if I had to insert my own: mapping the Json model to/from the Java/JVM object model is a cursed combo of

- Huge possible design space
- Unpalatably large surface for backwards compatibility
- Serialization! Boo![2]

So for an artifact like the JDK, it probably doesn't make sense to include. That tracks.

It won't make everyone happy, people like databind APIs, but it tracks.

So for the "read flow" these are the things to figure out.

                | Should Provide? | Intended User(s) |
----------------+-----------------+------------------+
 Streaming Push |                 |                  |
----------------+-----------------+------------------+
 Streaming Pull |                 |                  |
----------------+-----------------+------------------+
 Realized Tree  |                 |                  |
----------------+-----------------+------------------+
 Lazy Tree      |                 |                  |
----------------+-----------------+------------------+

At which point, we should talk about what "meets needs of Java developers using JSON" implies.

JSON is ubiquitous. Most kinds of software us schmucks write could have a reason to interact with it. The full set of "user personas" therefore isn't practical for me to talk about.[3]


JSON documents, however, are not so varied.

- There are small ones (1-10kb)
- There are medium ones (10-1000kb)
- There are big ones (1000kb-???)

- There are shallow ones
- There are deep ones

So that feels like an easier direction to talk about it from.

This repo[4] has some convenient toy examples of how some of those APIs look in libraries in the ecosystem. Specifically the Streaming Pull and Realized Tree models.


        User r = new User();
        while (true) {
            JsonToken token = reader.peek();
            switch (token) {
                case BEGIN_OBJECT:
                    reader.beginObject();
                    break;
                case END_OBJECT:
                    reader.endObject();
                    return r;
                case NAME:
                    String fieldname = reader.nextName();
                    switch (fieldname) {
                        case "id":
                            r.setId(reader.nextString());
                            break;
                        case "index":
                            r.setIndex(reader.nextInt());
                            break;
                        ...
                        case "friends":
                            r.setFriends(new ArrayList<>());
                            Friend f = null;
                            carryOn = true;
                            while (carryOn) {
                                token = reader.peek();
                                switch (token) {
                                    case BEGIN_ARRAY:
                                        reader.beginArray();
                                        break;
                                    case END_ARRAY:
                                        reader.endArray();
                                        carryOn = false;
                                        break;
                                    case BEGIN_OBJECT:
                                        reader.beginObject();
                                        f = new Friend();
                                        break;
                                    case END_OBJECT:
                                        reader.endObject();
                                        r.getFriends().add(f);
                                        break;
                                    case NAME:
                                        String fn = reader.nextName();
                                        switch (fn) {
                                            case "id":
                                                f.setId(reader.nextString());
                                                break;
                                            case "name":
                                                f.setName(reader.nextString());
                                                break;
                                        }
                                        break;
                                }
                            }
                            break;
                    }
            }

I think it's not hard to argue that the streaming apis are brutalist. The above is Gson, but Jackson, moshi, etc. seem at least morally equivalent.

It's hard to write, hard to write *correctly*, and there is a curious propensity towards pairing it with anemic, mutable models.

That being said, it handles big documents and deep documents really well. It also performs pretty darn well and is good enough as a "fallback" when the intended user experience is through something like databind.

So what could we do meaningfully better with the language we have today/will have tomorrow?


- Sealed interfaces + Pattern matching could give a nicer model for tokens

        sealed interface JsonToken {
            record Field(String name) implements JsonToken {}
            record BeginArray() implements JsonToken {}
            record EndArray() implements JsonToken {}
            record BeginObject() implements JsonToken {}
            record EndObject() implements JsonToken {}
            // ...
        }

        // ...

        User r = new User();
        while (true) {
            JsonToken token = reader.peek();
            switch (token) {
                case BeginObject __:
                    reader.beginObject();
                    break;
                case EndObject __:
                    reader.endObject();
                    return r;
                case Field("id"):
                    r.setId(reader.nextString());
                    break;
                case Field("index"):
                    r.setIndex(reader.nextInt());
                    break;

                // ...

                case Field("friends"):
                    r.setFriends(new ArrayList<>());
                    Friend f = null;
                    carryOn = true;
                    while (carryOn) {
                        token = reader.peek();
                        switch (token) {
                // ...

- Value classes can make it all more efficient

        sealed interface JsonToken {
            value record Field(String name) implements JsonToken {}
            value record BeginArray() implements JsonToken {}
            value record EndArray() implements JsonToken {}
            value record BeginObject() implements JsonToken {}
            value record EndObject() implements JsonToken {}
            // ...
        }

- (Fun One) We can transform a simpler-to-write push parser into a pull parser with Coroutines

    This is just a toy we could play with while making something in the JDK. I'm pretty sure we could make a parser which feeds into something like

        interface Listener {
            void onObjectStart();
            void onObjectEnd();
            void onArrayStart();
            void onArrayEnd();
            void onField(String name);
            // ...
        }

    and invert a loop like

        while (true) {
            char c = next();
            switch (c) {
                case '{':
                    listener.onObjectStart();
                    // ...
                // ...
            }
        }

    by putting a Coroutine.yield in the callback.

    That might be a meaningful simplification in code structure, I don't know enough to say.
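The push-to-pull inversion can be tried out today without coroutines. The JDK has no public `Coroutine.yield`, so the sketch below substitutes a producer thread and a `SynchronousQueue`: the listener callback blocks on the hand-off until the consumer asks for the next token, which plays the role of the yield. All names here (`PushToPull`, `PullReader`, `EndOfInput`, `drain`) are hypothetical, and the toy parser only recognizes structural characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.SynchronousQueue;

final class PushToPull {
    sealed interface Token {
        record BeginObject() implements Token { }
        record EndObject() implements Token { }
        record BeginArray() implements Token { }
        record EndArray() implements Token { }
        record EndOfInput() implements Token { } // hypothetical sentinel
    }

    interface Listener {
        void onObjectStart();
        void onObjectEnd();
        void onArrayStart();
        void onArrayEnd();
    }

    // Toy push parser: reports structural characters only and skips everything else.
    static void push(String input, Listener listener) {
        for (char c : input.toCharArray()) {
            switch (c) {
                case '{' -> listener.onObjectStart();
                case '}' -> listener.onObjectEnd();
                case '[' -> listener.onArrayStart();
                case ']' -> listener.onArrayEnd();
                default -> { }
            }
        }
    }

    static final class PullReader {
        private final SynchronousQueue<Token> handoff = new SynchronousQueue<>();

        // Blocks the producer until the consumer takes the token: the stand-in for yield.
        private void emit(Token token) {
            try {
                handoff.put(token);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }

        Token next() {
            try {
                return handoff.take();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
    }

    // Runs the push parser on its own thread and exposes it as a pull reader.
    static PullReader pull(String input) {
        PullReader reader = new PullReader();
        Thread producer = new Thread(() -> {
            push(input, new Listener() {
                public void onObjectStart() { reader.emit(new Token.BeginObject()); }
                public void onObjectEnd() { reader.emit(new Token.EndObject()); }
                public void onArrayStart() { reader.emit(new Token.BeginArray()); }
                public void onArrayEnd() { reader.emit(new Token.EndArray()); }
            });
            reader.emit(new Token.EndOfInput());
        });
        producer.setDaemon(true); // do not keep the JVM alive if the consumer stops early
        producer.start();
        return reader;
    }

    // Pulls every token until the sentinel, in the order the push parser produced them.
    static List<Token> drain(String input) {
        PullReader reader = pull(input);
        List<Token> tokens = new ArrayList<>();
        for (Token t = reader.next(); !(t instanceof Token.EndOfInput); t = reader.next()) {
            tokens.add(t);
        }
        return tokens;
    }
}
```

A thread per parse is far heavier than a real coroutine would be, so this is only useful for seeing whether the inverted code structure is actually simpler, not for judging performance.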


But, I think there are some hard questions like

- Is the intent[5] to be a backing parser for ecosystem databind apis?
- Is the intent that users who want to handle big/deep documents fall back to this?
- Are those new language features / conveniences enough to offset the cost of committing to a new api?
- To whom exactly does a low level api provide value?
- What benefit is standardization in the JDK?

and just generally - who would be the consumer(s) of this?

The other kind of API still on the table is a Tree. There are two ways to handle this

1. Load it into `Object`. Use a bunch of instanceof checks/casts to confirm what it actually is.


        Object v;
        User u = new User();

        if ((v = jso.get("id")) != null) {
            u.setId((String) v);
        }
        if ((v = jso.get("index")) != null) {
            u.setIndex(((Long) v).intValue());
        }
        if ((v = jso.get("guid")) != null) {
            u.setGuid((String) v);
        }
        if ((v = jso.get("isActive")) != null) {
            u.setIsActive(((Boolean) v));
        }
        if ((v = jso.get("balance")) != null) {
            u.setBalance((String) v);
        }
        // ...
        if ((v = jso.get("latitude")) != null) {
            u.setLatitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
        }
        if ((v = jso.get("longitude")) != null) {
            u.setLongitude(v instanceof BigDecimal ? ((BigDecimal) v).doubleValue() : (Double) v);
        }
        if ((v = jso.get("greeting")) != null) {
            u.setGreeting((String) v);
        }
        if ((v = jso.get("favoriteFruit")) != null) {
            u.setFavoriteFruit((String) v);
        }
        if ((v = jso.get("tags")) != null) {
            List<Object> jsonarr = (List<Object>) v;
            u.setTags(new ArrayList<>());
            for (Object vi : jsonarr) {
                u.getTags().add((String) vi);
            }
        }
        if ((v = jso.get("friends")) != null) {
            List<Object> jsonarr = (List<Object>) v;
            u.setFriends(new ArrayList<>());
            for (Object vi : jsonarr) {
                Map<String, Object> jso0 = (Map<String, Object>) vi;
                Friend f = new Friend();
                f.setId((String) jso0.get("id"));
                f.setName((String) jso0.get("name"));
                u.getFriends().add(f);
            }
        }

2. Have an explicit model for Json, and helper methods that do said casts[6]



        this.setSiteSetting(readFromJson(jsonObject.getJsonObject("site")));
        JsonArray groups = jsonObject.getJsonArray("group");
        if (groups != null)
        {
            int len = groups.size();
            for (int i = 0; i < len; i++)
            {
                JsonObject grp = groups.getJsonObject(i);
                SNMPSetting grpSetting = readFromJson(grp);
                String grpName = grp.getString("dbgroup", null);
                if (grpName != null && grpSetting != null)
                    this.groupSettings.put(grpName, grpSetting);
            }
        }
        JsonArray hosts = jsonObject.getJsonArray("host");
        if (hosts != null)
        {
            int len = hosts.size();
            for (int i = 0; i < len; i++)
            {
                JsonObject host = hosts.getJsonObject(i);
                SNMPSetting hostSetting = readFromJson(host);
                String hostName = host.getString("dbhost", null);
                if (hostName != null && hostSetting != null)
                    this.hostSettings.put(hostName, hostSetting);
            }
        }

I think what has become easier to represent in the language nowadays is that explicit model for Json.

It's the 101 lesson of sealed interfaces.[7] It feels nice and clean.

        sealed interface Json {
            final class Null implements Json {}
            final class True implements Json {}
            final class False implements Json {}
            final class Array implements Json {}
            final class Object implements Json {}
            final class String implements Json {}
            final class Number implements Json {}
        }

And the cast-and-check approach is now more viable on account of pattern matching.


        if (jso.get("id") instanceof String v) {
            u.setId(v);
        }
        if (jso.get("index") instanceof Long v) {
            u.setIndex(v.intValue());
        }
        if (jso.get("guid") instanceof String v) {
            u.setGuid(v);
        }

        // or

        if (jso.get("id") instanceof String id &&
                jso.get("index") instanceof Long index &&
                jso.get("guid") instanceof String guid) {
            return new User(id, index, guid, ...); // look ma, no setters!
        }


And on the horizon, again, is value types.

But there are problems with this approach beyond the performance implications of loading into a tree.

For one, all the code samples above have different behaviors around null keys and missing keys that are not obvious from first glance.

This won't accept any null or missing fields

        if (jso.get("id") instanceof String id &&
                jso.get("index") instanceof Long index &&
                jso.get("guid") instanceof String guid) {
            return new User(id, index, guid, ...);
        }

This will accept individual null or missing fields, but also will silently ignore fields with incorrect types

        if (jso.get("id") instanceof String v) {
            u.setId(v);
        }
        if (jso.get("index") instanceof Long v) {
            u.setIndex(v.intValue());
        }
        if (jso.get("guid") instanceof String v) {
            u.setGuid(v);
        }
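The difference between the two styles is easiest to see by running both against the same input. The sketch below is a self-contained version of the two snippets above, operating on a plain `Map<String, Object>`. The `NullVsMissing` class, the three-field `User` record, and the `strict`/`lenient` method names are hypothetical stand-ins for illustration.

```java
import java.util.Map;
import java.util.Optional;

final class NullVsMissing {
    record User(String id, Long index, String guid) { }

    // All-or-nothing: any null, missing, or mistyped field rejects the whole document.
    static Optional<User> strict(Map<String, Object> jso) {
        if (jso.get("id") instanceof String id
                && jso.get("index") instanceof Long index
                && jso.get("guid") instanceof String guid) {
            return Optional.of(new User(id, index, guid));
        }
        return Optional.empty();
    }

    // Field-by-field: null and missing fields are tolerated, but so are wrong types,
    // which are dropped without any signal to the caller.
    static User lenient(Map<String, Object> jso) {
        String id = null;
        Long index = null;
        String guid = null;
        if (jso.get("id") instanceof String v) {
            id = v;
        }
        if (jso.get("index") instanceof Long v) {
            index = v;
        }
        if (jso.get("guid") instanceof String v) {
            guid = v;
        }
        return new User(id, index, guid);
    }
}
```

Given a document whose `index` is the string `"7"` rather than a number, `strict` returns an empty `Optional` while `lenient` returns a `User` with a null `index`. Neither result tells the caller which field was at fault.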

And, compared to databind where there is information about the expected structure of the document and it's the job of the framework to assert that, I posit that the errors that would be encountered when writing code against this would be more like

    "something wrong with user"

than

    "problem at users[5].name, expected string or null. got 5"

Which feels unideal.

One approach I find promising is something close to what Elm does with its decoders[8]. Not just combining assertion and binding like what pattern matching with records allows, but including a scheme for bubbling/nesting errors.


    static String string(Json json) throws JsonDecodingException {
        if (!(json instanceof Json.String jsonString)) {
            throw JsonDecodingException.of(
                    "expected a string",
                    json
            );
        } else {
            return jsonString.value();
        }
    }

    static <T> T field(Json json, String fieldName, Decoder<? extends T> valueDecoder) throws JsonDecodingException {
        var jsonObject = object(json);
        var value = jsonObject.get(fieldName);
        if (value == null) {
            throw JsonDecodingException.atField(
                    fieldName,
                    JsonDecodingException.of(
                            "no value for field",
                            json
                    )
            );
        }
        else {
            try {
                return valueDecoder.decode(value);
            } catch (JsonDecodingException e) {
                throw JsonDecodingException.atField(
                        fieldName,
                        e
                );
            } catch (Exception e) {
                throw JsonDecodingException.atField(fieldName, JsonDecodingException.of(e, value));
            }
        }
    }

Which I think has some benefits over the ways I've seen of working with trees.

- It is declarative enough that folks who prefer databind might be happy enough.


        static User fromJson(Json json) {
            return new User(
                Decoder.field(json, "id", Decoder::string),
                Decoder.field(json, "index", Decoder::long_),
                Decoder.field(json, "guid", Decoder::string)
            );
        }

        // ...

        List<User> users = Decoder.array(json, User::fromJson);

- Handling null and optional fields could be less easily conflated

    Decoder.field(json, "id", Decoder::string);

    Decoder.nullableField(json, "id", Decoder::string);

    Decoder.optionalField(json, "id", Decoder::string);

    Decoder.optionalNullableField(json, "id", Decoder::string);


- It composes well with user defined classes

    record Guid(String value) {
        Guid {
            // some assertions on the structure of value
        }
    }

    Decoder.field(json, "guid", guid -> new Guid(Decoder.string(guid)));

    // or even

    record Guid(String value) {
        Guid {
            // some assertions on the structure of value
        }

        static Guid fromJson(Json json) {
            return new Guid(Decoder.string(json));
        }
    }

    Decoder.field(json, "guid", Guid::fromJson);

- When something goes wrong, the API can handle the fiddlyness of capturing information for feedback.

    In the code I've sketched out it's just what field/index things went wrong at. Potentially capturing metadata like row/col numbers of the source would be sensible too.

    It's just not reasonable to expect devs to do extra work to get that and it's really nice to give it.
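To show how the error bubbling described above could produce a path like `users[5].name`, here is a compact, self-contained variant of the decoder fragments. It is a sketch under several assumptions that differ from the snippets above: the exception is unchecked for brevity, the `Json` model is cut down to four record cases, and the names `DecoderSketch`, `atIndex`, and `users` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class DecoderSketch {
    sealed interface Json {
        record Str(String value) implements Json { }
        record Num(double value) implements Json { }
        record Arr(List<Json> items) implements Json { }
        record Obj(Map<String, Json> fields) implements Json { }
    }

    interface Decoder<T> {
        T decode(Json json);
    }

    // Unchecked here for brevity (an assumption); carries the path to the failing value.
    static final class JsonDecodingException extends RuntimeException {
        final String path;
        final String problem;

        JsonDecodingException(String path, String problem) {
            super("problem at " + (path.isEmpty() ? "<root>" : path) + ": " + problem);
            this.path = path;
            this.problem = problem;
        }

        JsonDecodingException atField(String name) {
            return new JsonDecodingException(join(name), problem);
        }

        JsonDecodingException atIndex(int i) {
            return new JsonDecodingException(join("[" + i + "]"), problem);
        }

        // Prepends one path segment: "name" -> "[1].name" -> "users[1].name".
        private String join(String segment) {
            if (path.isEmpty()) return segment;
            return path.startsWith("[") ? segment + path : segment + "." + path;
        }
    }

    static String string(Json json) {
        if (json instanceof Json.Str s) return s.value();
        throw new JsonDecodingException("", "expected a string, got " + json);
    }

    static <T> T field(Json json, String name, Decoder<? extends T> decoder) {
        if (!(json instanceof Json.Obj obj)) {
            throw new JsonDecodingException("", "expected an object, got " + json);
        }
        Json value = obj.fields().get(name);
        if (value == null) throw new JsonDecodingException(name, "no value for field");
        try {
            return decoder.decode(value);
        } catch (JsonDecodingException e) {
            throw e.atField(name); // re-wrap so the field name is added to the path
        }
    }

    static <T> List<T> array(Json json, Decoder<? extends T> decoder) {
        if (!(json instanceof Json.Arr arr)) {
            throw new JsonDecodingException("", "expected an array, got " + json);
        }
        List<T> out = new ArrayList<>();
        for (int i = 0; i < arr.items().size(); i++) {
            try {
                out.add(decoder.decode(arr.items().get(i)));
            } catch (JsonDecodingException e) {
                throw e.atIndex(i); // re-wrap so the index is added to the path
            }
        }
        return out;
    }

    record User(String name) {
        static User fromJson(Json json) {
            return new User(field(json, "name", DecoderSketch::string));
        }
    }

    static List<User> users(Json doc) {
        return field(doc, "users", j -> array(j, User::fromJson));
    }
}
```

Each layer only knows its own segment of the path, and the full location is assembled as the exception propagates outward, so the code calling `users` never has to track where it is in the document.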


There are also some downsides like

- I do not know how compatible it would be with lazy trees.

    Lazy trees being the only way that a tree api could handle big or deep documents. The general concept as applied in libraries like json-tree[9] is to navigate without doing any work, and that clashes with wanting to instanceof check the info at the current path.

- It *almost* gives enough information to be a general schema approach

    If one field fails, that in the model throws an exception immediately. If an API should return "errors": [...], that is inconvenient to construct.

- None of the existing popular libraries are doing this

    The only mechanism strictly required to give this sort of API is lambdas. Those have been out for a decade. Yes, sealed interfaces make the data model prettier, but in concept you can build the same thing on top of anything.

    I could argue that this is because of "cultural momentum" of databind or some other reason, but the fact remains that it isn't a proven out approach.

    Writing Json libraries is a todo list[10]. There are a lot of bad ideas and this might be one of them.


- Performance impact of so many instanceof checks

    I've gotten a 4.2% slowdown compared to the "regular" tree code without the repeated casts.

    But that was with a parser that is 5x slower than Jackson's (using the same benchmark project as for the snippets). I think there could be reason to believe that the JIT does well enough with repeated instanceof checks to consider it.

My current thinking is that - despite not solving for large or deep documents - starting with a really "dumb" realized tree api might be the right place to start for the read side of a potential incubator module.

But regardless - this feels like a good time to start more concrete conversations. I feel I should cap this email since I've reached the point of decoherence and haven't even mentioned the write side of things.





[1]: http://www.cowtowncoder.com/blog/archives/2009/01/entry_131.html
[2]: https://security.snyk.io/vuln/maven?search=jackson-databind
[3]: I only know like 8 people

[4]: https://github.com/fabienrenaud/java-json-benchmark/blob/master/src/main/java/com/github/fabienrenaud/jjb/stream/UsersStreamDeserializer.java
[5]: When I say "intent", I do so knowing full well no one has been actively thinking of this for an entire Game of Thrones
[6]: https://github.com/yahoo/mysql_perf_analyzer/blob/master/myperf/src/main/java/com/yahoo/dba/perf/myperf/common/SNMPSettings.java
[7]: https://www.infoq.com/articles/data-oriented-programming-java/
[8]: https://package.elm-lang.org/packages/elm/json/latest/Json-Decode
[9]: https://github.com/jbee/json-tree
[10]: https://stackoverflow.com/a/14442630/2948173
[11]: In 30 days JEP-198 will be recognizably PI days old for the 2nd time in its history.
[12]: To me, the fact that it is still an open JEP is more a social convenience than anything. I could just as easily be writing this exact same email about TOML.
