Wes Turner writes:

 > Data interchange with structured types is worthwhile.

That's not what the main thread is about.  It's about adding support
for Decimal to the stdlib's json module.  Even the OP has explicitly
disclaimed pretty much everything else, although his preferred
implementation is more general than that.

I'm +1 on that.  I think the outline of how to do it has become pretty
obvious, and that it should be restricted to automatically converting
Decimals to a JSON number, perhaps under control of a use_decimal flag
for both encoding and decoding.
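
Concretely, here's roughly what I have in mind.  The decoding half is
already possible via the existing parse_float hook; the use_decimal
flag itself is hypothetical at this point:

    import json
    from decimal import Decimal

    # Decoding: already works today via the parse_float hook.
    data = json.loads('{"price": 19.99}', parse_float=Decimal)
    assert data["price"] == Decimal("19.99")

    # Encoding: hypothetical.  Today json.dumps() raises TypeError on a
    # Decimal unless you supply a custom encoder, which typically
    # degrades the value to float or str.  Under the proposal,
    #     json.dumps({"price": Decimal("19.99")}, use_decimal=True)
    # would emit '{"price": 19.99}' with the precision preserved.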

The rest should go into a separate thread.  First let's dispose of
this:

 > Streaming JSON is not possible without JSON lines support.

It is obvious to me that this should be handled in yet another thread
from "lossless JSON", because it can and should be independently
implemented, if it's done at all.  Given (ob, n) = raw_decode(idx=n)
support in the json module, the difficulty in implementing it is all
about buffering, and choosing where to do that buffering (in a
separate module? in json.load? in a new json.load_stream generator?).
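
To illustrate, a load_stream-style generator is little more than a
loop over raw_decode plus that buffering decision.  The sketch below
(iter_json_values is a made-up name) assumes the whole text is already
in memory, which is exactly the part a real implementation would have
to solve:

    import json

    def iter_json_values(text):
        """Yield successive JSON values from a string containing several."""
        decoder = json.JSONDecoder()
        idx, n = 0, len(text)
        while idx < n:
            # Skip whitespace between values (newlines, for JSON-lines input).
            while idx < n and text[idx].isspace():
                idx += 1
            if idx >= n:
                break
            obj, idx = decoder.raw_decode(text, idx)
            yield obj

    # list(iter_json_values('{"a": 1}\n{"b": 2}\n[3, 4]'))
    # -> [{'a': 1}, {'b': 2}, [3, 4]]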

I will now argue that the __json__ protocol is nowhere near so
obviously stdlib-able as Decimal and streaming JSON.

 > An object.__json__(**kwargs) protocol would inconvenience no-one so
 > long as:
 > - decimal isn't imported unless used
 > - all existing code continues to work

I also think that JSON is widely enough used, and deserves better
semantic support, that a protocol (specifically, the __json__ dunder)
for serializer support and some form of complementary deserializer
support are quite justifiable.  But the __json__ dunder is the *easy*
part.  The complexity here is in that complementary deserializer.
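
The easy half could be as small as something like this (ProtocolEncoder
and Point are invented for illustration; default() is the existing
JSONEncoder extension hook):

    import json

    class ProtocolEncoder(json.JSONEncoder):
        def default(self, o):
            dunder = getattr(type(o), "__json__", None)
            if dunder is not None:
                return dunder(o)        # the object chooses its JSON form
            return super().default(o)   # fall back to the usual TypeError

    class Point:
        def __init__(self, x, y):
            self.x, self.y = x, y
        def __json__(self, **kwargs):
            return {"x": self.x, "y": self.y}

    # json.dumps(Point(1, 2), cls=ProtocolEncoder) -> '{"x": 1, "y": 2}'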

Here's why.  To your desiderata I would add

- no complex type's module is imported unless used (easy)

- the deserializer support for a type should be linked to its
  serializer support (something like the codecs registry, but more
  complicated because each entity will need to invoke support
  separately, unlike codecs where there's one codec for a whole text;
  a rough sketch follows this list)

- such object support should be automatically linked in to both the
  top level serializer and deserializer dispatching.
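
Here is the kind of registry I mean, stripped to the bone (every name
is hypothetical, and the hard part -- wiring it into the dumps()/loads()
dispatch automatically -- is precisely what it leaves out):

    _JSON_TYPES = {}   # maps a type tag to (serialize, deserialize) hooks

    def register_json_type(tag, serializer, deserializer):
        # Link a type's serializer and deserializer, codecs-registry style.
        _JSON_TYPES[tag] = (serializer, deserializer)

    def tag_hook(obj):
        # Candidate object_hook for json.loads(): revive tagged dicts.
        tag = obj.get("@type") if isinstance(obj, dict) else None
        if tag in _JSON_TYPES:
            return _JSON_TYPES[tag][1](obj)
        return obj

    # register_json_type("decimal",
    #                    lambda d: {"@type": "decimal", "value": str(d)},
    #                    lambda obj: Decimal(obj["value"]))
    # json.loads(text, object_hook=tag_hook)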

The latter two desiderata look *hard* to me.  Without them, you've got
the inverse of the current Decimal problem.  This is going to require
that somebody or somebodies spend many person-hours on design,
implementation, and testing.  Also

- the deserializer support may or may not want to be in json.loads()

because it may be preferable to deserialize to the primitive Python
objects that correspond to the JSON types, and then allow the Python
program to flexibly handle those.  Eg, what to do about variable
annotations?  Should our deserializer automatically deal with those?
What if a variable's value conflicts with its annotation?  While there
may be a clear answer to this question after somebody has thought
about it for a bit, it's not obvious to me.

The fundamental problem with your overall argument is that the
usefulness to the community at large is unclear:

 > It is unfortunate that we all just use JSON and throw away decimals
 > and float precision and datetimes because json.dumps is so easy.

True for yourself, I assume.  But json.dumps is *not* why *the rest of
us* do that.  We do it because we've *always* done it.  The Python
objects we are serializing themselves lack units, precision, and pet's
name!  Until our Python programs become unit- and precision-aware,
support for "lossless JSON" is necessarily going to be idiosyncratic,
and mostly avoided.

 > How many people know that:
 > 
 > - You can or should use decimal to avoid float precision error, but then
 > you have to annoyingly write a JSONEncoder to save that data, and then the
 > type is lost when it's parsed and cast to a float when it's deserialized?
 > 
 > - JSON-LD is the only non-ad-hoc solution to preserving precision,
 > datetimes, and complex numbers and types with JSON
 > 
 > - JSON5 supports IEEE 754 ±Infinity and NaN
 > 
 > - Pickles do serialize arbitrary objects, but are not safe for data
 > publishing because unmarshalling runs executable code in the pickle (this
 > is in the docs now)

Very few.  But again, that's the wrong set of questions, for reasons
similar to the above issue about "why we use json.dumps".  The right
questions are:

1.  Of those who don't know, how many have need to know, and will
    acknowledge that need?  (If they don't admit it, good luck getting
    them to change their programs!)

2.  Of those who have need to know, how many would have "enough" of
    their serialization problems solved by any particular packaged set
    of features that might be added to the stdlib?

3.  Is the number of programs in 2 "large enough" to justify the
    additional maintenance burden and the risk that better but
    conflicting solutions will be created in the future?

 > JSON-LD is the way to go for complex types in JSON.

 > It's worth specifying a JSON serialization protocol as a PEP that
 > third-party and stdlib JSON implementations would use.

All of JSON-LD is way overkill for the examples of complex types
you've given.  We *do not need or want* a complete reimplementation
of the Semantic Web in JSON in our stdlib.  So what exactly are you
talking about?  Here's my idea:

I suspect your "serialization protocol" above really means
*deserialization* protocol.  object.__json__ is all the serialization
protocol we need, because it will produce a standard JSON stream that
can be deserialized (perhaps with different semantics!) by any
standard JSON deserializer.  Also, we don't need a PEP to specify the
protocol for providing a more accurate deserialization; JSON-LD
already did that work, and the parts we need are pretty trivial
(definitely @context, maybe @id).  So I interpret your word "protocol"
to mean "JSON-LD @context".  Is that close?

For almost all Python applications, a JSON-LD @context specific to
Python's object model and standard builtin types would be enough.
Since each type is itself a Python object, JSON-LD should be able to
represent user-defined classes and their instances within that
@context too.  Programs that provide more semantic information
about their classes would need additional, idiosyncratic
@context anyway, and I have no idea what a "standard extended
@context" would want to include.  Each large external package (NumPy,
Twisted) would want to implement its own @context, I think.
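
To make that concrete, here is a toy of the sort of thing I mean (the
IRIs and term names are made up; designing a real @context is exactly
the work under discussion):

    from decimal import Decimal

    document = {
        "@context": {
            "py": "https://example.org/python-types#",   # invented vocabulary
            "price": {"@id": "py:price", "@type": "py:decimal"},
        },
        "price": "19.99",   # carried losslessly as a string, typed by @context
    }

    # A @context-aware consumer can revive the value with full precision:
    if document["@context"]["price"]["@type"] == "py:decimal":
        price = Decimal(document["price"])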

We could imagine additional semantic information in this @context that
would even tell you which modules you need to pip-install from PyPI to
work with these data types, along with the developers' and auditors'[1]
signatures, so that you can authenticate the modules and apply your
trust model to decide whether you want to import them.

Steve

Footnotes: 
[1]  Is this new?  I know that frequently software modules are signed
by their maintainers, and people decide to extend trust to particular
maintainers.  But in open source, anybody can audit, so a list of
auditors with signatures, dates, and a comment field for the audit
might also be useful for modules whose maintainers aren't famous but
whose auditors are.
