I've written a rough draft of a PEP for standard library inclusion, attached
to this email. Comments/improvements welcome - I tried to leave most of the
differences between modules in the "Issues" section.
PEP: XXX
Title: A JSON handling library
Version: $Revision$
Last-Modified: $Date$
Author: John Millikin <[EMAIL PROTECTED]>
Discussions-To: web-sig@python.org
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 05-Apr-2008
Python-Version: 2.6
Post-History: XXX


Abstract
========

This PEP describes a proposed library for parsing and generating
data in the `JSON` [1]_ format. JSON stands for "JavaScript Object
Notation", and is described by RFC 4627 [2]_.

Rationale
=========

JSON is a widely-used data interchange format, often used for sending
data to and from a web browser using JavaScript. Its simplicity and
ease of use have led to various implementations with varying degrees
of compliance with the RFC. By bundling a capable implementation in
Python's standard library, I hope to reduce or eliminate the need
for choosing a JSON library.

Existing Public Libraries
=========================

* Bob Ippolito's simplejson [3]_
* Deron Meranda's demjson [4]_
* John Millikin's jsonlib [5]_
* Alan Kennedy mentioned on web-sig [6]_ that he has written
  an implementation for Jython, but I couldn't find source code for
  it.

Each of these has a different API, a different degree of strictness,
and a different quality of error handling.

Module Interface
================

Parsing
-------

Encoding Autodetection
''''''''''''''''''''''

The RFC requires that JSON be encoded in one of the Unicode encodings
(UTF-8, UTF-16, or UTF-32). Because the first two characters of a
valid JSON expression are always from the ASCII set, the pattern of
null bytes in the first four bytes of input reliably identifies the
encoding. Functions for autodetecting encoding exist in jsonlib and
demjson.
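
The detection rule can be sketched as follows. This is a minimal
illustration of the null-byte table in section 3 of RFC 4627, not the
code used by jsonlib or demjson; the name ``detect_encoding`` is
hypothetical, and byte order marks are deliberately not handled.

```python
def detect_encoding(data: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four bytes.

    Hypothetical helper: assumes the text starts with two ASCII
    characters, as RFC 4627 guarantees, and carries no byte order mark.
    """
    if len(data) < 4:
        # Too short to be UTF-16 or UTF-32, which need at least
        # four bytes to hold two characters.
        return "utf-8"
    nulls = tuple(byte == 0 for byte in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (normally no nulls at all) is treated as UTF-8.
    return patterns.get(nulls, "utf-8")
```

For example, ``detect_encoding('[1]'.encode('utf-16-le'))`` returns
``'utf-16-le'``, after which the bytes can be decoded with that codec.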

Parsing API
'''''''''''

A JSON expression may be parsed using the ``parse`` function::

  parse (bytes_or_string)

If the input is a ``bytes`` object, the encoding should be auto-detected
as above. If input has been received in a non-standard encoding, it can
be manually decoded and passed to ``parse`` as a string. The return
value is either a sequence or a mapping, depending on the input.
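
As a rough illustration of the intended behaviour, ``parse`` can be
approximated with the ``json`` module that ships with current Python 3,
whose ``loads`` function accepts both ``str`` and ``bytes`` and detects
UTF-8, UTF-16, and UTF-32 input. That module postdates this draft, so
this is a stand-in under that assumption, not the proposed
implementation.

```python
import json


def parse(bytes_or_string):
    """Stand-in for the proposed ``parse``, built on the modern stdlib."""
    return json.loads(bytes_or_string)


# Bytes in a Unicode encoding are auto-detected.
from_utf8 = parse(b'[1, 2]')
from_utf16 = parse('{"a": 1}'.encode('utf-16-le'))

# Input in a non-standard encoding is decoded by the caller first.
latin1_bytes = '["caf\u00e9"]'.encode('latin-1')
from_latin1 = parse(latin1_bytes.decode('latin-1'))
```

Here ``from_utf8`` is a list, ``from_utf16`` is a dict, and
``from_latin1`` is a list holding one Unicode string.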

Serialization
-------------

Python objects may be serialized using the ``generate`` function::

  generate (obj, indent = None, ascii_only = True, encoding = 'utf-8')
  
``indent`` is used to control pretty-printing. If ``None``, no pretty
printing will be performed and the output will be maximally compact.
If ``indent`` is a string, that string will be used for indenting
nested values. The only characters allowed in ``indent`` are those that
are valid JSON whitespace: U+0009, U+000A, U+000D, and U+0020.

``ascii_only`` controls whether the output may contain characters above
the ASCII set. If ``True``, all non-ASCII characters must be escaped
using \\uXXXX syntax. Otherwise, non-ASCII characters will be included
without escaping. Depending on the output encoding and values of the
characters, this might be more size-efficient.

``encoding`` specifies how the output is to be encoded. If ``None``,
the output will be a Unicode string. By default, JSON is encoded in
UTF-8.

Note: this is the set of options generally supported by implementations.
For a full treatment of other options, see `Options for Serialization`_.
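
The three parameters can be sketched on top of the modern standard
library's ``json.dumps``. The mapping of ``ascii_only`` onto
``ensure_ascii`` and the use of compact separators are assumptions made
for illustration; the draft does not prescribe an implementation.

```python
import json

# The four characters JSON permits as whitespace.
JSON_WHITESPACE = set("\t\n\r ")


def generate(obj, indent=None, ascii_only=True, encoding="utf-8"):
    """Illustrative sketch of the proposed ``generate`` signature."""
    if indent is not None and not set(indent) <= JSON_WHITESPACE:
        raise ValueError("indent may contain only JSON whitespace")
    if indent is None:
        # No pretty-printing: drop the spaces after separators.
        text = json.dumps(obj, ensure_ascii=ascii_only,
                          separators=(",", ":"))
    else:
        text = json.dumps(obj, ensure_ascii=ascii_only, indent=indent)
    # encoding=None returns a Unicode string instead of bytes.
    return text if encoding is None else text.encode(encoding)
```

With these definitions, ``generate([1, 2])`` returns ``b'[1,2]'`` and
``generate([1], indent="\t", encoding=None)`` returns a string with the
item on its own tab-indented line.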

Other
-----

XXX Should the encoding autodetection function be a part of the
public API?

Issues
======

Representation of Fractional Numbers
------------------------------------

The author of jsonlib feels that fractional numbers should be parsed
into an instance of ``decimal.Decimal``, to avoid issues with values
that cannot be represented exactly by the ``float`` type [7]_:

  The spec does not require a decimal, but I dislike losing information
  in the parsing stage. Any implementation in the standard library
  should, in my opinion, at least offer a parameter for lossless parsing
  of number values.

The author of simplejson disagrees [8]_, saying that:

  Practically speaking I've tried using decimal instead of float for
  JSON and it's generally The Wrong Thing To Do. The spec doesn't say
  what to do about numbers, but for proper JavaScript interaction you
  want to do things that approximate what JS is going to do: 64-bit
  floating point.

demjson appears to have some sort of float precision detection
mechanism, and returns instances of ``float`` only if they can
represent a value exactly.
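
The trade-off can be seen with the ``parse_float`` hook of the modern
standard library's ``json.loads``. The helper ``exact_or_decimal`` is
hypothetical: it shows one way to return a ``float`` only when the
value is exactly representable, and is not a description of demjson's
actual mechanism.

```python
import json
from decimal import Decimal


def exact_or_decimal(literal):
    """Return a float if it represents ``literal`` exactly, else a Decimal."""
    as_float = float(literal)
    as_decimal = Decimal(literal)
    # Decimal(float) converts the binary value without rounding, so the
    # comparison is true only when no information was lost.
    return as_float if Decimal(as_float) == as_decimal else as_decimal


text = '[0.5, 0.1, 1.10]'
as_floats = json.loads(text)                          # lossy for 0.1
as_decimals = json.loads(text, parse_float=Decimal)   # keeps "1.10"
mixed = json.loads(text, parse_float=exact_or_decimal)
```

In ``mixed``, ``0.5`` stays a ``float`` because it is exact in binary,
while ``0.1`` and ``1.10`` become ``Decimal`` instances.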

Serializing User-defined Types
------------------------------

There should be some way for a user to specify how types not known
to the JSON library should be serialized. For example, Django
needs to serialize types related to date and time.

* simplejson supports a ``default`` parameter to ``dump`` and
  ``dumps``, which should be a callable that accepts a value and
  returns a serializable object.
* demjson supports a ``json_equivalent`` method of objects to
  encode, or users may subclass the ``demjson.JSON`` class and
  override the ``encode_default`` method.
* jsonlib supports an ``on_unknown`` parameter to ``write``, which
  acts like simplejson's ``default``.
* Alan Kennedy's implementation checks for a ``__json__`` method of
  objects to serialize [6]_.
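
Two of these approaches, a ``default``-style callable and a
``__json__`` method, can be combined in a short sketch. It uses the
``default`` parameter of the modern standard library's ``json.dumps``;
the ``encode_unknown`` function and ``Point`` class are hypothetical
names chosen for illustration.

```python
import datetime
import json


def encode_unknown(value):
    """Fallback hook called for objects the encoder does not know."""
    if hasattr(value, "__json__"):
        # Method-based protocol: the object describes itself.
        return value.__json__()
    if isinstance(value, (datetime.datetime, datetime.date, datetime.time)):
        return value.isoformat()
    raise TypeError(f"{type(value).__name__} is not JSON serializable")


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __json__(self):
        return {"x": self.x, "y": self.y}


payload = {"when": datetime.date(2008, 4, 5), "where": Point(1, 2)}
text = json.dumps(payload, default=encode_unknown, sort_keys=True)
# text == '{"when": "2008-04-05", "where": {"x": 1, "y": 2}}'
```

The callable keeps type knowledge in one place, while the method lets
each class control its own representation; a library could reasonably
support either or both.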

Options for Serialization
-------------------------

There are options supported by only a few of the implementations:

``allow_nan``
  In ``simplejson``, allows Infinity and NaN to be serialized. These
  values are not supported by JSON, but are supported in JavaScript.
  
``check_circular``
  In ``simplejson``, allows the check for self-referential containers
  to be disabled.
  
``coerce_keys``
  In ``jsonlib``, forces non-string mapping keys to strings.
  
``default``
  In ``simplejson``, provides a hook for serializing user-defined
  types.
  
``indent``
  In ``simplejson``, an integer specifying the indentation level in
  spaces.
  
``on_unknown``
  In ``jsonlib``, serves the same purpose as simplejson's ``default``.
  
``separators``
  In ``simplejson``, allows the user to override the separators used
  for delimiting array and object values. There is no check performed
  as to whether this would produce invalid JSON. I think having this
  parameter is insane.
  
``skipkeys``
  In ``simplejson``, skips serializing mapping items with non-string
  keys.
  
``sort_keys``
  In ``jsonlib``, sorts mapping keys to provide consistent output for
  unit testing.
  
``strict``
  In ``demjson``, serves the same purpose as simplejson's
  ``allow_nan``.
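
A few of these options can be tried with the modern standard library's
``json`` module, which exposes ``allow_nan``, ``sort_keys``, and
``separators`` under the same names. It is used here only as a
convenient stand-in; behaviour of the individual libraries listed above
may differ.

```python
import json

data = {"b": [1, 2], "a": float("nan")}

# sort_keys gives deterministic output; NaN is emitted by default even
# though it is not valid JSON.
lenient = json.dumps(data, sort_keys=True)
# lenient == '{"a": NaN, "b": [1, 2]}'

# allow_nan=False rejects the same value instead.
try:
    json.dumps(data, allow_nan=False)
    nan_rejected = False
except ValueError:
    nan_rejected = True

# separators are not validated, so arbitrary ones yield invalid JSON.
broken = json.dumps([1, 2], separators=(";", "="))
try:
    json.loads(broken)
    broken_is_valid = True
except ValueError:
    broken_is_valid = False
```

The last case illustrates the concern about ``separators``: the
encoder happily produces ``[1;2]``, which no JSON parser will accept.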

Non-string Object Keys
----------------------

JSON allows only strings to be used as object keys. demjson in loose
mode allows non-string keys to be parsed, and simplejson will
automatically coerce some types to strings. simplejson has an option
for skipping non-string keys, and jsonlib has an option for coercing
them.
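
Both behaviours, coercing and skipping, can be observed in the modern
standard library's ``json`` module, which is used below as an
illustration rather than as a statement about any one of the libraries
above.

```python
import json

# Numbers, booleans, and None are coerced to their JSON spellings.
coerced = json.dumps({1: "int", 2.5: "float", None: "none", False: "bool"})
# coerced == '{"1": "int", "2.5": "float", "null": "none", "false": "bool"}'

# Other key types raise TypeError unless skipkeys is set.
try:
    json.dumps({(1, 2): "tuple"})
    tuple_key_rejected = False
except TypeError:
    tuple_key_rejected = True

skipped = json.dumps({(1, 2): "tuple", "ok": 1}, skipkeys=True)
# skipped == '{"ok": 1}'
```

Note that coercion is lossy: after a round trip the key ``1`` comes
back as the string ``"1"``.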

"Raw" atoms
-----------

JSON expressions must have an array or object as the outer-most
value -- that is, the expressions ``true``, ``42``, and ``"spam"``
are not valid JSON. Strict-mode demjson and jsonlib raise exceptions
when parsing or generating such an expression; simplejson does not.

This "feature" is widely supported, but it might just be a non-obvious
bug.
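
A strict check is easy to layer on top of a permissive parser. The
sketch below wraps the modern standard library's ``json.loads``, which
accepts raw atoms; ``parse_strict`` is a hypothetical name.

```python
import json


def parse_strict(text):
    """Parse JSON, rejecting any outermost value that is not a container."""
    value = json.loads(text)
    if not isinstance(value, (list, dict)):
        raise ValueError("outermost JSON value must be an array or object")
    return value


permissive = json.loads("42")      # accepted by the permissive parser
wrapped = parse_strict("[42]")     # accepted by both
```

Calling ``parse_strict("42")`` raises ``ValueError``, matching the
strict behaviour described above.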

Trailing Commas
---------------

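The text ``[1, 2, 3,]`` is valid in both JavaScript and Python, but
is invalid JSON. The ECMAScript specification defines it as an array
of three items, although some JavaScript engines (notably older
versions of Internet Explorer) treat it as an array of length four
with the items ``[1, 2, 3, undefined]``. In Python, it is a list of
three items.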

Alan Kennedy mentioned that his parser has an option to support
reading these, so presumably he has a use case for it. He didn't
mention what it was parsed as.
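
The difference between the Python literal and the JSON text can be
checked directly. The modern standard library's ``json`` module is used
here as an example of a parser that rejects the trailing comma.

```python
import json

# As a Python list display, the trailing comma is ignored.
python_items = [1, 2, 3,]

# As JSON text, the same characters are a syntax error.
try:
    json.loads("[1, 2, 3,]")
    trailing_comma_rejected = False
except json.JSONDecodeError:
    trailing_comma_rejected = True
```

A parser option to accept such input would have to pick one of the
interpretations discussed above, which is part of why the choice
deserves explicit documentation.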

Function Names
--------------

There is no real agreement on what the public functions should be
named. simplejson uses load[s] and dump[s], modeled after the
``pickle`` module. demjson uses ``decode`` and ``encode``. jsonlib
uses ``read`` and ``write``, modeled after the ``python-json``
module.

This PEP uses ``parse`` and ``generate`` because that is what the
``email`` module uses.

Module Name
-----------

Probably ``json``, but there's been no actual discussion or consensus
on it that I know of.

Lint for JSON
-------------

demjson comes with lint-like functionality. It would be nice to have
this available in the standard library as well, so that invalid JSON
could be detected without having to actually parse it.

Resources
=========

* `Comparing JSON modules for Python`__, by Deron Meranda.

  __ http://deron.meranda.us/python/comparing_json_modules/

References
==========

.. [1] Introducing JSON, contains general description of JSON and a list
   of implementations.
   (http://json.org/)

.. [2] RFC 4627
   (http://www.ietf.org/rfc/rfc4627.txt)

.. [3] http://pypi.python.org/pypi/simplejson/

.. [4] http://pypi.python.org/pypi/demjson/

.. [5] http://pypi.python.org/pypi/jsonlib/

.. [6] http://mail.python.org/pipermail/web-sig/2008-March/003332.html

.. [7] http://mail.python.org/pipermail/web-sig/2008-March/003343.html

.. [8] http://mail.python.org/pipermail/web-sig/2008-March/003336.html

Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
