Metadata encoding

Marvin Humphrey Sun, 22 Mar 2009 19:18:50 -0700

Greets,

Lucy indexes will contain significant metadata, which should be written in a
human-readable format for easy spelunking and debugging.  There are probably
four main contenders for choice of encoding: JSON, YAML, XML, and a custom
format.


If we go with a custom format, IMO it should be an extension of JSON.  Our
needs will not be limited to simple key-value pairs, and designing our own
full-featured data-description language would be foolish.  Let's try to avoid
custom formats until we decide that there's no other choice.

XML and YAML are certainly sophisticated enough to handle our data needs.
However, they both require large, heavyweight parsers, and I think we should
try to avoid imposing such a dependency on future Lucy C users.

Furthermore, XML is less well-matched to the scalar-list-mapping data
structures common to the dynamic languages that Lucy targets than either YAML
or JSON.  

YAML offers the advantage of extensible data types.  That's become more
appealing as I've tried to figure out how to serialize entire schemas in JSON,
including Analyzer and Similarity specifications.  However, the YAML spec is
very large.  If we decide that we need YAML's features, I think we ought to
try to limit ourselves to a subset of the spec.

Still, it would be for the best if we could avoid that kind of complexity, and
go with the simplest human-readable option that supports scalar-list-mapping
data structures: JSON.
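
As a rough illustration of how directly JSON maps onto scalar-list-mapping data, here is a minimal sketch using Python's standard `json` module. The key names (`format`, `segments`, `doc_count`, `fields`) are invented for the example and are not a proposed Lucy file layout.

```python
import json

# Hypothetical index metadata; every key name here is made up for
# illustration and is not a proposed Lucy format.
metadata = {
    "format": 1,
    "segments": ["seg_1", "seg_2"],
    "doc_count": 1200,
    "fields": {"title": {"indexed": True, "boost": 1.5}},
}

# Indented, key-sorted output keeps the file easy to read and to diff.
text = json.dumps(metadata, indent=2, sort_keys=True)

# Only scalars, lists, and mappings are involved, so the round trip
# gives back an equal structure.
restored = json.loads(text)
```

Nothing beyond the core data types is needed here, which is the case where JSON is at its most comfortable.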

This excerpt from the YAML 1.2 draft spec points a way forward:

    http://yaml.org/spec/1.2

    1.4. Relation to JSON

    Both JSON and YAML aim to be human readable data interchange formats.
    However, JSON and YAML have different priorities. JSON’s foremost design
    goal is simplicity and universality. Thus, JSON is trivial to generate and
    parse, at the cost of reduced human readability. It also uses a lowest
    common denominator information model, ensuring any JSON data can be easily
    processed by every modern programming environment.

    In contrast, YAML’s foremost design goals are human readability and
    support for serializing arbitrary native data structures. Thus, YAML
    allows for extremely readable files, but is more complex to generate and
    parse. In addition, YAML ventures beyond the lowest common denominator
    data types, requiring more complex processing when crossing between
    different programming environments. 

    YAML can therefore be viewed as a natural superset of JSON, offering
    improved human readability and a more complete information model. This is
    also the case in practice; every JSON file is also a valid YAML file. This
    makes it easy to migrate from JSON to YAML if/when the additional features
    are required.

    It may be useful to define a intermediate format between YAML and JSON.
    Such a format would be trivial to parse (but not very human readable),
    like JSON. At the same time, it would allow for serializing arbitrary
    native data structures, like YAML. Such a format might also serve as
    YAML’s "canonical format".

    Defining such a "YSON" format (YSON is a Serialized Object Notation) can
    be done either by enhancing the JSON specification or by restricting the
    YAML specification. Such a definition is beyond the scope of this
    specification. 

(Note that YAML version 1.2 is not well supported yet; most parsers support
1.0 or 1.1.)

I'm sure we can hammer all the data we need into JSON; the question is at what
point the result becomes so inelegant that wandering outside the JSON spec
into YAML becomes the best solution.  That's not a threshold we should cross
lightly, so for now I advocate that we try to work within JSON's constraints.
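
To give a flavor of what "hammering" typed data into JSON might look like, here is one possible sketch in Python. It marks typed objects with a reserved key inside an ordinary mapping. The `Analyzer` class, the `_class` key, and the helper names are all assumptions made for this example; nothing here is an agreed-upon Lucy convention.

```python
import json

class Analyzer:
    """Stand-in for a typed schema component (not Lucy's real class)."""
    def __init__(self, language):
        self.language = language

TAG = "_class"  # hypothetical reserved key naming the object's type

def dump_obj(obj):
    # Called by json.dumps for values it cannot serialize natively.
    if isinstance(obj, Analyzer):
        return {TAG: "Analyzer", "language": obj.language}
    raise TypeError(f"cannot serialize {type(obj).__name__}")

def load_obj(mapping):
    # Called by json.loads for every decoded mapping, innermost first.
    if mapping.get(TAG) == "Analyzer":
        return Analyzer(mapping["language"])
    return mapping

schema = {"fields": {"title": {"analyzer": Analyzer("en")}}}
text = json.dumps(schema, default=dump_obj, indent=2, sort_keys=True)
restored = json.loads(text, object_hook=load_obj)
```

The file on disk stays plain JSON that any parser can read. The cost is a reserved key and hand-written dispatch for each type, which is the kind of inelegance that YAML's extensible types would otherwise absorb.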

Marvin Humphrey
