Greets,
Lucy indexes will contain significant metadata, which should be written in a
human-readable format for easy spelunking and debugging. There are probably
four main contenders for choice of encoding: JSON, YAML, XML, and a custom
format.
If we go with a custom format, IMO it should be an extension of JSON. Our
needs will not be limited to simple key-value pairs, and designing our own
full-featured data-description language would be foolish. Let's try to avoid
custom formats until we decide that there's no other choice.
XML and YAML are certainly sophisticated enough to handle our data needs.
However, they both require large, heavyweight parsers, and I think we should
try to avoid imposing such a dependency on future Lucy C users.
Furthermore, XML is a poorer match than either YAML or JSON for the
scalar-list-mapping data structures common to the dynamic languages that Lucy
targets.
YAML offers the advantage of extensible data types. That's become more
appealing as I've tried to figure out how to serialize entire schemas in JSON,
including Analyzer and Similarity specifications. However, the YAML spec is
very large. If we decide that we need YAML's features, I think we ought to
try to limit ourselves to a subset of the spec.
Still, it would be best to avoid that kind of complexity and go with the
simplest human-readable option that supports scalar-list-mapping data
structures: JSON.
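
To make that concrete, here is a minimal sketch (in Python, purely for
illustration) of index metadata built from scalars, lists, and mappings and
written out as JSON. The key names are invented for the example; they are an
assumption, not a proposed Lucy file format.

```python
import json

# Hypothetical index metadata; every key name here is illustrative only.
metadata = {
    "format": 1,
    "doc_count": 1000,
    "fields": ["title", "body"],
    "segments": {
        "seg_1": {"doc_count": 600},
        "seg_2": {"doc_count": 400},
    },
}

# indent and sort_keys give stable, diff-friendly, human-readable output.
encoded = json.dumps(metadata, indent=2, sort_keys=True)

# Scalars, lists, and mappings survive the round trip unchanged.
decoded = json.loads(encoded)
```

Nothing beyond a small JSON reader and writer is needed to get from the file
back to native data structures, which is the main attraction.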
This excerpt from the YAML 1.2 draft spec points a way forward:
    http://yaml.org/spec/1.2

    1.4. Relation to JSON

    Both JSON and YAML aim to be human readable data interchange formats.
    However, JSON and YAML have different priorities. JSON’s foremost design
    goal is simplicity and universality. Thus, JSON is trivial to generate and
    parse, at the cost of reduced human readability. It also uses a lowest
    common denominator information model, ensuring any JSON data can be easily
    processed by every modern programming environment.

    In contrast, YAML’s foremost design goals are human readability and
    support for serializing arbitrary native data structures. Thus, YAML
    allows for extremely readable files, but is more complex to generate and
    parse. In addition, YAML ventures beyond the lowest common denominator
    data types, requiring more complex processing when crossing between
    different programming environments.

    YAML can therefore be viewed as a natural superset of JSON, offering
    improved human readability and a more complete information model. This is
    also the case in practice; every JSON file is also a valid YAML file. This
    makes it easy to migrate from JSON to YAML if/when the additional features
    are required.

    It may be useful to define a intermediate format between YAML and JSON.
    Such a format would be trivial to parse (but not very human readable),
    like JSON. At the same time, it would allow for serializing arbitrary
    native data structures, like YAML. Such a format might also serve as
    YAML’s "canonical format".

    Defining such a "YSON" format (YSON is a Serialized Object Notation) can
    be done either by enhancing the JSON specification or by restricting the
    YAML specification. Such a definition is beyond the scope of this
    specification.

(Note that YAML version 1.2 is not well supported yet; most parsers support
1.0 or 1.1.)
I'm sure we can hammer all the data we need into JSON; the question is at what
point the result becomes so inelegant that wandering outside the JSON spec
into YAML is the better solution. That's not a threshold we should cross
lightly, so for now I advocate that we try to work within JSON's constraints.
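
As a rough illustration of the kind of "hammering" I have in mind, here is a
sketch (Python again) of one way to carry type information inside plain JSON:
each serialized object is a mapping with a reserved "_class" key. The Analyzer
class, the "_class" key, and the registry are all assumptions made up for this
example; none of it is an existing Lucy API.

```python
import json

class Analyzer:
    """Stand-in for an Analyzer specification (hypothetical)."""

    def __init__(self, language):
        self.language = language

    def dump(self):
        # Serialize to a plain mapping tagged with a reserved "_class" key.
        return {"_class": "Analyzer", "language": self.language}

    @classmethod
    def load(cls, data):
        return cls(data["language"])

# Maps "_class" tags to the classes that know how to load them.
REGISTRY = {"Analyzer": Analyzer}

def revive(data):
    """object_hook: turn tagged mappings back into objects."""
    cls = REGISTRY.get(data.get("_class"))
    return cls.load(data) if cls else data

schema = {"fields": {"title": {"analyzer": Analyzer("en").dump()}}}
encoded = json.dumps(schema, indent=2, sort_keys=True)
decoded = json.loads(encoded, object_hook=revive)
```

It works, and the file stays valid JSON, but the reserved key is a convention
every reader has to know about. YAML's tags do the same job natively, which is
the trade-off we'd be weighing.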
Marvin Humphrey