On 12.10.2014 20:17, Andrei Alexandrescu wrote:
Here's my destruction of std.data.json.
* lexer.d:
** Beautifully done. From what I understand, if the input is string or
immutable(ubyte)[] then the strings are carved out as slices of the
input, as opposed to newly allocated. Awesome.
** The string after lexing is correctly scanned and stored in raw format
(escapes are not rewritten) and decoded on demand. Problem with decoding
is that it may allocate memory, and it would be great (and not
difficult) to make the lexer 100% lazy/non-allocating. To achieve that,
lexer.d should define TWO "Kind"s of strings at the lexer level: regular
string and undecoded string. The former is lexer.d's way of saying "I
got lucky" in the sense that it didn't detect any '\\' so the raw and
decoded strings are identical. No need for anyone to do any further
processing in the majority of cases => win. The latter means the lexer
lexed the string, saw at least one '\\', and leaves it to the caller to
do the actual decoding.
This is actually more or less done in unescapeStringLiteral() - if it
doesn't find any '\\', it just returns the original string. Also
JSONString allows accessing its .rawValue without doing any
decoding/allocation.
https://github.com/s-ludwig/std_data_json/blob/master/source/stdx/data/json/lexer.d#L1421
Unfortunately .rawValue can't be @nogc because the "raw" value might
have to be constructed first when the input is not a "string" (in this
case unescaping is done on-the-fly for efficiency reasons).
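For illustration, the fast path could be sketched like this (a simplified stand-in with invented names, not the actual unescapeStringLiteral code, which also handles \uXXXX and the remaining JSON escapes):

```d
import std.algorithm.searching : canFind;
import std.array : appender;

// Simplified sketch: return the input slice unchanged when it contains
// no '\\'; only the slow path allocates. (\uXXXX handling omitted.)
string unescapeSketch(string raw)
{
    if (!raw.canFind('\\'))
        return raw; // "I got lucky": raw and decoded are identical

    auto app = appender!string();
    for (size_t i = 0; i < raw.length; i++)
    {
        if (raw[i] == '\\' && i + 1 < raw.length)
        {
            switch (raw[++i])
            {
                case 'n':  app.put('\n'); break;
                case 't':  app.put('\t'); break;
                case 'r':  app.put('\r'); break;
                case '"':  app.put('"');  break;
                case '\\': app.put('\\'); break;
                default:   app.put(raw[i]); // \uXXXX etc. left out here
            }
        }
        else app.put(raw[i]);
    }
    return app.data;
}
```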
** After moving the decoding business out of lexer.d, a way to take this
further would be to qualify lexer methods as @nogc if the input is
string/immutable(ubyte)[]. I wonder how to implement a conditional
attribute. We'll probably need a language enhancement for that.
Isn't @nogc inferred? Everything is templated, so that should be
possible. Or does attribute inference only work for template functions
and not for methods of templated types? Should it?
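For reference, inference does apply to methods of templated aggregates, since those are themselves template instances; a minimal sketch (invented names, nothing from the real lexer):

```d
// The methods carry no explicit attributes; the compiler infers
// @nogc/nothrow/etc. when the operations on Input permit it.
struct Lexer(Input)
{
    Input input;

    ubyte front() { return input[0]; }
    void popFront() { input = input[1 .. $]; }
    bool empty() { return input.length == 0; }
}

// Compiles only if inference succeeded for immutable(ubyte)[] input:
@nogc nothrow void consume(ref Lexer!(immutable(ubyte)[]) lex)
{
    while (!lex.empty) lex.popFront();
}
```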
** The implementation uses manually defined tagged unions to do its
work. Could we use Algebraic instead - dogfooding and all that? I recall
there was a comment in Sönke's original work that Algebraic has a
specific issue (was it false pointers?) - so the question arises: should
we fix Algebraic and use it, thus helping other use cases as well?
I had started on an implementation of a type and ID safe TaggedAlgebraic
that uses Algebraic for its internal storage. If we can get that in
first, it should be no problem to use it instead (with no or minimal API
breakage). However, it uses a struct instead of an enum to define the
"Kind" (which is the only nice way I could conceive to safely couple
enum value and type at compile time), so it's not as nice in the
generated documentation.
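Roughly, the struct-based coupling could look like this (all names invented for illustration; not the actual TaggedAlgebraic code):

```d
// Each kind is a manifest constant of a templated struct, so the tag
// value and its payload type stay coupled at compile time.
struct Kind(T, int id)
{
    alias Type = T;
    enum ID = id;
}

// A "Kind" namespace defined as a struct instead of an enum:
struct JSONValueKind
{
    enum null_   = Kind!(typeof(null), 0)();
    enum boolean = Kind!(bool,         1)();
    enum number  = Kind!(double,       2)();
    enum text    = Kind!(string,       3)();
}

// The payload type is recoverable from the tag:
static assert(is(typeof(JSONValueKind.number).Type == double));
```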
** I see the "boolean" kind, should we instead have the "true_" and
"false_" kinds?
I always found it cumbersome and awkward to work like that. What would
be the reason to go that route?
** Long story short I couldn't find any major issue with this module,
and I looked! I do think the decoding logic should be moved outside of
lexer.d or at least the JSONLexerRange.
* generator.d: looking good, no special comments. Like the consistent
use of structs filled with options as template parameters.
* foundation.d:
** At four words per token, Location seems pretty bulky. How about
reducing line and column to uint?
Single-line JSON files >64k (or line counts >64k) are not uncommon, so
that would only work in a limited way. My thought about this was that it
is quite unusual to actually store the tokens for most purposes
(especially when directly serializing to a native D type), so the size
should have minimal impact on performance or memory consumption.
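For comparison, assuming a plausible field layout (the real Location may differ), on a typical 64-bit target:

```d
struct LocationWide   { string file; size_t line, column; } // 16+8+8 = 32 bytes
struct LocationNarrow { string file; uint   line, column; } // 16+4+4 = 24 bytes
```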
** Could JSONException create the message string in toString (i.e.
when/if used) as opposed to in the constructor?
That could of course be done, but then you'd not get the full error
message using ex.msg, only with ex.toString(), which usually prints a
call trace in addition. Alternatively, it's also possible to completely
avoid using exceptions with LexOptions.noThrow.
* parser.d:
** How about using .init instead of .defaults for options?
I'd slightly prefer the more explicit "defaults", especially because
"init" could mean either "defaults" or "none" (currently it means
"none"). But another idea would be to invert the option values so
that defaults == none... any objections?
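The inversion could look like this (flag names invented for illustration):

```d
// Flags are phrased so that "no bits set" is the default behavior,
// making Options.init, Options.none and the defaults all coincide.
enum Options
{
    none            = 0,      // == Options.init == the defaults
    noTrackLocation = 1 << 0, // inverted: location tracking is on by default
    noThrow         = 1 << 1, // inverted: throwing is on by default
}

static assert(Options.init == Options.none);
```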
** I'm a bit surprised by JSONParserNode.Kind. E.g. the objectStart/End
markers shouldn't appear as nodes. There should be an "object" node
only. I guess that's needed for laziness.
While you could infer the end of an object in the parser range by
looking for the first entry that doesn't start with a "key" node, the
same would not be possible for arrays, so in general the end marker *is*
required. Note that the parser range is a StAX-style parser, which is
still very close to the lexical structure of the document.
I was also wondering if there might be a better name than
"JSONParserNode". It's not really embedded in a tree or graph
structure, as the name tends to suggest.
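For illustration, the node stream for a small document would look something like this (kind names taken loosely from the discussion; the exact naming may differ):

```d
// Input: {"a": [1, 2]}
//
//   objectStart
//   key "a"
//   arrayStart
//   literal 1
//   literal 2
//   arrayEnd     <- indispensable: no "key" node marks where the array ends
//   objectEnd
```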
** It's unclear where memory is being allocated in the parser. @nogc
annotations wherever appropriate would be great.
The problem is that the parser accesses the lexer, which in turn
accesses the underlying input range, which in turn could allocate.
Depending on the options passed to the lexer, it could also throw, and
thus allocate, an exception. In the end only JSONParserRange.empty could
generally be made @nogc.
However, attribute inference should be possible here in theory (the
noThrow option is compile-time).
* value.d:
** Looks like this is/may be the only place where memory is being
managed, at least if the input is string/immutable(ubyte)[]. Right?
Yes, at least when setting aside optional exceptions and lazy allocations.
** Algebraic ftw.
============================
Overall: This is very close to everything I hoped! A bit more care to
@nogc would be awesome, especially with the upcoming focus on memory
management going forward.
I've tried to use @nogc (as well as nothrow) in more places, but mostly
due to not knowing if the underlying input range allocates, it hasn't
really been possible. Even on lower levels (private functions) almost
any Phobos function that is called is currently not @nogc for reasons
that are not always obvious, so I gave up on that for now.
After one more pass it would be great to move forward for review.
There is also still one pending change that I haven't finished yet:
the optional UTF input validation (never validate "string" inputs, but
do validate "ubyte[]" inputs).
Oh, and there is the open issue of how to allocate in the case of
non-array inputs. Initially I wanted to postpone this until we have an
allocators module, but Walter would like to have a way to do manual
memory management in the initial version. However, the ideal design is
still unclear to me - it would either simply resemble a general
allocator interface, or could use something like a callback that returns
an output range, which would probably be quite cumbersome to work with.
Any ideas in this direction would be welcome.
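To make the two options concrete, they might look roughly like this (entirely hypothetical names and signatures):

```d
// (a) a general allocator interface handed to the lexer/parser
interface JSONAllocator
{
    void[] allocate(size_t size);
    void deallocate(void[] block);
}

// (b) a callback that returns an output range; the lexer would fill it
// with the decoded characters of each string that must be materialized
auto lexJSON(alias makeSink, Input)(Input input)
{
    // whenever a string needs decoding:
    //   auto sink = makeSink();          // e.g. an Appender or fixed buffer
    //   foreach (ch; decodedChars) sink.put(ch);
}
```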
Sönke