Next update!
I got both python [0] and js [1] serializers to work! I can't say they are
complete, and I don't have tests yet, but from my hand testing they seem usable.
I also added ./tools/serialize.js|py to both repositories.
So now I have:
- two parsers that produce the same JSON AST
- serializers that can take that AST and reproduce L20n
Which means that we should be able to freely interact between js and python and
also read/write L20n for tools purposes.
Axel, I also removed the unescape dependency from the JS parser, so you should be
able to use it in Aisle.
Working on that brought up three topics that I have so far left unresolved:
1) Source notation. Currently neither parser stores any information about the
position of syntax nodes in the source. I believe it would be worth figuring
out how we want to handle that. The first idea that comes to mind is that we
could just add a kvp on the node object like 'source': {'start': 49, 'end': 102,
'string': '...'} for an editor to use.
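To make the idea concrete, here is a minimal sketch of attaching such a kvp to a node. The helper name with_source and the node shape are made up for illustration; neither parser has this today.

```python
def with_source(node, text, start, end):
    """Return a copy of an AST node dict with a 'source' entry attached.

    Hypothetical helper: 'start' and 'end' are character offsets into
    the original L20n text, and 'string' is the raw slice between them.
    """
    annotated = dict(node)
    annotated['source'] = {
        'start': start,
        'end': end,
        'string': text[start:end],
    }
    return annotated


text = '<foo "hey">\n<bar "ho">'
node = {'type': 'Entity', 'id': 'bar'}  # assumed node shape, not the real AST
annotated = with_source(node, text, 12, 22)
print(annotated['source']['string'])  # <bar "ho">
```

Storing the raw slice next to the offsets would let an editor show the original text without re-reading the file, at the cost of a larger AST.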
2) String notations. When a string is used it may be surrounded by ", ' or (in
the future) """ or '''. Once we parse it, we don't store this information, so
on serialization we cannot reuse it.
We could guess (for example: multiline uses triple quotes, single line uses "
unless it has " inside it and no ', in which case it uses '), but we could also
somehow store it on the string.
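The guessing option could look roughly like this. It is only a sketch of the heuristic described above; guess_quote is a hypothetical name, and triple quotes are not supported by the parsers yet.

```python
def guess_quote(value):
    """Guess a delimiter for a string value when serializing.

    Sketch of the heuristic only:
    - multiline strings get triple double-quotes (future syntax),
    - single-line strings get ", unless they contain " and no ',
      in which case they get '.
    """
    if '\n' in value:
        return '"""'
    if '"' in value and "'" not in value:
        return "'"
    return '"'


print(guess_quote('hey'))       # "
print(guess_quote('say "hi"'))  # '
```

A string that contains both kinds of quotes still falls back to " here, so the serializer would have to escape the inner " characters.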
3) Unescaping.
Right now we do something very naive: we unescape unicode and remove the
backslash from in front of any other character, treating the following char as
non-semantic.
It works well enough; you can do <foo "hey \" ho"> or
<foo "hey \{{ var }} ho"> and it will all be stored as a simple string.
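For reference, the behaviour described above can be sketched in a few lines. This is not the code from either parser, just an approximation of the rule using a regex; naive_unescape is a made-up name, and it only covers the string-level step, not the detection of {{ }} placeables.

```python
import re

# A backslash followed by either uXXXX (four hex digits) or any single char.
_ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|(.))', re.DOTALL)


def naive_unescape(raw):
    """Turn \\uXXXX into the unicode char and drop the backslash otherwise."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return match.group(2)
    return _ESCAPE.sub(replace, raw)


print(naive_unescape(r'hey \" ho'))          # hey " ho
print(naive_unescape(r'hey \{{ var }} ho'))  # hey {{ var }} ho
```

Note that the function is lossy by design: once the backslash is dropped, nothing in the result says where it was.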
But with serialization, problems arise.
First, a unicode \uXXXX escape will be turned into a unicode char by the parser,
so the serializer will have no way to figure out which form was used in the
source and will serialize it as a unicode char.
Second, sometimes there is no way to know which escape form has been used.
For example:
<foo "hey \{{ var }}"> and <foo "hey {\{ var }}"> will produce the same AST.
During serialization we can tell that, since the AST node is a simple string
"hey {{ var }}" and not a complex string, we should escape the {{ to remove
its syntactic meaning, but we have no way to know which char was escaped in
the original.
Third, all other chars are simply unescaped, so <foo "hey \n"> will be turned
into "hey n" and <foo "hey \l"> will be turned into "hey l".
That means that when serializing we will just write them back without the
backslash.
We can limit the backslash use, and raise errors in parser if \ precedes an
unknown char, and then have rules in the serializer, to backslash a backslash,
backslash {{ and backslash string closing mark, but for chars like "\n" we will
hit the same problem as with unicode:
<foo "hey
ho"> and <foo "hey \n ho"> will produce the same AST. What should we serialize
it into?
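Here is a rough sketch of what the limited serializer rules could look like. It assumes a single-character closing mark and the three rules listed above (backslash a backslash, backslash {{ and backslash the closing mark); escape_string is a hypothetical name, not what the serializers do today.

```python
import re


def escape_string(value, quote='"'):
    """Serialize a simple string using only the three escape rules.

    Order matters: backslashes are doubled first so that the
    backslashes added by the later rules are not doubled again.
    """
    out = value.replace('\\', '\\\\')
    # Escape every '{' that is directly followed by another '{', so no
    # '{{' pair survives, even in a run like '{{{'.
    out = re.sub(r'\{(?=\{)', lambda m: '\\{', out)
    out = out.replace(quote, '\\' + quote)
    return quote + out + quote


print(escape_string('hey {{ var }}'))  # "hey \{{ var }}"
print(escape_string('hey " ho'))       # "hey \" ho"
```

A newline in the value is written out literally by this sketch, which is just one of the two possible answers to the question above.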
Would love to get your feedback!
zb.
[0] https://github.com/l20n/python-l20n/blob/master/lib/l20n/format/serializer.py
[1] https://github.com/zbraniecki/l20n.js/blob/v3-features/src/lib/format/l20n/serializer.js