On 5/1/15 2:35 AM, Zibi Braniecki wrote:
Next update!
I got both python [0] and js [1] serializers to work! I can't say they are
complete, and I don't have tests yet, but from my hand testing they seem usable.
I also added ./tools/serialize.js|py to both repositories.
So now I have:
- two parsers that produce the same JSON AST
- serializers that can take that AST and reproduce L20n
Which means that we should be able to interoperate freely between js and python and
also read/write L20n for tooling purposes.
Axel, I also removed the unescape dependency from the JS parser, so you should be able
to use it in Aisle.
Working on that brought up three topics that I have so far left unresolved:
1) Source notation. Currently neither parser stores any information about the
position of syntax nodes in the source. I believe it would be worth figuring
out how we want to handle that. The first idea that comes to mind is that we could
just add a kvp on the node object like 'source': {'start': 49, 'end': 102,
'string': '...'} for an editor to use.
Maybe look at what treehugger does via setAnnotation?
https://github.com/ajaxorg/treehugger/blob/master/lib/treehugger/tree.js
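To make the 'source' kvp idea concrete, here is a minimal sketch in Python. The
helper name annotate and the node shape ('type', 'content') are hypothetical and
not taken from either parser; only the 'start'/'end'/'string' keys come from the
proposal above.

```python
def annotate(node, source, start, end):
    """Return a copy of `node` with a 'source' entry covering source[start:end].

    Hypothetical helper: neither parser exposes this today.
    """
    annotated = dict(node)
    annotated['source'] = {
        'start': start,               # offset of the first char of the node
        'end': end,                   # offset one past the last char
        'string': source[start:end],  # the exact text as written in the file
    }
    return annotated


text = '<foo "hey">'
# The node shape here is an assumption made for illustration only.
node = annotate({'type': 'String', 'content': 'hey'}, text, 5, 10)
```

Keeping the raw slice next to the offsets would let an editor show or restore the
original text without re-reading the file.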
2) String notations. A string may be surrounded by ", ' or (in the future)
""" or '''. Once we parse it, we don't store this information, so on serialization
we cannot reuse it.
We could guess (for example: multiline uses triple quotes, single line uses " unless
it contains " and no ', in which case it uses '), but we could also somehow
store it on the string.
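The guessing rule could look roughly like this. It is only a sketch of the
heuristic described above: choose_quote is a hypothetical name, and the
triple-quote branch assumes the future multiline syntax exists.

```python
def choose_quote(value):
    """Guess a delimiter for `value` when the original one is unknown."""
    if '\n' in value:
        # Multiline: use triple quotes (future syntax, assumed here).
        return '"""'
    if '"' in value and "'" not in value:
        # Contains " but no ', so ' avoids any escaping.
        return "'"
    # Default, including the case where both quote kinds appear.
    return '"'
```

A string that contains both " and ' still falls back to ", so the serializer
would have to escape the inner " in that case.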
3) Unescaping.
Right now we do something very naive: we unescape unicode and remove a backslash
from in front of any other character, treating the following char as
non-semantic.
It works well enough, you can do: <foo "hey \" ho"> or <foo "hey \{{ var }}
ho"> and it will all be stored as a simple string.
But with serialization, problems arise.
First, unicode \uXXXX will be turned into a unicode char by the parser, so the
serializer will have no way to figure out which form of unicode was used
and will serialize it as a unicode char.
Second, sometimes there is no way to know which escape form has been used.
For example:
<foo "hey \{{ var }}"> and <foo "hey {\{ var }}"> will produce the same AST. During
serialization we can tell that, since the AST node is a simple string "hey {{ var }}" and not a complex
string, we should escape the {{ to remove its syntactic meaning, but we have no way to know which char should be
escaped.
Third, all other chars are just unescaped, so <foo "hey \n"> will be turned into <foo "hey n">
and <foo "hey \l"> will be turned into <foo "hey l">.
That means that when serializing we will just write it back without a backslash.
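The three cases can be reproduced with a small stand-in for the current
behaviour. This is a sketch, not the code from either parser: naive_unescape is
a hypothetical name, and it only models "decode \uXXXX, otherwise drop the
backslash and keep the next char".

```python
import re

# A backslash followed by either uXXXX (four hex digits) or any single char.
_ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|(.))', re.DOTALL)


def naive_unescape(raw):
    """Model of the naive unescaping described above (illustrative only)."""
    def repl(match):
        if match.group(1) is not None:
            # \uXXXX becomes the unicode character itself.
            return chr(int(match.group(1), 16))
        # Any other escaped char is kept as-is, minus the backslash.
        return match.group(2)
    return _ESCAPE.sub(repl, raw)


a = naive_unescape(r'hey \{{ var }}')
b = naive_unescape(r'hey {\{ var }}')
c = naive_unescape(r'hey \n')
```

Here a and b come out as the same string, and c loses its backslash, which is
exactly the information the serializer would need to write the source back.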
We can limit the backslash use and raise errors in the parser if \ precedes an unknown char,
and then have rules in the serializer to backslash a backslash, backslash {{ and
backslash the string closing mark, but for chars like "\n" we will hit the same
problem as with unicode:
<foo "hey
ho"> and <foo "hey \n ho"> will produce the same AST. What should we serialize
it into?
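The serializer rules mentioned above (backslash a backslash, backslash {{ and
backslash the closing mark) might be sketched as follows. escape_simple is a
hypothetical name, and the choice to escape every { that is directly followed by
another { is my assumption; it picks one canonical form and does not recover
whichever form the author originally typed.

```python
import re


def escape_simple(value, quote='"'):
    """Serialize a simple (non-complex) string value with canonical escapes."""
    # 1. Backslash a backslash. Must run first so later escapes are not doubled.
    out = value.replace('\\', '\\\\')
    # 2. Backslash any { that is directly followed by another {, so that no
    #    unescaped {{ remains in the output.
    out = re.sub(r'\{(?=\{)', lambda m: '\\{', out)
    # 3. Backslash the string closing mark.
    out = out.replace(quote, '\\' + quote)
    return quote + out + quote


serialized = escape_simple('hey {{ var }}')
```

With these rules both <foo "hey \{{ var }}"> and <foo "hey {\{ var }}"> would be
written back in the first form, and a literal newline in the value is left
untouched, so the "\n" question above stays open.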
I'm generally on the "be an editor" front.
One algorithm for the serializer could be to minimize the textual diff
between the existing content in the file and the serialized output. In
particular for unchanged entities, that'd result in no change in the
text on disk.
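A very rough approximation of that idea, assuming each entity keeps its original
source text: reuse the text verbatim unless the entity was modified. The 'source'
and 'dirty' keys and both function names are hypothetical, and this is coarser
than a real minimal-diff algorithm since it works per entity.

```python
def serialize_entity(entity, fresh_serialize):
    """Return the original text for untouched entities, else re-serialize."""
    source = entity.get('source')
    if source is not None and not entity.get('dirty', False):
        # Unchanged entity: emit the bytes that were on disk, escapes and all.
        return source['string']
    return fresh_serialize(entity)


def fresh(entity):
    # Stand-in serializer for the example; it does no escaping.
    return '<%s "%s">' % (entity['id'], entity['value'])


unchanged = {
    'id': 'foo',
    'value': 'hey {{ var }}',
    'source': {'string': '<foo "hey {\\{ var }}">'},
    'dirty': False,
}
changed = dict(unchanged, value='ho', dirty=True)

kept = serialize_entity(unchanged, fresh)
rewritten = serialize_entity(changed, fresh)
```

For the unchanged entity the {\{ form survives, because nothing is re-serialized
at all.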
Yeah, my editor-writing self doesn't believe in parsing and serializing,
I'm sorry.
Axel
Would love to get your feedback!
zb.
[0]
https://github.com/l20n/python-l20n/blob/master/lib/l20n/format/serializer.py
[1]
https://github.com/zbraniecki/l20n.js/blob/v3-features/src/lib/format/l20n/serializer.js