On 5/1/15 2:35 AM, Zibi Braniecki wrote:
Next update!

I got both python [0] and js [1] serializers to work! I can't say they are 
complete, and I don't have tests yet, but from my hand testing they seem usable.

I also added ./tools/serialize.js|py to both repositories.

So now I have:
  - two parsers that produce the same JSON AST
  - serializers that can take that AST and reproduce L20n

Which means that we should be able to freely interact between js and python and 
also read/write L20n for tools purposes.
Axel, I also removed unescape dependency from JS Parser, so you should be able 
to use it in Aisle.

Working on that brought up three topics that I have so far left unresolved:

1) Source notation. Currently neither parser stores any information about the 
position of syntax nodes in the source. I believe it would be worth figuring 
out how we want to handle that. The first idea that comes to mind is that we could 
just add a key-value pair on the node object like 'source': {'start': 49, 'end': 102, 
'string': '...'} for an editor to use.
Maybe look at what treehugger does via setAnnotation? https://github.com/ajaxorg/treehugger/blob/master/lib/treehugger/tree.js
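
A minimal sketch of the 'source' key idea in Python, assuming AST nodes are plain dicts. The annotate() helper and the node shape are made up for illustration; neither parser has this today.

```python
def annotate(node, source, start, end):
    """Return a copy of an AST node dict with a 'source' entry attached.

    Hypothetical helper: 'start' and 'end' are character offsets into the
    original file, and 'string' is the exact text the node was parsed from.
    """
    annotated = dict(node)
    annotated['source'] = {
        'start': start,
        'end': end,
        'string': source[start:end],
    }
    return annotated


source = '<foo "hey">\n<bar "ho">'
# The second entity occupies offsets 12..22 of the source.
node = annotate({'type': 'Entity', 'id': 'bar'}, source, 12, 22)
```

Storing the raw string next to the offsets lets an editor or serializer get at the original text without re-reading the file.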

2) String notations. A string may be delimited by ", ' or (in the future) 
""" or '''. Once we parse it, we don't store this information, so on serialization 
we cannot reuse it.

We could guess (for example: multiline uses triple quotes, single line uses " unless 
the string contains " and no ', in which case it uses '), but we could also somehow 
store it on the string.
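
The guessing rule could be sketched like this in Python. choose_quote() is a hypothetical name, and the triple-quote branch assumes the future multiline syntax mentioned above.

```python
def choose_quote(value):
    """Guess a delimiter for a string value when the original is unknown."""
    if '\n' in value:
        # Multiline strings get triple quotes (future syntax, an assumption).
        return '"""'
    if '"' in value and "'" not in value:
        # Switching to ' avoids having to escape the embedded ".
        return "'"
    # Default; any embedded " would have to be escaped by the serializer.
    return '"'
```

The alternative is to record the delimiter on the string node at parse time, in which case no guessing is needed.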

3) Unescaping.

Right now we do something very dumb: we unescape unicode and remove the backslash 
from in front of any other character, treating the following char as 
non-semantic.

It works well enough, you can do: <foo "hey \" ho"> or <foo "hey \{{ var }} 
ho"> and it will all be stored as a simple string.

But with serialization, problems arise.

First, unicode \uXXXX will be turned into a unicode char by parser so the 
serializer will have no way to figure out what form of unicode has been used 
and will serialize it as a unicode char.

Second, sometimes there is no way to know which escape form has been used. 
Like:

<foo "hey \{{ var }}"> and <foo "hey {\{ var }}"> will produce the same AST. During 
serialization we can tell that, since the AST node is a simple string "hey {{ var }}" 
and not a complex string, we should escape the {{ to remove its syntactic meaning, 
but we have no way to know which char should be escaped.

Third, all other chars are just unescaped, so <foo "hey \n"> will be turned into 
<foo "hey n"> and <foo "hey \l"> will be turned into <foo "hey l">.

That means that when serializing we will just write it back without a backslash.
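
To make the three cases concrete, here is a rough Python approximation of the unescaping described above. It is a sketch of the behavior, not the code from either parser.

```python
import re

# A backslash followed by either uXXXX (four hex digits) or any single char.
_ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|(.))', re.DOTALL)


def unescape(raw):
    """Decode \\uXXXX and drop the backslash in front of any other char."""
    def repl(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return match.group(2)
    return _ESCAPE.sub(repl, raw)


# Both escape forms collapse to the same value, so the form is lost.
same = unescape(r'hey \{{ var }}') == unescape(r'hey {\{ var }}')
```

With this, unescape(r'hey \n') gives 'hey n' and unescape(r'caf\u00e9') gives 'café', so none of the original escapes can be recovered from the value alone.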

We can limit the backslash use and raise errors in the parser if \ precedes an 
unknown char, and then have rules in the serializer to backslash a backslash, 
backslash {{ and backslash the string closing mark, but for chars like "\n" we 
will hit the same problem as with unicode:

<foo "hey
  ho"> and <foo "hey \n ho"> will produce the same AST. What should we serialize 
it into?
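
For reference, the serializer rules (backslash a backslash, backslash {{, backslash the closing mark) might look like this in Python. escape() is a hypothetical name, and the sketch leaves the newline question open by writing newlines out literally.

```python
def escape(value, quote='"'):
    """Serialize a simple string value with the three escaping rules."""
    # Backslashes go first so the ones added below are not doubled.
    out = value.replace('\\', '\\\\')
    # Neutralize the placeable opener so it stays a simple string.
    out = out.replace('{{', '\\{{')
    # Escape whichever delimiter the string will be wrapped in.
    out = out.replace(quote, '\\' + quote)
    return quote + out + quote
```

Note that this always picks \{{ over {\{, so a file that used the second form would change on a round trip.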
I'm generally on the "be an editor" front.

One algorithm for the serializer could be to minimize the textual diff between the existing content in the file and the serialized output. In particular for unchanged entities, that'd result in no change in the text on disk.
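
A very rough Python sketch of the unchanged-entity case, assuming the parser keeps the original text of each entity around. All names here are hypothetical, and serialize_entity() is only a stand-in for a real serializer.

```python
def serialize_entity(entity):
    # Stand-in for a real serializer; always uses double quotes.
    return '<{} "{}">'.format(entity['id'], entity['value'])


def serialize_preserving(entities, originals):
    """Reuse the on-disk text for entities that did not change.

    'originals' maps an entity id to a pair: the entity as it was parsed
    and the exact source text it was parsed from.
    """
    chunks = []
    for entity in entities:
        parsed, text = originals.get(entity['id'], (None, None))
        if parsed == entity:
            chunks.append(text)  # unchanged: emit the original bytes
        else:
            chunks.append(serialize_entity(entity))  # new or edited
    return '\n'.join(chunks)
```

This only covers whole entities; minimizing the diff inside an edited entity would need more than this.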

Yeah, my editor-writing self doesn't believe in parsing and serializing, I'm sorry.

Axel

Would love to get your feedback!
zb.

[0] 
https://github.com/l20n/python-l20n/blob/master/lib/l20n/format/serializer.py
[1] 
https://github.com/zbraniecki/l20n.js/blob/v3-features/src/lib/format/l20n/serializer.js
