Re: Path to start using l20n file format in Gaia and l20n format for v3

Zibi Braniecki Thu, 30 Apr 2015 17:36:07 -0700

Next update!

I got both python [0] and js [1] serializers to work! I can't say they are 
complete, and I don't have tests yet, but from my hand testing they seem usable.


I also added ./tools/serialize.js|py to both repositories.

So now I have:
 - two parsers that produce the same JSON AST
 - serializers that can take that AST and reproduce L20n

Which means that we should be able to interoperate freely between JS and Python
and also read/write L20n in tools.
Axel, I also removed the unescape dependency from the JS parser, so you should
be able to use it in Aisle.

Working on that brought up three topics that I have so far left unresolved:

1) Source notation. Currently neither parser stores any information about the 
position of syntax nodes in the source. I believe it would be worth figuring 
out how we want to handle that. The first idea that comes to mind is that we 
could just add a kvp on the node object like 'source': {'start': 49, 
'end': 102, 'string': '...'} for an editor to use.
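
A minimal sketch of that idea in Python, assuming AST nodes are plain dicts. 
The helper name with_source, the node keys 'type' and 'id', and the choice of 
an inclusive start and exclusive end offset are all hypothetical; the parsers 
may end up using a different shape.

```python
def with_source(node, text, start, end):
    """Return a copy of an AST node dict annotated with its source span.

    `start` is inclusive and `end` is exclusive, both character offsets
    into `text`. This convention is an assumption, not a decided format.
    """
    annotated = dict(node)
    annotated['source'] = {
        'start': start,
        'end': end,
        'string': text[start:end],
    }
    return annotated


source_text = '<foo "hey">\n<bar "ho">'
# Hypothetical node shape; the real AST may use different keys.
node = {'type': 'Entity', 'id': 'bar'}
start = source_text.index('<bar')
annotated = with_source(node, source_text, start, len(source_text))
```

An editor could then map a node back to its text through 
annotated['source'] without re-parsing the file.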

2) String notations. When a string is used it may be surrounded by ", ' or (in 
the future) """ or '''. Once we parse it, we don't store this information, so 
on serialization we cannot reuse it.

We could guess (for example: multiline uses triple quotes; single line uses " 
unless the string contains " and no ', in which case it uses '), but we could 
also somehow store it on the string.
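
The guessing rule above could look roughly like this in Python. The function 
name guess_quote is hypothetical, and always picking """ for multiline 
strings is an assumption, since the triple-quote syntax is not final.

```python
def guess_quote(value):
    """Guess a delimiter for a parsed string value.

    Multiline values get triple quotes. Single-line values get " unless
    they contain " and no ', in which case they get '.
    """
    if '\n' in value:
        return '"""'
    if '"' in value and "'" not in value:
        return "'"
    return '"'
```

A value that contains both kinds of quote still falls back to ", so the 
serializer would have to escape the inner " characters in that case.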

3) Unescaping.

Right now we do something very naive: we unescape unicode and remove a 
backslash from in front of any other character, treating the following char as 
non-semantic.

It works well enough, you can do: <foo "hey \" ho"> or <foo "hey \{{ var }} 
ho"> and it will all be stored as a simple string.

But with serialization, problems arise.

First, unicode \uXXXX will be turned into a unicode char by parser so the 
serializer will have no way to figure out what form of unicode has been used 
and will serialize it as a unicode char.

Second, sometimes there is no way to know which escape form has been used. 
Like:

<foo "hey \{{ var }}"> and <foo "hey {\{ var }}"> will produce the same AST. 
During serialization we can identify that since the AST node is a simple string 
"hey {{ var }}" and not a complex string, we should escape the {{ to remove 
its syntactic meaning, but we have no way to know which char was escaped in 
the source.

Third, all other chars are just unescaped, so <foo "hey \n"> will be turned 
into <foo "hey n"> and <foo "hey \l"> will be turned into <foo "hey l">.

That means that when serializing we will just write it back without a backslash.
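
To make the loss concrete, here is a rough Python approximation of the 
current unescaping behaviour. It is a sketch, not the code from either parser; 
the name naive_unescape and the regular expression are assumptions.

```python
import re

# Matches \uXXXX (group 1) or a backslash followed by any one char (group 2).
_ESCAPE = re.compile(r'\\(?:u([0-9a-fA-F]{4})|(.))', re.DOTALL)


def naive_unescape(raw):
    """Unescape \\uXXXX and drop the backslash in front of any other char."""
    def replace(match):
        if match.group(1) is not None:
            return chr(int(match.group(1), 16))
        return match.group(2)
    return _ESCAPE.sub(replace, raw)


# Two different sources collapse into the same stored string.
same_placeable = (naive_unescape(r'hey \{{ var }}')
                  == naive_unescape(r'hey {\{ var }}'))
# The backslash in front of an ordinary char is simply dropped.
lost_backslash = naive_unescape(r'hey \n')
```

Here same_placeable is True and lost_backslash is 'hey n', so nothing in the 
stored value tells the serializer what the source looked like.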

We can limit backslash use: raise errors in the parser if \ precedes an 
unknown char, and then have rules in the serializer to backslash a backslash, 
{{ and the string closing mark. But for chars like "\n" we will hit the same 
problem as with unicode:

<foo "hey
 ho"> and <foo "hey \n ho"> will produce the same AST. What should we serialize 
it into?
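
For the part that is unambiguous, the serializer rules could be sketched like 
this in Python. The name escape_simple_string is hypothetical, the quote 
argument is assumed to be a single-character delimiter, and newlines are 
deliberately left untouched because that question is still open.

```python
import re


def escape_simple_string(value, quote='"'):
    """Escape a simple (non-complex) string value for serialization.

    Assumes `quote` is a single-character delimiter. Newlines are written
    through unchanged; how to serialize them is undecided.
    """
    # Backslashes first, so the escapes added below are not doubled.
    out = value.replace('\\', '\\\\')
    # Escape every { that is directly followed by another {.
    out = re.sub(r'\{(?=\{)', lambda m: '\\{', out)
    # Escape the closing mark of the surrounding string.
    out = out.replace(quote, '\\' + quote)
    return out
```

With these rules the simple string "hey {{ var }}" is always written back as 
"hey \{{ var }}", whichever form the source used.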

Would love to get your feedback!
zb.

[0] 
https://github.com/l20n/python-l20n/blob/master/lib/l20n/format/serializer.py
[1] 
https://github.com/zbraniecki/l20n.js/blob/v3-features/src/lib/format/l20n/serializer.js
_______________________________________________
tools-l10n mailing list
https://lists.mozilla.org/listinfo/tools-l10n
