On Wed, May 24, 2017 at 10:11 AM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>> Our current interface is UTF-16, so that's my target for now. I think
>> whatever cache-friendliness would be lost converting from UTF-16 -> UTF-8 ->
>> UTF-16.
>
> I hope this can be reconsidered, because the assumption that it would
> have to be UTF-16 -> UTF-8 -> UTF-16 isn't accurate.

I see that this part didn't get an on-list reply but got a blog reply:
http://www.erahm.org/2017/05/24/a-rust-based-xml-parser-for-firefox/

I continue to think it's a bad idea to write another parser that uses
UTF-16 internally. Even though I can see your desire to keep the
project tightly scoped, I think it's fair to ask you to expand the
scope a bit by 1) adding a way to pass Latin-1 data to text nodes
directly (and use this when the parser sees that a text node is all
ASCII) and 2) replacing nsScanner with a bit of new buffering code
that takes bytes from the network and converts them to UTF-8 using
encoding_rs.
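
To make the two points concrete, here is a minimal sketch in Rust
using only the standard library. It is not Gecko code and it does not
use encoding_rs: it assumes the network bytes are already UTF-8,
whereas encoding_rs would also cover other encodings and replace
malformed sequences. The names TextNodeData, text_node_data and
Utf8Buffer are hypothetical and exist only for illustration.

```rust
/// How a text node's contents could be handed to the DOM (hypothetical type).
#[derive(Debug, PartialEq)]
enum TextNodeData<'a> {
    /// All-ASCII text: every byte is also a valid Latin-1 code unit,
    /// so the bytes can be passed through without widening.
    Latin1(&'a [u8]),
    /// Anything else: converted to UTF-16 only at this boundary.
    Utf16(Vec<u16>),
}

/// Chooses the representation for one text node held as UTF-8.
fn text_node_data(text: &str) -> TextNodeData<'_> {
    if text.is_ascii() {
        TextNodeData::Latin1(text.as_bytes())
    } else {
        TextNodeData::Utf16(text.encode_utf16().collect())
    }
}

/// Buffers incoming bytes and hands out complete UTF-8 text, holding back
/// a sequence that is split across two network chunks.
/// Assumption: input is UTF-8. Malformed bytes are not replaced here; they
/// simply stay in `pending`, which a real decoder would not do.
struct Utf8Buffer {
    pending: Vec<u8>,
}

impl Utf8Buffer {
    fn new() -> Self {
        Utf8Buffer { pending: Vec::new() }
    }

    /// Appends a chunk and returns the longest valid UTF-8 prefix
    /// of everything buffered so far.
    fn push(&mut self, chunk: &[u8]) -> String {
        self.pending.extend_from_slice(chunk);
        let valid_len = match std::str::from_utf8(&self.pending) {
            Ok(s) => s.len(),
            Err(e) => e.valid_up_to(),
        };
        // Keep the unvalidated tail for the next call.
        let rest = self.pending.split_off(valid_len);
        let ready = std::mem::replace(&mut self.pending, rest);
        String::from_utf8(ready).expect("prefix was validated above")
    }
}
```

With this sketch, pushing the bytes [b'a', 0xC3] and then [0xA9, b'b']
yields "a" followed by "éb", and a text node containing only "abc"
takes the Latin1 path with no UTF-16 buffer involved.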

We've both had the displeasure of modifying nsScanner as part of a
security fix. nsScanner isn't valuable code that we should try to
keep. It's no longer scanning for anything. It's just an
over-complicated way of maintaining a buffer of UTF-16 data. While
nsScanner and the associated classes are a lot of code, they do
something simple that should be done in quite a bit less code, so as
scope creep, replacing nsScanner should be a drop in the bucket
effort-wise compared to replacing expat.

I think it's super-sad if we get another UTF-16-using parser because
replacing nsScanner was scoped out of the project.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform