On Wed, May 24, 2017 at 10:11 AM, Henri Sivonen <hsivo...@hsivonen.fi> wrote:
>> Our current interface is UTF-16, so that's my target for now. I think
>> whatever cache-friendliness would be lost converting from UTF-16 -> UTF-8 ->
>> UTF-16.
>
> I hope this can be reconsidered, because the assumption that it would
> have to be UTF-16 -> UTF-8 -> UTF-16 isn't accurate.

I see that this part didn't get an on-list reply but got a blog reply:
http://www.erahm.org/2017/05/24/a-rust-based-xml-parser-for-firefox/

I continue to think it's a bad idea to write another parser that uses UTF-16 internally. Even though I can see your desire to keep the project tightly scoped, I think it's fair to ask you to expand the scope a bit by 1) adding a way to pass Latin-1 data to text nodes directly (and using this when the parser sees that a text node is all ASCII) and 2) replacing nsScanner with a bit of new buffering code that takes bytes from the network and converts them to UTF-8 using encoding_rs.

We've both had the displeasure of modifying nsScanner as part of a security fix. nsScanner isn't valuable code that we should try to keep. It's no longer scanning for anything. It's just an over-complicated way of maintaining a buffer of UTF-16 data.

While nsScanner and the associated classes are a lot of code, they do something simple that should be doable in quite a bit less code. So, as scope creep goes, replacing nsScanner should be a drop in the bucket effort-wise compared to replacing expat.

I think it would be super-sad if we got another UTF-16-using parser because replacing nsScanner was scoped out of the project.

-- 
Henri Sivonen
hsivo...@hsivonen.fi
https://hsivonen.fi/
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
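
To make the two asks above concrete, here is a minimal Rust sketch of roughly what they could look like. Everything in it is an assumption for illustration: `TextNodeData`, `text_node_data`, and `Utf8Buffer` are hypothetical names, not Gecko or encoding_rs APIs. To stay standard-library-only, the buffer assumes the network bytes are already UTF-8 and merely validates them; in the actual proposal that step would be done by an encoding_rs decoder so that other encodings are handled too.

```rust
// Hypothetical sketch: none of these names exist in Gecko or encoding_rs.

/// How a text node's characters could be handed to the DOM.
#[derive(Debug, PartialEq)]
enum TextNodeData<'a> {
    /// Every byte is ASCII, so the bytes are also valid Latin-1 as-is.
    Latin1(&'a [u8]),
    /// Non-ASCII text, converted to UTF-16 only at this boundary.
    Utf16(Vec<u16>),
}

/// Ask 1: pick the Latin-1 path whenever the text is all ASCII.
fn text_node_data(text: &str) -> TextNodeData<'_> {
    if text.is_ascii() {
        TextNodeData::Latin1(text.as_bytes())
    } else {
        TextNodeData::Utf16(text.encode_utf16().collect())
    }
}

/// Ask 2 (stand-in): accumulate network chunks as UTF-8 text.
/// Assumes the input is already UTF-8; a real version would decode
/// from the document's encoding instead of only validating.
#[derive(Default)]
struct Utf8Buffer {
    /// Bytes of a sequence that was cut off at a chunk boundary.
    pending: Vec<u8>,
    /// Text that is ready for the parser.
    text: String,
}

impl Utf8Buffer {
    fn push(&mut self, chunk: &[u8]) {
        self.pending.extend_from_slice(chunk);
        loop {
            match std::str::from_utf8(&self.pending) {
                Ok(s) => {
                    self.text.push_str(s);
                    self.pending.clear();
                    return;
                }
                Err(e) => {
                    let valid = e.valid_up_to();
                    // The prefix up to `valid` was verified by from_utf8.
                    let prefix = std::str::from_utf8(&self.pending[..valid]).unwrap();
                    self.text.push_str(prefix);
                    match e.error_len() {
                        // Definitely invalid bytes: replace them and continue.
                        Some(len) => {
                            self.text.push('\u{FFFD}');
                            drop(self.pending.drain(..valid + len));
                        }
                        // Incomplete sequence at the end: wait for more bytes.
                        None => {
                            drop(self.pending.drain(..valid));
                            return;
                        }
                    }
                }
            }
        }
    }
}
```

The point of the sketch is only that the buffer stays UTF-8 throughout, and that the ASCII check lets text nodes skip the UTF-16 conversion in the common case.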