Our mission today is to use BaseX to remove tags injected right between
the bytes of multibyte UTF-8 characters.

http://www.couchsurfing.org/group_read.html?gid=430&post=13986932

>>>>> "CG" == Christian Grün <christian.gr...@gmail.com> writes:
CG> Have you tried method=raw, as mentioned in our documentation
CG> (http://docs.basex.org/wiki/Serialization)?

Sorry, that does not help. Try it yourself:
echo '<A>你好</A>'|perl -pwle 's![^[:ascii:]]!$&<wbr/>!'|basex -q '
      declare option db:parser "html";
      declare option output:method "raw";
      doc("/dev/stdin")//*:wbr/..'

There is no way to cleanly restore the shattered UTF-8.

I would also like to try

      declare option output:encoding "RAW"; or "BYTES" or "NONE"

but on
http://docs.basex.org/wiki/Serialization
it just says
"all encodings supported by Java"
So one is supposed to look at
http://www.google.com/search?q=all+encodings+supported+by+Java
etc. etc.

Why doesn't BaseX have a command that would output the current list of
"all encodings supported by Java"
that it is using?
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
