As Liam indicated (thanks!), XQuery may not be the best choice for processing data at the byte level: XQuery was built to work with Unicode characters as its basic unit, which means that pure XQuery can never create illegal UTF-8 sequences. It also means that the language provides no support for "repairing" invalid input.
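To illustrate why the repair has to happen on octets rather than on characters, here is a minimal sketch in Python (not XQuery, and not part of BaseX). It mirrors the first two substitutions of the quoted Perl script; the function name `repair_shattered_utf8` and the sample data are made up for the example. Once the input has been decoded leniently, the split octets have already become U+FFFD and the original character is gone, so the cleanup must run on the raw bytes first.

```python
import re

# Assumption: the damage consists of a literal "<wbr/>" tag or a single space
# inserted between the octets of one UTF-8 sequence, as in the quoted Perl script.
WBR = b"<wbr/>"
NON_ASCII_PAIR = re.compile(rb"([^\x00-\x7f]) ([^\x00-\x7f])")


def repair_shattered_utf8(raw: bytes) -> bytes:
    """Remove markup and spaces that split multi-byte UTF-8 sequences.

    Works on raw bytes, before any decoding takes place. Like the Perl
    original, the space rule is a heuristic: it also removes a genuine
    space that sits between two non-ASCII characters.
    """
    repaired = raw.replace(WBR, b"")                   # s!<wbr/>!!g
    repaired = NON_ASCII_PAIR.sub(rb"\1\2", repaired)  # s!($N) ($N)!$1$2!g
    return repaired


if __name__ == "__main__":
    intact = "caf\u00e9".encode("utf-8")       # the last character is two octets
    shattered = intact[:4] + WBR + intact[4:]  # tag inserted between those octets

    # Decoding first loses the character: the split octets become U+FFFD.
    print(shattered.decode("utf-8", errors="replace"))

    # Repairing the bytes first lets the strict decoder succeed.
    print(repair_shattered_utf8(shattered).decode("utf-8"))
```

The same order of operations applies whatever tool is used: clean up the octets first, and only then hand the result to something that expects well-formed UTF-8, such as an XQuery processor.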
I wonder if you have enough control over your input to avoid UTF-8 shattering? If there’s no choice, and if you still want to try XQuery/BaseX for byte processing, you can play around with the functions of the Conversion Module:

http://docs.basex.org/wiki/Conversion_Module
___________________________

On Tue, Jan 1, 2013 at 5:50 AM, <jida...@jidanni.org> wrote:
>>>>>> "LREQ" == Liam R E Quin <l...@w3.org> writes:
>
> LREQ> Treating the individual UTF-8 octets individually?
> Yes.
> LREQ> Not in standard XQuery, but that doesn't preclude a BaseX extension...
> Well no big deal, I was just curious.
>
>>> I was just curious if there was a way in basex if I could do s!<wbr/>!!g
>>> like I can do in perl, to restore the damaged UTF-8 characters.
>
> LREQ> Note that "damaged UTF-8 characters", if by that you mean not
> LREQ> well-formed UTF-8, aren't going to come through email reliably, so I
> LREQ> might not be seeing what you wrote - s!<wbr/>!!g can be done with
>
> Don't worry. I wouldn't put any illegal chars into mail.
>
> LREQ> replace() but getting at UTF-8-encoded characters one octet at a time is
> LREQ> another matter. But, my goal in replying was to tease out enough
> LREQ> information from you that someone else could answer :-)
>
>>> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
> LREQ> This says, "this thread has been deleted" at me.
> In fact they deleted the entire group it turns out.
>
> Anyway here's what I posted there
> #!/usr/bin/perl
> # Shows line where we remove couchsurfing.org's UTF-8 shattering effects.
> # Must run this before the browser gets its hands on it and turns the
> # shattered UTF-8 into U+FFFD REPLACEMENT CHARACTER.
> # So that seems to count out greasemonkey, etc. solutions.
> # I used wwwoffle -o URL|./this_program after first browsing the page logged in
> # in a browser that used wwwoffle as a proxy
> # Copyright : http://www.fsf.org/copyleft/gpl.html
> # Author : Dan Jacobson -- http://jidanni.org/
> # Created On : 12/31/2012
> # Last Modified On: Mon Dec 31 13:12:57 2012
> # Update Count : 27
> use strict;
> use warnings FATAL => 'all';
> my $N = qr/[^[:ascii:]]/;
> while (<>) {
>     my $original_line = $_;
>     ## needed on e.g., http://www.couchsurfing.org/couchmanager?read=18541584
>     s!<wbr/>!!g;
>     ## needed on e.g.,
>     ## http://www.couchsurfing.org/couchrequest/show_couchoffer_form?city_couchrequest=1223052
>     s!($N) ($N)!$1$2!g;
>     s!\t<span class="show_more_control">\s+<br />!! && chomp;
>     m!^\s+...<a class="show_more_link" href="#"> \(more\) </a><br />! && next;
>     s!\s*</span><span class="show_more_text" style="display: none;"> !!;
>     print "$.: $_" if $_ ne $original_line;
> }