As Liam indicated (thanks!), XQuery may not be the best choice for
processing data at the byte level: XQuery was built to work with Unicode
characters as its basic unit, which means that it will never be possible
to create illegal UTF-8 sequences with pure XQuery. This also means
that the language provides no support for "repairing" invalid input.
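
To illustrate why such a repair has to happen before the bytes are decoded
into characters, here is a minimal sketch in Python (chosen only for
illustration; it is not XQuery or BaseX code). The sample bytes and the
function name repair are made up for this example, and the two substitutions
only approximate the ones in the Perl script quoted below:

```python
import re

def repair(data: bytes) -> bytes:
    """Undo the two kinds of damage handled by the quoted Perl script."""
    # Drop <wbr/> tags that were inserted between the bytes of a character.
    data = data.replace(b"<wbr/>", b"")
    # Drop a single space that sits between two non-ASCII bytes.
    return re.sub(rb"(?<=[\x80-\xff]) (?=[\x80-\xff])", b"", data)

# Hypothetical sample: "é" is C3 A9 in UTF-8, split here by a stray tag.
shattered = b"caf\xc3<wbr/>\xa9"

# Decoding first turns each orphaned byte into U+FFFD; the damage is permanent.
too_late = shattered.decode("utf-8", errors="replace")

# Repairing the bytes first yields a well-formed sequence again.
restored = repair(shattered).decode("utf-8")

print(repr(too_late))
print(restored)
```

Once the input has been decoded, the original bytes are gone, which is the
situation a character-based language like XQuery is always in.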

I wonder if you have enough control over your input to avoid the UTF-8
shattering in the first place? If there’s no choice, and if you still want to try
XQuery/BaseX for byte processing, you can play around with the
functions of the Conversion Module:

  http://docs.basex.org/wiki/Conversion_Module
___________________________

On Tue, Jan 1, 2013 at 5:50 AM,  <jida...@jidanni.org> wrote:
>>>>>> "LREQ" == Liam R E Quin <l...@w3.org> writes:
> LREQ> Treating the individual UTF-8 octets individually?
> Yes.
> LREQ> Not in standard XQuery, but that doesn't preclude a BaseX extension...
> Well no big deal, I was just curious.
>>> I was just curious if there was a way in basex if I could do s!<wbr/>!!g
>>> like I can do in perl, to restore the damaged UTF-8 characters.
>
> LREQ> Note that "damaged UTF-8 characters", if by that you mean not
> LREQ> well-formed UTF-8, aren't going to come through email reliably, so I
> LREQ> might not be seeing what you wrote - s!<wbr/>!!g can be done with
>
> Don't worry. I wouldn't put any illegal chars into mail.
>
> LREQ> replace() but getting at UTF-8-encoded characters one octet at a time is
> LREQ> another matter. But, my goal in replying was to tease out enough
> LREQ> information from you that someone else could answer :-)
>
>>> http://www.couchsurfing.org/group_read.html?gid=430&post=13998575
> LREQ> This says, "this thread has been deleted" at me.
> In fact they deleted the entire group it turns out.
>
> Anyway here's what I posted there
> #!/usr/bin/perl
> # Shows line where we remove couchsurfing.org's UTF-8 shattering effects.
> # Must run this before the browser gets its hands on it and turns the
> # shattered UTF-8 into U+FFFD REPLACEMENT CHARACTER.
> # So that seems to count out greasemonkey, etc. solutions.
> # I used wwwoffle -o URL|./this_program after first browsing the page logged
> # in in a browser that used wwwoffle as a proxy
> # Copyright       : http://www.fsf.org/copyleft/gpl.html
> # Author          : Dan Jacobson -- http://jidanni.org/
> # Created On      : 12/31/2012
> # Last Modified On: Mon Dec 31 13:12:57 2012
> # Update Count    : 27
> use strict;
> use warnings FATAL => 'all';
> my $N = qr/[^[:ascii:]]/;
> while (<>) {
>     my $original_line = $_;
> ## needed on e.g., http://www.couchsurfing.org/couchmanager?read=18541584
>     s!<wbr/>!!g;
> ## needed on e.g.,
> ## http://www.couchsurfing.org/couchrequest/show_couchoffer_form?city_couchrequest=1223052
>     s!($N) ($N)!$1$2!g;
>     s!\t<span class="show_more_control">\s+<br />!! && chomp;
>     m!^\s+...<a class="show_more_link" href="#"> \(more\) </a><br />! && next;
>     s!\s*</span><span class="show_more_text" style="display: none;"> !!;
>     print "$.: $_" if $_ ne $original_line;
> }
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
