Re: Strange characters

Akkana Peck Sun, 19 Sep 2004 18:41:41 -0700

Brian Wanamaker writes:
> >...despite appearing normal on the web IE6, Firefox, and under OS 
> >X/Safari, on my Palm it gets artifacted characters in place of 
> >curly-quotes and "real" apostrophes (not double-primes and 
> >single-prime, respectively). He says he writes his pieces on a Mac, 
> >under Works or Word, which auto-curly-quotes, then cut-and-pastes it 
> >into Blogger. It displays fine in browsers, but Plucker displays 
> >something with a euro-denomination character in it.

I hit this problem all the time, on lots of different pages (using
the Python parser).  I've looked for "the right" way to do this in
Python, so that I could contribute a fix, but haven't found any
Python UTF-8-to-ASCII translation libraries, or even a comprehensive
table of UTF-8 values that would let me write such a library.

So I've written a hack to the Python parser to translate the most
common characters I see in web pages, and when I get a page that
uses some new character a lot, I add that character to the list.
Not a good solution, but if you have one page you pluck often,
it'll work for that.  See the code at the end of this message.

David A. Desrosiers writes:
>       I then tried to pluck it with the Python distiller, passing in 
> the right charset, and it fails to parse it properly. I think that 
> might be a bug.

For pages on which the distiller doesn't fail, what would be the
right way to do it?

For example, here's a page that uses a lot of three-byte UTF-8
sequences for things like the ellipsis and em dash.  I tried specifying
a charset, but it didn't help.  I ran this command:

plucker-build -H http://riverbendblog.blogspot.com/ -N "Riverbend" \
    -f riverbend --noimages --stayonhost --zlib-compression --maxdepth 1 \
    --charset=utf-8

But the result still displays on the Palm with lots of 3-char
sequences like It<a-hat><Euro><trademark>s instead of It's.
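
The a-hat/Euro/trademark pattern is what a three-byte UTF-8 sequence
looks like when each byte is drawn as a separate single-byte character.
Here is a minimal sketch in present-day Python 3 (not the Python of the
Plucker parser), assuming the Palm's character set agrees with
Windows-1252 for these three bytes:

```python
# Sketch: reproduce the garbling by decoding UTF-8 bytes as Windows-1252.
# Assumption: the Palm's character set matches cp1252 for these bytes.
text = "It\u2019s"                     # "It's" with a curly apostrophe (U+2019)
utf8_bytes = text.encode("utf-8")      # three bytes for the apostrophe
garbled = utf8_bytes.decode("cp1252")  # each byte becomes its own character

print(utf8_bytes)   # the raw bytes, including \xe2\x80\x99
print(garbled)      # a-hat, Euro sign, trademark sign between "It" and "s"
```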

With the following hack, I see It's, with no a-hat, euro, or TM.

First, around line 882 (just before def add_text) I add this
(and this is where you can add additional sequences for pages
you pluck often):

    # "unencode" -- strip out codes the palm won't understand,
    # for things like smartquote characters.
    def unencode (self, line):
        # Each octal escape sequence is the UTF-8 encoding of one character.
        line = line.replace ("\342\200\224", "--")   # em dash
        line = line.replace ("\342\200\230", "`")    # left single quote
        line = line.replace ("\342\200\231", "'")    # right single quote
        line = line.replace ("\342\200\234", "\"")   # left double quote
        line = line.replace ("\342\200\235", "\"")   # right double quote
        line = line.replace ("\342\200\246", "...")  # ellipsis
        return line

Then, a few lines down in add_text, just after the "while 1", add
a call to unencode:

            while 1:
                line = self.unencode(line)
                new_size = self._approximate_size + len (line)
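
For comparison, the same idea can be written as a lookup table applied
after decoding, rather than as a chain of byte-string replacements.
This is only a sketch in present-day Python 3, not Plucker code: the
names ASCII_FALLBACKS and to_ascii are made up for illustration, and the
table covers only the six characters that unencode handles.

```python
# Hypothetical table: Unicode code point -> plain ASCII replacement.
ASCII_FALLBACKS = {
    0x2014: "--",   # em dash
    0x2018: "`",    # left single quote
    0x2019: "'",    # right single quote / apostrophe
    0x201C: '"',    # left double quote
    0x201D: '"',    # right double quote
    0x2026: "...",  # ellipsis
}

def to_ascii(raw, charset="utf-8"):
    """Decode raw bytes, apply the fallback table, and return ASCII text."""
    text = raw.decode(charset, errors="replace")
    text = text.translate(ASCII_FALLBACKS)
    # Anything still outside ASCII becomes "?" instead of raising an error.
    return text.encode("ascii", errors="replace").decode("ascii")

sample = b"It\xe2\x80\x99s \xe2\x80\x9cfine\xe2\x80\x9d\xe2\x80\xa6"
print(to_ascii(sample))
```

New characters are added by extending the table, and anything the table
does not know about shows up as a single "?" rather than three stray
characters.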

Hope this helps!  Sorry it's such a hack; I'd love to hear of a
better and more comprehensive solution.

        ...Akkana
