Andrew Knox writes:
> Dear All,
> I'm using plucker desktop 1.4.0.2 on windows.
> 
> How do I prevent plucker from displaying apostrophes as various strange
> characters?  (It usually looks like an a-circumflex, euro,
> th-superscript).

Since no one else has replied ...  I don't have a real answer,
but I have a workaround, since I've been seeing too many of these. 
This is a hack.  I haven't submitted it for inclusion because the
right solution involves referring to a complete and correct
character table.  There's probably a Python library somewhere to do
that, though a quick Google search for keywords like "python utf8 to
ascii" didn't turn up anything obvious.
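
For comparison, here is a rough sketch of what a table-driven approach
could look like using Python's built-in codecs rather than a hand-made
list.  It is not part of Plucker.  The function name recode_utf8 is
made up for illustration, and the choice of "cp1252" as the target is
an assumption (that the handheld's character set is close to
Windows-1252); a different target codec may be more appropriate.

```python
# Illustrative sketch only; recode_utf8 is a hypothetical helper,
# not a Plucker function.
def recode_utf8(data, target="cp1252"):
    """Decode UTF-8 bytes and re-encode them in a single-byte charset.

    ASSUMPTION: the target device uses something close to Windows-1252.
    Bytes that are not valid UTF-8, and characters the target charset
    lacks, come out as "?" instead of raising an exception.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode(target, errors="replace")
```

With this, the three-byte sequence "\342\200\231" becomes the single
cp1252 byte for a right single quote, with no per-character list to
maintain.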

My hacky temporary workaround just checks for a few common characters
that I empirically see often in web pages.
In TextParser.py, I define a routine (which I have around line 668
just after _find_text_split):

    # unencode -- replace UTF-8 byte sequences the palm won't understand
    # (things like smartquote characters) with plain ASCII.
    def unencode (self, line):
        line = line.replace ("\342\200\224", "--")   # em dash
        line = line.replace ("\342\200\230", "`")    # left single quote
        line = line.replace ("\342\200\231", "'")    # right single quote
        line = line.replace ("\342\200\234", "\"")   # left double quote
        line = line.replace ("\342\200\235", "\"")   # right double quote
        return line

Then a few lines later, in add_text, add this line just after the
"while 1:" and before the line
"new_size = self._approximate_size + len (line)":

                line = self.unencode(line)
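
The same idea can be tried outside Plucker as a standalone function
driven by a table, which makes adding new codes a one-line change.
This is only a sketch: the names REPLACEMENTS and unencode_bytes are
hypothetical, and it assumes the page text is available as raw UTF-8
bytes.

```python
# Standalone sketch of the workaround; not Plucker code.
# Each entry maps a UTF-8 byte sequence to an ASCII stand-in.
REPLACEMENTS = [
    (b"\342\200\224", b"--"),   # em dash
    (b"\342\200\230", b"`"),    # left single quote
    (b"\342\200\231", b"'"),    # right single quote
    (b"\342\200\234", b'"'),    # left double quote
    (b"\342\200\235", b'"'),    # right double quote
]

def unencode_bytes(line):
    """Replace each known UTF-8 sequence in `line` (bytes) with ASCII."""
    for utf8_seq, ascii_seq in REPLACEMENTS:
        line = line.replace(utf8_seq, ascii_seq)
    return line
```

For example, unencode_bytes(b"don\342\200\231t") gives b"don't".
Sequences that are not in the table pass through unchanged.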

You can add new codes as you encounter them.  The easiest way to find
the offending codes is to wget the html file and load it in emacs.
vi shows the codes as characters, and od -xc doesn't line up the
characters and the numeric codes, but emacs shows numeric codes inline.
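
If emacs isn't handy, a short script can list the multi-byte sequences
in a downloaded page along with their octal codes.  This is a sketch,
not a Plucker tool; the function names are made up, and the pattern
assumes the page is UTF-8 (a lead byte followed by one to three
continuation bytes).

```python
import re
from collections import Counter

# A UTF-8 lead byte followed by 1-3 continuation bytes.
UTF8_SEQ = re.compile(rb"[\xc2-\xf4][\x80-\xbf]{1,3}")

def find_utf8_sequences(data):
    """Count each multi-byte sequence found in `data` (bytes)."""
    return Counter(UTF8_SEQ.findall(data))

def octal_escape(seq):
    """Format bytes as octal escapes, e.g. a backslash plus 342."""
    return "".join("\\%03o" % byte for byte in seq)

def report(data):
    """Return one 'count  octal-code' line per sequence, commonest first."""
    counts = find_utf8_sequences(data)
    return ["%5d  %s" % (n, octal_escape(seq))
            for seq, n in counts.most_common()]
```

Read the file fetched with wget in binary mode, e.g.
report(open("page.html", "rb").read()), and print the resulting lines.
The octal strings can be pasted straight into the replace calls above.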

Question for the Plucker developers: was add_text a reasonable
place to put this?  I'm not that familiar with the parser
code -- is there a better place to add code like this?

A good URL for testing: http://riverbendblog.blogspot.com/

A simpler test (I was surprised to see ESR using "\342\200\224"
for em dashes): http://catb.org/~esr/faqs/smart-questions.html

        ...Akkana