Hello.

My sister is an iBook user who owns a Tungsten E2 Palm. She is a
doctor and wants a way to create notes on her computer and
transfer them to the Palm for use at work, when she is visiting
patients and doesn't have access to the computer. This wouldn't
be difficult if it weren't for the fact that sometimes she needs
to extract images from other medical papers provided only in
.pdf form.

I remember having used plucker ages ago when I got a Palm III,
so I tried using that because AFAIK it's the only software which
allows creating browsable and searchable documents, including
images, in a nice compact way.

Right now, she has been using Apple's Pages software to create
her documents, and this software has an export as HTML option. I
decided to create a directory where she would create the exported
HTML versions, then run desktop plucker to pick up whatever file
she had created and presto. But I have faced a few problems.

I am not attaching the problematic files here due to size reasons. I
have uploaded them to http://gradha.sdf-eu.org/plucker.tar.bz2, 262
KiB, where you can get all the files I'll be referencing from now on.

First of all, there are internationalisation problems. The
document I tried to export has the name Behçet. When I added the
exported HTML file (bad/Behçet.html), plucker generated the log
PlkrLog_2006-06-25_145515.txt, whose most interesting part is:

 Processing file:/Users/karolina/Documents/Plucker/Behçet.html...
   Retrieval failed: 404 -- [Errno 22] Invalid argument: 
'/Users/karolina/Documents/Plucker/Beh\xe7et.html'.
 Error:  Fetching the home document failed.  Aborting all!

Looks like plucker doesn't like non-ASCII characters in file
names. Oh well. I replaced the ç in the name with an s
(bad/Behset.html) and tried again.  This time, plucker died
while parsing the document's image
(PlkrLog_2006-06-25_145629-i.txt):

 Processing file:/Users/karolina/Documents/Plucker/B.....roppedImage.png...
   Retrieved ok.
 /Applications/Plucker.app/Contents/Resources/parser/python/vm/PIL/Image.py:53: 
RuntimeWarning: Python C API version mismatch for module _imaging: This Python 
has API ver
   import _imaging
 Error:  Unknown error parsing document 
file:/Users/karolina/Documents/Plucker/Behset_files/droppedImage.png:
 Traceback (most recent call last):
   File 
"/Applications/Plucker.app/Contents/Resources/parser/python/PyPlucker/Parser.py",
 line 46, in generic_parser
     return parsed.get_plucker_doc ()
   File 
"/Applications/Plucker.app/Contents/Resources/parser/python/PyPlucker/ImageParser.py",
 line 210, in get_plucker_doc
     (width, height, depth, section), limits, scaling_factor = 
self.calculate_desired_size()
   File 
"/Applications/Plucker.app/Contents/Resources/parser/python/PyPlucker/ImageParser.py",
 line 142, in calculate_desired_size
     width = int(w)
 ValueError: invalid literal for int(): 481.00pt
   Parsing failed.

While this time plucker had created some sort of text only version
which I could read in the palm, the images were not there, surely
because the parsing of the pt size unit is giving plucker troubles. I
have uploaded the Behset.html file to http://validator.w3.org/check
and the validator accepts it as valid XHTML 1.0 transitional,
even though the tidy software also complains about the pt size units.
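
The ValueError in the traceback comes from int() being handed a
string that still carries its unit. A minimal Python sketch of a
pre-processing step that strips the unit follows. It assumes the
sizes appear as width="..." and height="..." attributes (I have not
checked every form Pages can emit), and it treats points as pixels
one to one, which is a simplification. The name strip_pt_units is
made up for this example.

```python
import re

# Assumption: sizes look like width="481.00pt" or height="300.00pt".
# The real Pages export may also use other forms (for example CSS).
_PT_SIZE = re.compile(r'\b(width|height)="(\d+(?:\.\d+)?)pt"')


def strip_pt_units(html):
    """Rewrite pt-suffixed width/height values as plain integers."""
    def to_int(match):
        # Points are mapped to pixels 1:1 here, which is a simplification.
        value = int(round(float(match.group(2))))
        return '%s="%d"' % (match.group(1), value)
    return _PT_SIZE.sub(to_int, html)
```

Values without a pt suffix, such as percentages, are left untouched.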

Not giving up, I manually edited the file to remove all pt size
suffixes and replucked the file.  This time, all the processing
went without errors and I could see images... but all of them
were black!  The pdf extractions are exported by Pages as png
with some transparent background color or something like that. The
plucker conversion takes the images and fills the transparent color
with black, and well, black over black doesn't look very good. In
fact, the only traces of text I saw on the images were the
letters' antialiasing artifacts, which were a shade of grey.
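
A sketch of how the transparency could be filled with a light
colour before plucking. composite_over shows the per-pixel
arithmetic, and flatten_image does the equivalent for a whole file
with PIL. Both names are made up for this example, white is only a
default, and PIL's own rounding may differ by one level from the
formula shown.

```python
def composite_over(pixel, background=(255, 255, 255)):
    """Blend one (r, g, b, a) pixel over an opaque background colour."""
    r, g, b, a = pixel
    # Standard "over" blend with 8-bit alpha, rounded to nearest.
    return tuple((c * a + bg * (255 - a) + 127) // 255
                 for c, bg in zip((r, g, b), background))


def flatten_image(src_path, dst_path, background=(255, 255, 255)):
    """Save a copy of src_path with its transparent areas filled in."""
    from PIL import Image  # imported lazily; needs PIL or Pillow

    image = Image.open(src_path).convert("RGBA")
    canvas = Image.new("RGB", image.size, background)
    # The alpha band is used as the paste mask, so transparent
    # pixels keep the background colour of the canvas.
    canvas.paste(image, mask=image.split()[3])
    canvas.save(dst_path)
```

Passing a different background tuple would let the images match
whatever page colour the viewer is set to use.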

Finally, my sister writes her texts in Spanish or English. Pages
exports the non-ASCII characters as numeric character references
of the form &#xblahblah;. Plucker doesn't like this, and instead
of showing the correct character in the viewer, it displays the
reference itself. After some head scratching I processed the HTML
with tidy to force a conversion of the references to latin1, and
voilà, Spanish characters do show up properly now.
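
For the record, the same conversion can be sketched in Python
without tidy. This is not what tidy does internally, just a regular
expression that expands numeric references falling in the latin1
range (0xA0 to 0xFF) and leaves everything else alone, including
escaped markup characters such as &#38;. The function name is made
up for this example.

```python
import re

# Matches hexadecimal (&#xe7;) and decimal (&#231;) references.
_NUMERIC_REF = re.compile(r'&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));')


def expand_latin1_refs(html):
    """Replace numeric references in the latin1 range by the character."""
    def expand(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if 0xA0 <= code <= 0xFF:
            return chr(code)
        # ASCII and non-latin1 references stay as they were.
        return match.group(0)
    return _NUMERIC_REF.sub(expand, html)
```

The result would then have to be written out encoded as latin1
(for example open(path, "w", encoding="latin-1")), with the charset
declared in the HTML adjusted to match.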

Well, time to roll my own Pages-to-plucker converter then. To
address all the problems outlined above I created a python
script. The idea would be to introduce an extra step in my sister's
plucking process. After the HTML export she would run this
script, which would convert all the HTML to a form which plucker
processes in a better way.

The script file (fix-plucker-vs-pages.py) does:

  1. Recode the HTML with tidy to use latin1 characters instead of
     unicode escape sequences.
  2. Remove the pt image size suffixes.
  3. Process all the images with PIL and recreate them without an
     alpha channel and with a white background.

The result of a file processed with this script is included in
the root of the package (Behset.html). However, at the point I
got this working, I couldn't use it on the Mac because PIL is not
included there and I didn't know what to download to get it, as I
do on Linux. My sister had to go (she was on a short holiday trip)
and I won't see her again for a few months (and she has no
internet access either).

I would welcome any suggestion about PIL on Mac OS X. Other than
that, I think it would be good if desktop plucker didn't have these
mostly silly errors. The image thing is more of a specific thing,
which could also happen with transparent gifs, I guess. I see
that plucker desktop has an option to specify the background of
the text. Hopefully the images could use this value (or whatever
background color of the image) to show them properly in the palm.

Again, here's what comes in the .tar.bz2 mentioned above:

  Behçet.pages

    Original pages document as saved by the Apple Pages software in
    bundled format. My sister says it is ok for you to distribute
    the file and use in any form. She wrote the text, and the small
    extractions probably fall under copyright's fair use policy.

  bad/Behçet_files
  bad/Behçet.html
  bad/PlkrLog_2006-06-25_145515.txt

    First problem, parsing of non ascii characters.

  bad/Behset_files
  bad/Behset.html
  bad/PlkrLog_2006-06-25_145629-i.txt

    Second problem, parsing of pt image unit size.

  fix-plucker-vs-pages.py

    The python script I wrote.

  Behset.html
  Behset_files

    The Pages HTML exportation after being run through the script
    above, with correct images and stuff, ready to be plucked.

My development time is measured in minutes per month (see the date
of the plucker logs, it took me that long just to subscribe and
write this mail!), so I would like to know if you plan on fixing
this in your software without requiring me to go deep into the
source code and send patches. And if you don't plan to, I would
like to know that too.
_______________________________________________
plucker-dev mailing list
plucker-dev@rubberchicken.org
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
