Hello. My sister is an iBook user who owns a Palm Tungsten E2. She is a doctor, and wants some way of creating notes on her computer and transferring them to the Palm for use at work, when she is visiting patients and doesn't have access to the computer. This wouldn't be difficult if it weren't for the fact that sometimes she needs to extract images from other medical papers which are provided only in .pdf form.

I remember having used plucker ages ago when I got a Palm III, so I tried using that because AFAIK it's the only software which allows creating browsable and searchable documents, including images, in a nice compact way. Right now she has been using Apple's Pages software to create her documents, and this software has an "export as HTML" option. I decided to create a directory where she would put the exported HTML versions, then run desktop plucker to pick up whatever files she had created, and presto. But I have faced a few problems.

I am not attaching the problematic files here due to size reasons. I have uploaded them to http://gradha.sdf-eu.org/plucker.tar.bz2 (262 KiB), where you can get all the files I'll be referencing from now on.

First of all, there are internationalisation problems. The document I tried to export has the name Behçet. When I added the exported HTML file (bad/Behçet.html), plucker generated the log PlkrLog_2006-06-25_145515.txt, whose most interesting part is:

    Processing file:/Users/karolina/Documents/Plucker/Behçet.html...
    Retrieval failed: 404 -- [Errno 22] Invalid argument: '/Users/karolina/Documents/Plucker/Beh\xe7et.html'.
    Error: Fetching the home document failed. Aborting all!

Looks like plucker doesn't like non-ASCII characters in file names. Oh well. I replaced the French letter (bad/Behset.html) and tried again. This time, plucker died while parsing (PlkrLog_2006-06-25_145629-i.txt):

    Processing file:/Users/karolina/Documents/Plucker/B.....roppedImage.png... Retrieved ok.
    /Applications/Plucker.app/Contents/Resources/parser/python/vm/PIL/Image.py:53: RuntimeWarning: Python C API version mismatch for module _imaging: This Python has API ver
      import _imaging
    Error: Unknown error parsing document file:/Users/karolina/Documents/Plucker/Behset_files/droppedImage.png:
    Traceback (most recent call last):
      File "/Applications/Plucker.app/Contents/Resources/parser/python/PyPlucker/Parser.py", line 46, in generic_parser
        return parsed.get_plucker_doc ()
      File "/Applications/Plucker.app/Contents/Resources/parser/python/PyPlucker/ImageParser.py", line 210, in get_plucker_doc
        (width, height, depth, section), limits, scaling_factor = self.calculate_desired_size()
      File "/Applications/Plucker.app/Contents/Resources/parser/python/PyPlucker/ImageParser.py", line 142, in calculate_desired_size
        width = int(w)
    ValueError: invalid literal for int(): 481.00pt
    Parsing failed.

While this time plucker had created some sort of text-only version which I could read on the Palm, the images were not there, surely because the pt size unit is giving plucker's parser trouble. I uploaded the Behset.html file to http://validator.w3.org/check and the validator accepts it as valid XHTML 1.0 Transitional, even though the tidy software also complains about the pt size units.

Not giving up, I manually edited the file to remove all pt size suffixes and replucked it. This time all the processing went without errors and I could see images... but all of them were black! The PDF extractions are exported by Pages as PNG with a transparent background color or something like that. The plucker conversion takes the images and fills the transparent color with black, and well, black over black doesn't look very good. In fact, the only traces of text I saw in the images were the letters' antialiasing artifacts, which were a shade of grey.

Finally, my sister writes her texts in Spanish or English.
Pages exports the non-ASCII characters as Unicode escape sequences of the form &#blahblah;. Plucker doesn't like this, and instead of showing the correct character in the viewer, it displays the escape sequence. After some head scratching I processed the HTML with tidy to force a conversion of the escape sequences to latin1, and voilà, Spanish characters do show up properly now.

Well, time to roll my own Pages to plucker converter then. To address all the problems outlined above I created a Python script. The idea is to introduce an extra step in my sister's plucking process: after the HTML export she would run this script, which converts all the HTML to a form which plucker processes better. The script (fix-plucker-vs-pages.py) does the following:

1. Recode the HTML with tidy to use latin1 characters instead of Unicode escape sequences.
2. Remove the pt suffixes from the image sizes.
3. Process all the images with PIL and recreate them without an alpha channel, on a white background.

The result of processing a file with this script is included in the root of the package (Behset.html). However, by the time I got this working I couldn't use it on the Mac, because PIL is not distributed there and I didn't know what to download to get it, as I do in Linux. My sister had to go (she was on a short holiday trip) and I won't see her again for a few months (no internet either). I would welcome any suggestion about PIL on Mac OS X.

Other than that, I think it would be good if desktop plucker didn't have these mostly silly errors. The image thing is more of a specific case, which could also happen with transparent GIFs, I guess. I see that plucker desktop has an option to specify the background of the text. Hopefully the images could use this value (or whatever the background color of the image is) to show up properly on the Palm.

Again, here's what comes in the .tar.bz2 mentioned above:

Behçet.pages
    Original Pages document as saved by the Apple Pages software in bundled format.
    My sister says it is OK for you to distribute the file and use it in any form. She wrote the text, and the small extractions probably fall under copyright's fair use policy.

bad/Behçet_files
bad/Behçet.html
bad/PlkrLog_2006-06-25_145515.txt
    First problem, parsing of non-ASCII characters.

bad/Behset_files
bad/Behset.html
bad/PlkrLog_2006-06-25_145629-i.txt
    Second problem, parsing of the pt image size unit.

fix-plucker-vs-pages.py
    The Python script I wrote.

Behset.html
Behset_files
    The Pages HTML export after being run through the script above, with correct images and stuff, ready to be plucked.

My development time is measured in minutes per month (see the date of the plucker logs, it took me that long just to subscribe and write this mail!), so I would like to know if you plan on fixing this in your software without requiring me to go deep into the source code and send patches. And if you don't plan to, I would like to know that too.

_______________________________________________
plucker-dev mailing list
plucker-dev@rubberchicken.org
http://lists.rubberchicken.org/mailman/listinfo/plucker-dev
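
P.S. For anyone who would rather not download the archive, the three workarounds described in this mail could be sketched roughly as below. This is a minimal sketch and not the contents of fix-plucker-vs-pages.py: the function names are hypothetical, step 1 uses a plain regular expression instead of tidy, step 2 assumes Pages writes the sizes as width="..." and height="..." attributes, and step 3 assumes PIL (or its Pillow fork) can be imported.

```python
import re

# Matches numeric character references such as &#231; or &#xE7;
_NUMERIC_REF = re.compile(r"&#([xX][0-9a-fA-F]+|[0-9]+);")

# Matches width="481.00pt" or height="20pt" (assumed attribute form)
_PT_SIZE = re.compile(r'\b(width|height)="(\d+)(?:\.\d+)?pt"')


def decode_numeric_refs(html_text):
    """Replace numeric references in the latin1 range with the character.

    ASCII references (e.g. &#60; for '<') are left alone so the markup is
    not broken, and anything above 255 is left alone because latin1
    cannot represent it.
    """
    def replace(match):
        ref = match.group(1)
        code = int(ref[1:], 16) if ref[0] in "xX" else int(ref)
        if code < 128 or code > 255:
            return match.group(0)
        return chr(code)

    return _NUMERIC_REF.sub(replace, html_text)


def strip_pt_units(html_text):
    """Turn width="481.00pt" into width="481".

    The fractional part is dropped as well (an assumption on my side),
    so that the remaining value is something int() can parse.
    """
    return _PT_SIZE.sub(r'\1="\2"', html_text)


def flatten_on_white(src_path, dst_path):
    """Recreate an image without alpha channel, on a white background."""
    from PIL import Image  # imported here so the text helpers work without PIL

    with Image.open(src_path) as img:
        rgba = img.convert("RGBA")
    background = Image.new("RGB", rgba.size, (255, 255, 255))
    # The alpha band is used as the paste mask: transparent pixels stay white.
    background.paste(rgba, mask=rgba.split()[3])
    background.save(dst_path)
```

The two text helpers would be applied to the exported HTML before saving it as latin1, and flatten_on_white() to every PNG in the exported _files directory. White is hardcoded here; a real fix inside plucker would presumably use the configured background color instead.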