I've been playing with extracting images. I've found a few 'weirdnesses' (I know, that's not a real word) in the org.apache.pdfbox.ExtractText class, and if I can clear some time, I'll try to submit something on that.
Ignoring the 'weirdnesses' (which have more to do with option parsing and file naming), it does successfully extract images to separate files. However, the color table is apparently not being handled properly: all the images end up displaying with the default Windows palette, which suggests they are missing their own. I assume the color space needs to be rebuilt and set on each image object before the image is written out to file, but I'm not entirely certain how to proceed with that. Is anybody familiar with PDXObjectImage and its related APIs? If someone can point me in the right direction, I don't mind doing the work of fixing this.

Mel

Dr. Mel Martinez
[email protected]
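
To make the idea of attaching a color table before writing concrete, here is a minimal sketch using only the standard java.awt.image classes. It does not use PDXObjectImage or any other PDFBox API. The class name PaletteSketch, the method names, and the assumption that the palette arrives as packed RGB bytes (three bytes per entry) with one index byte per pixel are all hypothetical and chosen for illustration.

```java
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox: shows how an explicit palette
// can be attached to indexed pixel data before the image is written out.
class PaletteSketch {

    // indices:    one byte per pixel, each an index into the palette (assumed layout)
    // rgbPalette: packed R,G,B bytes, three per palette entry (assumed layout)
    static BufferedImage toIndexedImage(byte[] indices, int width, int height,
                                        byte[] rgbPalette) {
        int colorCount = rgbPalette.length / 3;

        // 8 bits per pixel, no alpha channel, palette read from offset 0.
        IndexColorModel model =
                new IndexColorModel(8, colorCount, rgbPalette, 0, false);

        // Passing the model here is what keeps the image from falling back
        // to a default palette.
        BufferedImage image = new BufferedImage(
                width, height, BufferedImage.TYPE_BYTE_INDEXED, model);

        // Copy the raw palette indices straight into the raster.
        image.getRaster().setDataElements(0, 0, width, height, indices);
        return image;
    }

    // Writes the image as PNG, which stores the palette along with the pixels.
    static void writePng(BufferedImage image, File target) throws IOException {
        ImageIO.write(image, "png", target);
    }
}
```

If the extracted images really are indexed, the open question is where the palette bytes and the index data can be read from on the image object. The sketch only covers the step after that.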
