In the interest of self-criticism, I thought I'd post some of the things that I think need work in the new content parser implementation:
1. The CMapAwareDocumentFont, I suspect, is an ugly, ugly hack that would be better handled in a different way. It seems to me that almost all fonts have CMap-type behavior - the ToUnicode tag is just one flavor of it. Perhaps I should have named the class ToUnicodeAwareDocumentFont - but I think that ultimately the correct solution is going to be to build a CMap equivalent for any font object, regardless of whether ToUnicode is being used or the font's internal CMap is being used (a rough sketch of this idea follows the list). I wonder if incorporating the actual CMap object into the fonts might be the way to approach this. What makes this a bit tricky is that the iText font class implementations are really geared more towards PDF generation than PDF parsing (completely expected) - I don't particularly want to introduce code to those classes just to support parsing in the small number of scenarios where it will be occurring - but it seems like the right approach... Along these lines, I know for certain that the current extractor does not properly handle PDFs generated by MS Word that use forward and backward ticks (the font is TrueType TimesNewRomanPSMT with WinAnsiEncoding). There is no ToUnicode map for these fonts, but the Unicode values I derive from the encoding don't correspond to the actual tick marks (I get a trademark (TM) symbol for the tick, for example). I suspect the problem is that I am just blindly assuming the encoding is standard Unicode when no ToUnicode map is specified.

2. Spatial analysis is fairly limited right now. For example, if content appears early on the page but later in the content stream, the order of the extracted text is not consistent with the visual representation. For our particular use of the parser this is not an issue, but I could see where it might be important. Fixing this would be relatively easy - tag each output line with its Y position, then order the array before converting to a string (see the sorting sketch after the list).

3. Vertical orientation of text is not handled at all. At this point, I'm not entirely sure how to even detect that a content stream is performing vertical rendering (maybe this is part of the font metrics?).

4. Content included from external objects may not be handled properly. The canonical example here is adding 'Page X of Y' to the bottom of each PDF page - the value for 'Y' is added as an external XObject. I have done no testing with this, but it's quite possible that the reconstruction of the phrase 'Page 3 of 7' might not work properly here. We might get 'Page 3' in one place in the text, then '7' in another. This comes back to spatial analysis.

5. The algorithm for determining word separation is not as robust as it should be. For example, if the font doesn't specify a width for character 32, the algorithm fails entirely. Also, is dividing the char 32 width by 2 appropriate? And what character/word spacing adjustments should really be made to that width? (A hedged fallback sketch follows the list.)

6. Is the overall architecture of the parser appropriate? Specifically, is passing the ending text matrix into displayText() the best way to detect whether the next string is part of the previous string or not?

7. Is the use of floats appropriate, or should we be using int (or long) and scaling everything by 1000? I used float primarily because I was concerned about overflow of the Matrix entries - but the current implementation is certainly slower than it could be. (A fixed-point sketch follows the list.)
8. Are there any gross errors being made in reading objects from the PdfReader? For example, have I made any mistakes in loading the content stream in PdfTextExtractor#getContentBytesForPage()? How about the way I read the resource dictionary in PdfTextExtractor#getTextFromPage() - should I be doing anything to ensure that these resources don't consume memory after the page has been processed?

9. How should unit testing be configured for this functionality? It seems like we will wind up needing to tune some of these algorithms as users find situations where the text parsing doesn't work properly, and I think it's important to ensure that such tuning doesn't break previously working documents. I'm currently thinking along the lines of a test documents folder containing each PDF alongside a .txt file with the expected extraction results; the test would then verify that the extraction matches. I'm not sure how best to handle multiple pages with this - maybe a separate .txt file for each page? (A sketch of such a test follows below.)
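As a rough illustration of the 'CMap equivalent for any font' idea from item 1 - this is only a sketch, and FontDecoder, its map, and the fallback are all invented names, not anything in iText today:

    import java.util.HashMap;
    import java.util.Map;

    // Sketch only: one decoding table per font, populated either from a
    // ToUnicode CMap or synthesized from the declared encoding
    // (WinAnsiEncoding, the font's internal CMap, etc.), so the decode
    // path never has to care where the mapping came from.
    public class FontDecoder {
        private final Map<Integer, String> codeToUnicode = new HashMap<Integer, String>();

        public void addMapping(int code, String unicode) {
            codeToUnicode.put(code, unicode);
        }

        public String decode(byte[] raw) {
            StringBuilder sb = new StringBuilder();
            for (byte b : raw) {
                int code = b & 0xff;
                String s = codeToUnicode.get(code);
                // last-resort fallback; a real implementation would need
                // to handle multi-byte codes and missing entries properly
                sb.append(s != null ? s : String.valueOf((char) code));
            }
            return sb.toString();
        }
    }

Populating that table from the right source per font type is where the real work would be, of course.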
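For item 2, a minimal sketch of the Y-sorting idea - TextChunk and the sorter are hypothetical names, and this assumes horizontal text with Y increasing upward as in PDF user space:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    // Sketch only: tag each extracted line with its Y position, then sort
    // top-to-bottom before flattening to a single string.
    public class SpatialSorter {
        public static class TextChunk {
            final float y;
            final String text;
            public TextChunk(float y, String text) { this.y = y; this.text = text; }
        }

        public static String toOrderedString(List<TextChunk> chunks) {
            List<TextChunk> sorted = new ArrayList<TextChunk>(chunks);
            // PDF user space has Y increasing upward, so descending Y
            // corresponds to top-of-page-first reading order
            Collections.sort(sorted, new Comparator<TextChunk>() {
                public int compare(TextChunk a, TextChunk b) {
                    return Float.compare(b.y, a.y);
                }
            });
            StringBuilder sb = new StringBuilder();
            for (TextChunk c : sorted) {
                sb.append(c.text).append('\n');
            }
            return sb.toString();
        }
    }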
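For item 5, one possible fallback when the font has no width for character 32 - everything here (ParsedFont, getAverageGlyphWidth, the 0.5 factor) is an assumption, the factor in particular being the same guess the current code already makes:

    // Sketch only: never let a missing space width kill word detection.
    public class WordSpacing {
        // stand-in for whatever font abstraction the parser ends up with
        public interface ParsedFont {
            float getWidth(int charCode);   // 0 if the font omits it
            float getAverageGlyphWidth();   // invented fallback source
        }

        public static float spaceThreshold(ParsedFont font,
                                           float charSpacing,
                                           float wordSpacing) {
            float spaceWidth = font.getWidth(32);
            if (spaceWidth <= 0) {
                // the font doesn't declare a width for the space
                // character; approximate it rather than failing entirely
                spaceWidth = font.getAverageGlyphWidth();
            }
            // halving is the current heuristic; Tc/Tw from the graphics
            // state are folded in, though the right adjustment is open
            return spaceWidth / 2f + charSpacing + wordSpacing;
        }
    }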
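For item 7, the scaled-integer alternative in miniature - a sketch assuming widths are kept in thousandths of a text-space unit (the same scale the PDF spec uses for glyph widths); using long keeps the intermediate products from overflowing:

    // Sketch only: fixed-point arithmetic scaled by 1000.
    public class Fixed1000 {
        static final long SCALE = 1000;

        // e.g. mul(1500, 2000) == 3000, i.e. 1.5 * 2.0 = 3.0
        static long mul(long a, long b) {
            return a * b / SCALE;
        }

        static long add(long a, long b) {
            return a + b;
        }

        static float toFloat(long v) {
            return v / (float) SCALE;
        }
    }

Whether the integer version actually wins on a modern VM is something I'd want to measure before committing to it.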
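And for item 9, a sketch of the proposed testdocs layout - one expected .txt per page, named foo.1.txt, foo.2.txt, and so on. The folder layout and file naming are inventions, and I'm assuming PdfTextExtractor is constructed from a PdfReader as in the current code:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FilenameFilter;
    import java.io.IOException;
    import junit.framework.TestCase;
    import com.lowagie.text.pdf.PdfReader;
    import com.lowagie.text.pdf.parser.PdfTextExtractor;

    // Sketch only: for every PDF in testdocs/, compare each page's
    // extraction against a sibling <name>.<page>.txt file.
    public class ExtractionRegressionTest extends TestCase {

        public void testExtractionMatchesExpectedText() throws Exception {
            File dir = new File("testdocs");
            File[] pdfs = dir.listFiles(new FilenameFilter() {
                public boolean accept(File d, String name) {
                    return name.endsWith(".pdf");
                }
            });
            for (File pdf : pdfs) {
                PdfReader reader = new PdfReader(pdf.getAbsolutePath());
                PdfTextExtractor extractor = new PdfTextExtractor(reader);
                for (int page = 1; page <= reader.getNumberOfPages(); page++) {
                    String base = pdf.getName().substring(0, pdf.getName().length() - 4);
                    File expected = new File(dir, base + "." + page + ".txt");
                    assertEquals(pdf.getName() + " page " + page,
                            readFile(expected), extractor.getTextFromPage(page));
                }
                reader.close();
            }
        }

        private static String readFile(File f) throws IOException {
            StringBuilder sb = new StringBuilder();
            BufferedReader r = new BufferedReader(new FileReader(f));
            try {
                String line;
                while ((line = r.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            } finally {
                r.close();
            }
            return sb.toString();
        }
    }

Exact whitespace matching may turn out to be too brittle in practice; normalizing line endings before comparing is probably worth it.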
Any and all comments/feedback are welcome.

- K