> ok - I just committed code with a new getCharacterRenderInfos() method. I
> have one test case as a sanity check, but definitely let me know if you have
> additional use cases that are failing, etc... (providing a test case helps
> immensely if that does happen).
>
> I have removed the character and word spacing getters for the time being
> (they really shouldn't be necessary now that we have per-glyph metrics).
>
> Let me know what you guys think,

Hello,

I'd like to thank you all for doing such a great job with iText, and in particular for adding the getCharacterRenderInfos method. It will be very helpful in my project, a system for extracting metadata and content from scientific articles in PDF format. The extraction is done in a sequence of steps, the first of which is extracting individual characters along with their positions on the page and their dimensions. Further steps include, among others, grouping individual characters into larger objects such as words, lines and zones, and determining the full reading order. The implementation of the first step is based on iText, and until now I have been using my own methods added to TextRenderInfo to obtain the necessary information about individual characters.

I have tested the getCharacterRenderInfos method on some example PDFs and noticed two potential problems.

The first one concerns the TextRenderInfo constructor, in which the textMatrix passed as an argument is multiplied by gs.ctm. As a result, subTextMatrix is multiplied by gs.ctm again for every new TextRenderInfo object created for an individual character. If I am not mistaken, this shouldn't be happening. In some cases I observed incorrect character coordinates caused by this.

The second problem concerns character width, which I compute like this: charTri.getDescentLine().getLength(). This method relies on getStringWidth, which adds character spacing to every character of the string, including the last one. It seems to me that as a result I get the distance from the beginning of the character to the beginning of the next one, instead of the actual character width (this is easy to observe when character spacing is large). I tried correcting the value returned from getStringWidth, but that produced incorrect results, because getStringWidth is also used in PdfContentStreamProcessor to compute the distance from the beginning of a character to the beginning of the next one.

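
To make the second problem concrete, here is a minimal standalone sketch of the arithmetic. It is not iText code: the class GlyphWidthSketch and its methods are hypothetical names made up for illustration. It assumes the advance of a glyph is computed as (glyph width / 1000 * font size + character spacing) * horizontal scaling, and it ignores word spacing and TJ adjustments.

```java
class GlyphWidthSketch {

    /**
     * Distance from the origin of a glyph to the origin of the next one,
     * in unscaled text space. glyphWidth1000 is the width from the font
     * metrics, expressed in 1/1000 of a text space unit.
     * Assumption: word spacing and TJ adjustments are ignored.
     */
    static double advance(double glyphWidth1000, double fontSize,
                          double charSpacing, double horizontalScaling) {
        return (glyphWidth1000 / 1000.0 * fontSize + charSpacing) * horizontalScaling;
    }

    /**
     * Extent of the glyph itself: the advance with the character spacing
     * contribution taken out again.
     */
    static double glyphExtent(double advance, double charSpacing,
                              double horizontalScaling) {
        return advance - charSpacing * horizontalScaling;
    }

    public static void main(String[] args) {
        double charSpacing = 3.0;       // deliberately large
        double horizontalScaling = 1.0; // 100%
        double adv = advance(500, 10, charSpacing, horizontalScaling);
        double ext = glyphExtent(adv, charSpacing, horizontalScaling);
        System.out.println("advance      = " + adv);
        System.out.println("glyph extent = " + ext);
    }
}
```

With a large character spacing the two values differ noticeably, which matches what I see when I measure the descent line of a single character.
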
Instead, I experimented with getUnscaledBaselineWithOffset. When I changed this line:

    return new LineSegment(new Vector(0, yOffset, 1), new Vector(getUnscaledWidth(), yOffset, 1));

to

    return new LineSegment(new Vector(0, yOffset, 1), new Vector(getUnscaledWidth() - gs.characterSpacing * gs.horizontalScaling, yOffset, 1));

the results seem to be correct.

I would be very grateful if you could comment on these issues.

Best regards,
Dominika
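
P.S. Regarding the first problem, the following standalone sketch shows why applying the CTM twice shifts coordinates. It does not use iText classes: DoubleCtmSketch and its helpers are hypothetical names, and matrices are plain row-major 3x3 arrays using the PDF convention of row vectors [x y 1]. The concrete values (a translation by (5, 5) and a uniform scale of 2) are only an example.

```java
class DoubleCtmSketch {

    /** Multiplies two row-major 3x3 matrices: result = a x b. */
    static double[] multiply(double[] a, double[] b) {
        double[] r = new double[9];
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                for (int k = 0; k < 3; k++) {
                    r[3 * i + j] += a[3 * i + k] * b[3 * k + j];
                }
            }
        }
        return r;
    }

    /** Transforms the point (x, y) as the row vector [x y 1] times m. */
    static double[] apply(double[] m, double x, double y) {
        return new double[] {
            x * m[0] + y * m[3] + m[6],
            x * m[1] + y * m[4] + m[7]
        };
    }

    static double[] translate(double tx, double ty) {
        return new double[] {1, 0, 0,  0, 1, 0,  tx, ty, 1};
    }

    static double[] scale(double s) {
        return new double[] {s, 0, 0,  0, s, 0,  0, 0, 1};
    }

    public static void main(String[] args) {
        double[] ctm = scale(2);
        double[] textMatrix = translate(5, 5);

        // Intended: the text matrix is combined with the CTM exactly once.
        double[] once = multiply(textMatrix, ctm);
        // Suspected behaviour: an already combined matrix is combined again.
        double[] twice = multiply(once, ctm);

        double[] expected = apply(once, 10, 0);
        double[] observed = apply(twice, 10, 0);
        System.out.println("CTM applied once:  (" + expected[0] + ", " + expected[1] + ")");
        System.out.println("CTM applied twice: (" + observed[0] + ", " + observed[1] + ")");
    }
}
```

When the CTM is the identity the two results coincide, which may be why the problem only shows up in some documents.
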
