[ https://issues.apache.org/jira/browse/PDFBOX-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075254#comment-15075254 ]
Tilman Hausherr commented on PDFBOX-3177:
-----------------------------------------

Why not override writeString instead?
{code}
protected void writeString(String text, List<TextPosition> textPositions) throws IOException
{code}

> Change some modifiers from private to protected in PDFTextStripper Class
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-3177
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3177
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.8.10
>         Environment: All
>            Reporter: Praveer
>             Fix For: 1.8.10
>
>
> Hi,
> I am parsing a very complicated PDF whose text is not extracted in the
> proper sequence, so I had to enable sorting with setSortByPosition(true).
> Now I want to access each TextPosition element and do some processing on
> it. Normally I would override the processTextPosition method and do my work
> there, but with setSortByPosition enabled, the code that sorts the text
> runs after processTextPosition, so overriding processTextPosition does not
> give me the text in position order.
> I did some research and found that overriding the writeLine method of
> PDFTextStripper would work for me, because it processes each TextPosition
> after they have been sorted by position.
> So I did a proof of concept on my personal computer with the following
> changes to the PDFTextStripper class:
> 1 - 'private' void writeLine() changed to 'protected'
> 2 - 'private' static final class WordWithTextPositions changed to
> 'protected'
> After this everything works as I expect. I think these changes would also
> help other people who use this library.
> I can contribute this code myself if you agree. Let me know.
> Thanks and regards,
> Praveer
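
The call order behind the writeString suggestion can be sketched without PDFBox. The classes below (`Glyph`, `SketchStripper`, `CapturingStripper`) are hypothetical stand-ins, not PDFBox's `TextPosition` or `PDFTextStripper`. They model only what the issue describes: `processTextPosition` runs once per glyph before the sort, and `writeString` runs after it. How PDFBox actually groups text between `writeString` calls is not modelled here, and the sketch assumes y grows downward. In real code the override would use the signature quoted in the comment above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortedHookDemo {

    // Hypothetical stand-in for a positioned glyph; NOT PDFBox's TextPosition.
    static final class Glyph {
        final String text;
        final double x;
        final double y; // assumed to grow downward, so a smaller y is higher on the page

        Glyph(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Hypothetical stand-in that models only the call order; NOT PDFTextStripper.
    static class SketchStripper {
        private final List<Glyph> collected = new ArrayList<>();
        private final boolean sortByPosition;

        SketchStripper(boolean sortByPosition) {
            this.sortByPosition = sortByPosition;
        }

        // Called once per glyph in content-stream order, before any sorting.
        protected void processTextPosition(Glyph glyph) {
            collected.add(glyph);
        }

        // Called after the optional sort; subclasses override this hook.
        protected void writeString(String text, List<Glyph> glyphs) {
        }

        final void run(List<Glyph> contentStream) {
            for (Glyph glyph : contentStream) {
                processTextPosition(glyph);
            }
            List<Glyph> page = new ArrayList<>(collected);
            if (sortByPosition) {
                // Top to bottom, then left to right.
                page.sort(Comparator.<Glyph>comparingDouble(g -> g.y)
                        .thenComparingDouble(g -> g.x));
            }
            // Simplification: one writeString call per run of glyphs sharing a y value.
            List<Glyph> line = new ArrayList<>();
            for (Glyph glyph : page) {
                if (!line.isEmpty() && line.get(0).y != glyph.y) {
                    emit(line);
                    line = new ArrayList<>();
                }
                line.add(glyph);
            }
            if (!line.isEmpty()) {
                emit(line);
            }
        }

        private void emit(List<Glyph> line) {
            StringBuilder text = new StringBuilder();
            for (Glyph glyph : line) {
                text.append(glyph.text);
            }
            writeString(text.toString(), line);
        }
    }

    // Records what each hook observes so the two orders can be compared.
    static class CapturingStripper extends SketchStripper {
        final List<String> seenBeforeSort = new ArrayList<>();
        final List<String> seenAfterSort = new ArrayList<>();

        CapturingStripper() {
            super(true);
        }

        @Override
        protected void processTextPosition(Glyph glyph) {
            seenBeforeSort.add(glyph.text);
            super.processTextPosition(glyph); // keep the base class bookkeeping
        }

        @Override
        protected void writeString(String text, List<Glyph> glyphs) {
            seenAfterSort.add(text);
        }
    }

    // Glyphs deliberately listed out of reading order.
    static List<Glyph> sampleStream() {
        return Arrays.asList(
                new Glyph("D", 50, 20), new Glyph("A", 10, 10),
                new Glyph("C", 10, 20), new Glyph("B", 50, 10));
    }

    public static void main(String[] args) {
        CapturingStripper stripper = new CapturingStripper();
        stripper.run(sampleStream());
        System.out.println("processTextPosition saw: " + stripper.seenBeforeSort);
        System.out.println("writeString saw: " + stripper.seenAfterSort);
    }
}
```

With the sample data, the `processTextPosition` override records the glyphs in stream order (D, A, C, B), while the `writeString` override receives them in reading order ("AB", then "CD"). The point of the sketch is that a hook invoked after the sort gives position-ordered access without widening the visibility of private members.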