[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126114#comment-14126114 ]

Amir commented on PDFBOX-2259:
------------------------------

Would you please check this issue again? Semi-spaces are very common in various non-English languages.

PDFTextStripper has problem with semi-space characters
------------------------------------------------------

Key: PDFBOX-2259
URL: https://issues.apache.org/jira/browse/PDFBOX-2259
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Reporter: Amir
Attachments: test.pdf

In some right-to-left languages, compound words are separated using a semi-space (please take a look at Unicode spaces: https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document contains these words, PDFTextStripper drops the semi-space character and concatenates the words.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089588#comment-14089588 ]

John Hewson commented on PDFBOX-2259:
-------------------------------------

No, the semi-space character isn't part of the text embedded in the PDF file. The PDF contains additional marked content for accessibility, screen readers, etc., which does contain the semi-space. Only PDFMarkedContentExtractor has access to that character. However, it seems there may be a bug in PDFMarkedContentExtractor, so you're still getting the wrong result. I'll take a look soon.
[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089657#comment-14089657 ]

Amir commented on PDFBOX-2259:
------------------------------

OK. Thank you, John. I'm looking forward to your response.
[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088233#comment-14088233 ]

John Hewson commented on PDFBOX-2259:
-------------------------------------

I'm not sure what you mean. The linked webpage doesn't contain the phrase "semi-space" anywhere. What output were you expecting? Can you paste an example?

PDFTextStripper has problem with semi-space characters
------------------------------------------------------

Key: PDFBOX-2259
URL: https://issues.apache.org/jira/browse/PDFBOX-2259
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.8.6
Reporter: Amir
Priority: Critical
Attachments: test.pdf
[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088254#comment-14088254 ]

Amir commented on PDFBOX-2259:
------------------------------

I'm not sure which Unicode character corresponds to the semi-space. I think it's a ZERO WIDTH SPACE. For example, the attached document contains نیم‌فاصله‌ها. This is a Persian word composed of نیم + فاصله + ها, joined by semi-spaces (ZERO WIDTH SPACE). The output of PDFTextStripper is نیمفاصلهها, which is incorrect.
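One way to settle which character the semi-space really is would be to dump the code points of a string that is known to contain it. In Persian text the semi-space is usually encoded as U+200C ZERO WIDTH NON-JOINER rather than U+200B ZERO WIDTH SPACE, but that is an assumption about the attached document, not something confirmed in this thread. The sketch below uses only the Java standard library; the class name `CodePointDump` and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /** Returns one "U+XXXX NAME" entry per code point of the input. */
    static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample: two Latin letters joined by U+200C.
        // Replace it with the string you want to inspect.
        for (String entry : describe("a\u200Cb")) {
            System.out.println(entry);
        }
    }
}
```

Running the same dump on the PDFTextStripper output shows whether the invisible character survived extraction at all.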
[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088351#comment-14088351 ]

Amir commented on PDFBOX-2259:
------------------------------

OK. I tried making PDFTextStripper inherit from PDFMarkedContentExtractor, but the problem still exists. Could you suggest a way to solve this kind of problem? Please provide some sample code if possible. Thanks.
[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters
[ https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088842#comment-14088842 ]

Amir commented on PDFBOX-2259:
------------------------------

Is it possible to force PDFTextStripper to replace a semi-space with a regular space?
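As a rough illustration of the replacement asked about here, the sketch below post-processes a string in plain Java, outside PDFBox. It is not a PDFTextStripper option. It only helps when the semi-space is actually present in the extracted string, which, according to John Hewson's comment in this thread, is not the case for PDFTextStripper output from the attached file. The class name `SemiSpaceNormalizer` is hypothetical, and treating U+200C and U+200B as the semi-space candidates is an assumption.

```java
public class SemiSpaceNormalizer {

    // Assumption: the "semi-space" is U+200C (ZERO WIDTH NON-JOINER)
    // or U+200B (ZERO WIDTH SPACE).
    private static final char ZWNJ = '\u200C';
    private static final char ZWSP = '\u200B';

    /** Replaces every assumed semi-space character with a regular space. */
    static String toRegularSpaces(String text) {
        return text.replace(ZWNJ, ' ').replace(ZWSP, ' ');
    }

    public static void main(String[] args) {
        // Hypothetical input standing in for text that kept its semi-spaces.
        String normalized = toRegularSpaces("a\u200Cb\u200Bc");
        System.out.println(normalized);
    }
}
```

Replacing the character with a regular space changes the word boundaries of the text, so whether this is acceptable depends on what the extracted text is used for.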