[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-09-08 Thread Amir (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14126114#comment-14126114
 ] 

Amir commented on PDFBOX-2259:
--

Would you please check this issue again? Semi-spaces are very common in 
many non-English languages.

 PDFTextStripper has problem with semi-space characters
 --

 Key: PDFBOX-2259
 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6
Reporter: Amir
 Attachments: test.pdf


 In some right-to-left languages, compound words are separated by a 
 semi-space (see the list of Unicode spaces: 
 https://www.cs.tut.fi/~jkorpela/chars/spaces.html). When the input document 
 contains such words, PDFTextStripper drops the semi-space character and 
 concatenates the words. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-08-07 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089588#comment-14089588
 ] 

John Hewson commented on PDFBOX-2259:
-

No, the semi-space character isn't part of the text embedded in the PDF file. 
The PDF contains additional marked content for accessibility, screen readers, 
etc., which does contain the semi-space. Only PDFMarkedContentExtractor has 
access to that character.

However, it seems like there may be a bug in PDFMarkedContentExtractor so 
you're still getting the wrong result. I'll take a look soon.



[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-08-07 Thread Amir (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089657#comment-14089657
 ] 

Amir commented on PDFBOX-2259:
--

OK. Thank you, John. I look forward to your response.



[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-08-06 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088233#comment-14088233
 ] 

John Hewson commented on PDFBOX-2259:
-

I'm not sure what you mean. The linked webpage doesn't contain the phrase 
"semi-space" anywhere. What output were you expecting? Can you paste an example?

 PDFTextStripper has problem with semi-space characters
 --

 Key: PDFBOX-2259
 URL: https://issues.apache.org/jira/browse/PDFBOX-2259
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.6
Reporter: Amir
Priority: Critical
 Attachments: test.pdf




[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-08-06 Thread Amir (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088254#comment-14088254
 ] 

Amir commented on PDFBOX-2259:
--

I'm not sure which Unicode character corresponds to the semi-space. I think 
it's a ZERO WIDTH SPACE. For example, the attached document contains 
نیم‌فاصله‌ها. This Persian word is a compound of نیم+فاصله+ها, joined by 
semi-spaces (ZERO WIDTH SPACE). The output of PDFTextStripper is نیمفاصلهها, 
which is incorrect.
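One way to settle which character the semi-space actually is: dump the code points of the expected string. In Persian text the half-space is normally typed as ZERO WIDTH NON-JOINER (U+200C) rather than ZERO WIDTH SPACE (U+200B). The sketch below uses only the Java standard library (Java 8 or later); the class name CodePointDump and the sample string are illustrative assumptions, not part of PDFBox or of the attached test.pdf.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDump {

    /** Returns one entry per code point, formatted as "U+XXXX UNICODE NAME". */
    public static List<String> describe(String text) {
        List<String> out = new ArrayList<>();
        text.codePoints().forEach(cp ->
                out.add(String.format("U+%04X %s", cp, Character.getName(cp))));
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample: the Persian compound from the comment, written with
        // escapes so the invisible separator (here U+200C) is visible in source.
        String sample = "\u0646\u06CC\u0645\u200C\u0641\u0627\u0635\u0644\u0647\u200C\u0647\u0627";
        for (String line : describe(sample)) {
            System.out.println(line);
        }
    }
}
```

Running the same dump on the string returned by the text extractor shows whether the separator survived extraction or was dropped.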



[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-08-06 Thread Amir (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088351#comment-14088351
 ] 

Amir commented on PDFBOX-2259:
--

OK. 

I tried subclassing PDFMarkedContentExtractor instead of PDFTextStripper, but 
the problem persists.

Could you suggest a solution for problems like this?
Please provide some sample code if possible.
Thanks.







[jira] [Commented] (PDFBOX-2259) PDFTextStripper has problem with semi-space characters

2014-08-06 Thread Amir (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088842#comment-14088842
 ] 

Amir commented on PDFBOX-2259:
--

Is it possible to force PDFTextStripper to replace a semi-space with a regular 
space?
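If the extracted string does still contain the semi-space, the replacement can be done as a post-processing step on the string, outside PDFBox. The sketch below is a minimal example under that assumption; it treats both U+200C (ZERO WIDTH NON-JOINER) and U+200B (ZERO WIDTH SPACE) as semi-space candidates, and the class and method names are hypothetical. It cannot help when the extractor has already dropped the character, which is the situation described in this issue for PDFTextStripper.

```java
import java.util.regex.Pattern;

public class SemiSpaceNormalizer {

    // Assumption: the "semi-space" is one of these two zero-width characters.
    private static final Pattern SEMI_SPACE = Pattern.compile("[\\u200B\\u200C]");

    /** Replaces every semi-space candidate with a regular space (U+0020). */
    public static String toRegularSpaces(String extracted) {
        return SEMI_SPACE.matcher(extracted).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Assumed input: a string that still carries U+200C between its parts.
        String extracted = "\u0646\u06CC\u0645\u200C\u0641\u0627\u0635\u0644\u0647";
        System.out.println(toRegularSpaces(extracted));
    }
}
```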
