[ 
https://issues.apache.org/jira/browse/PDFBOX-5090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275604#comment-17275604
 ] 

ASF subversion and git services commented on PDFBOX-5090:
---------------------------------------------------------

Commit 1886054 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1886054 ]

PDFBOX-5090: test strict mode with overflow detection

> Missing text extraction under certain conditions starting with apache pdfbox 
> 2.0.18
> -----------------------------------------------------------------------------------
>
>                 Key: PDFBOX-5090
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5090
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.18, 2.0.19, 2.0.20, 2.0.21, 2.0.22
>         Environment: jdk 1.8, apache pdfbox, fontbox 2.0.18~, windows 10
>            Reporter: sungwon kim
>            Priority: Major
>              Labels: regression
>             Fix For: 2.0.23, 3.0.0 PDFBox
>
>         Attachments: 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.pdf, 
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발.txt, 
> 128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, PDFBOX-5090_reduced.pdf, 
> textstripper_2.0.17_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.17_独立財政機関をめぐる論点整理_3p_top.PNG, 
> textstripper_2.0.18_128채널심장전기도시스템을위한3차원매핑소프트웨어개발_2p_left_botton.PNG, 
> textstripper_2.0.18_独立財政機関をめぐる論点整理_3p_top.PNG, 独立財政機関をめぐる論点整理.pdf, 
> 独立財政機関をめぐる論点整理_3p_top.PNG
>
>
> When PDFTextStripper.getText() is called with PDFBox 2.0.18 or later, it 
> fails to extract some of the text under certain conditions.
> The missing text appears to be related to the font type, the font size, or 
> the width and height of the text.
> I have attached the text extraction results from versions 2.0.17 and 
> 2.0.18, along with the sample files used for the test.
> Code:
>  
> {code:java}
> try (PDDocument pdDocument = PDDocument.load(new File(path))) {
>     PDFTextStripper stripper = new PDFTextStripper();
>     String text = stripper.getText(pdDocument); // some text is missing here since 2.0.18
> }
> {code}
> Dependencies (Maven):
>  
> {code:xml}
> <properties>
>     <apache.pdfbox.version>2.0.18</apache.pdfbox.version>
> </properties>
> <dependencies>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>pdfbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>fontbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
>     <dependency>
>         <groupId>org.apache.pdfbox</groupId>
>         <artifactId>xmpbox</artifactId>
>         <version>${apache.pdfbox.version}</version>
>     </dependency>
> </dependencies>
> {code}
>  
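
One way to pin down which text went missing between the two versions is to save the getText() output from each version to a text file and list the lines that appear only in the older output. The following is a minimal sketch under that assumption. ExtractionDiff and missingLines are hypothetical names used for illustration, not part of PDFBox, and the comparison is a plain line-by-line check that ignores leading and trailing whitespace.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of PDFBox.
class ExtractionDiff {

    /**
     * Returns the trimmed, non-empty lines of baseline that do not occur
     * anywhere in candidate, in their original order.
     */
    static List<String> missingLines(List<String> baseline, List<String> candidate) {
        Set<String> seen = new HashSet<>();
        for (String line : candidate) {
            seen.add(line.trim());
        }
        List<String> missing = new ArrayList<>();
        for (String line : baseline) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !seen.contains(trimmed)) {
                missing.add(trimmed);
            }
        }
        return missing;
    }

    // args[0]: text saved from the older version (for example 2.0.17)
    // args[1]: text saved from the newer version (for example 2.0.18)
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ExtractionDiff <baseline.txt> <candidate.txt>");
            return;
        }
        List<String> baseline = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> candidate = Files.readAllLines(Paths.get(args[1]), StandardCharsets.UTF_8);
        for (String line : missingLines(baseline, candidate)) {
            System.out.println(line);
        }
    }
}
```

Because the check is line-based, a line that is only partially extracted by the newer version is reported in full, which is usually enough to locate the affected region on the page.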



