[ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:
--------------------------------
    Description: 
On a small number of test files in a 50k sample of PDFs from govdocs1, it 
appears that some characters are no longer being extracted correctly in 1.8.7 
when compared to 1.8.6.  I ran PDFBox's app.jar with ExtractText:

{noformat}
764949.pdf
1.8.6: Lang, Astrophysical Data: Planets and Stars
1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
{noformat}

and
{noformat}
312888.pdf
1.8.6: Self-Assessment & Capability Description
1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
{noformat}
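
To make the comparison above easier to read, here is a minimal sketch (not part of PDFBox) that tabulates which characters were replaced in each sample. The helper name substitution_table is hypothetical, and the alignment is an assumption: it compares only whitespace-separated words of equal length, position by position, so a word such as "Stars" / "EfSde," is skipped.

```python
from collections import defaultdict


def substitution_table(expected, actual):
    """Map each expected character to the characters seen in its place.

    Assumption: words are aligned by splitting on whitespace, and only
    word pairs of equal length are compared character by character.
    """
    table = defaultdict(set)
    for good, bad in zip(expected.split(), actual.split()):
        if len(good) != len(bad):
            continue  # cannot align one-to-one, so skip this word pair
        for g, b in zip(good, bad):
            if g != b:
                table[g].add(b)
    return {g: sorted(bs) for g, bs in sorted(table.items())}


# (1.8.6 output, 1.8.7 output) pairs copied from the report above
samples = [
    ("Lang, Astrophysical Data: Planets and Stars",
     "Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,"),
    ("Self-Assessment & Capability Description",
     "Seff-Ammemmmehn & Cajabcfcns Demclcjncih"),
]

for expected, actual in samples:
    print(substitution_table(expected, actual))
```

Under that alignment, each affected character in these two samples maps to a single replacement within a file, but the replacement differs between files (for example, "t" becomes "f" in the first sample and "n" in the second).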

  was:
On a small number of test files in a 50k sample of PDFs from govdocs1, it 
appears that some characters are no longer being extracted correctly.  I ran 
PDFBox's app.jar with ExtractText:

{noformat}
764949.pdf
1.8.6: Lang, Astrophysical Data: Planets and Stars
1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
{noformat}

and
{noformat}
312888.pdf
1.8.6: Self-Assessment & Capability Description
1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
{noformat}


> Apparent regression in character mapping in a few files from govdocs1
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-2377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Tim Allison
>         Attachments: 312888.pdf, 764929.pdf
>
>
> On a small number of test files in a 50k sample of PDFs from govdocs1, it 
> appears that some characters are no longer being extracted correctly in 1.8.7 
> when compared to 1.8.6.  I ran PDFBox's app.jar with ExtractText:
> {noformat}
> 764949.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
