[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-11-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Affects Version/s: (was: 2.0.0)

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Fix For: 1.8.8

 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764929.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}
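
The two samples above suggest a consistent per-character substitution rather than random corruption. A minimal sketch of how one might tabulate that substitution from a pair of outputs is shown below. The class and method names (SubstitutionDiff, diff) are hypothetical helpers for illustration and are not part of PDFBox; the sketch assumes both strings have the same length, which holds for the 312888.pdf sample.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox: compares two extraction results
// character by character and lists which characters were replaced.
public class SubstitutionDiff {

    /**
     * Aligns two equal-length strings position by position and records each
     * character of `expected` that appears as a different character in `actual`.
     */
    static Map<Character, Character> diff(String expected, String actual) {
        if (expected.length() != actual.length()) {
            throw new IllegalArgumentException("strings must have the same length");
        }
        Map<Character, Character> substitutions = new TreeMap<>();
        for (int i = 0; i < expected.length(); i++) {
            char want = expected.charAt(i);
            char got = actual.charAt(i);
            if (want != got) {
                // Keep the first replacement seen for each source character.
                substitutions.putIfAbsent(want, got);
            }
        }
        return substitutions;
    }

    public static void main(String[] args) {
        // Sample strings copied from the 312888.pdf comparison in this issue.
        String v186 = "Self-Assessment & Capability Description";
        String v187 = "Seff-Ammemmmehn & Cajabcfcns Demclcjncih";
        diff(v186, v187).forEach((want, got) -> System.out.println(want + " -> " + got));
    }
}
```

A table like this can make it easier to check whether the same glyph is always mapped to the same wrong character, which would point at an encoding or character-map lookup rather than at text positioning.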



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-11-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Fix Version/s: (was: 2.0.0)

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Fix For: 1.8.8

 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764929.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Fix Version/s: 2.0.0

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7, 2.0.0
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Fix For: 1.8.8, 2.0.0

 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764929.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Fix Version/s: 1.8.8

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7, 2.0.0
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Fix For: 1.8.8, 2.0.0

 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764929.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Affects Version/s: 2.0.0

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7, 2.0.0
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Fix For: 1.8.8, 2.0.0

 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764929.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-10-09 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:

Description: 
On a small number of test files in a 50k sample of pdfs from govdocs1, it 
appears that some characters are no longer being extracted correctly in 1.8.7 
when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText

{noformat}
764929.pdf
1.8.6: Lang, Astrophysical Data: Planets and Stars
1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
{noformat}

and
{noformat}
312888.pdf
1.8.6: Self-Assessment & Capability Description
1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
{noformat}

  was:
On a small number of test files in a 50k sample of pdfs from govdocs1, it 
appears that some characters are no longer being extracted correctly in 1.8.7 
when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText

{noformat}
764949.pdf
1.8.6: Lang, Astrophysical Data: Planets and Stars
1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
{noformat}

and
{noformat}
312888.pdf
1.8.6: Self-Assessment & Capability Description
1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
{noformat}


 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764929.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-27 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler updated PDFBOX-2377:
---
Attachment: PDFBOX2247-701542.pdf

Attached the file from PDFBOX-2247, as the original link is broken.

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf, 
 PDFBOX2247-701542.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-26 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2377:

Attachment: 357094-1.8.8.txt
357094-1.8.6.txt
357094.pdf

Same problem for 357094.pdf

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Priority: Minor
  Labels: regression
 Attachments: 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 
 357094.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-26 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated PDFBOX-2377:

Attachment: 290991-8.txt
290991-7.txt
290991-6.txt
290991.pdf

290991.pdf is almost good again - except for : where there is a ..

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Assignee: Andreas Lehmkühler
Priority: Minor
  Labels: regression
 Attachments: 290991-6.txt, 290991-7.txt, 290991-8.txt, 290991.pdf, 
 312888.pdf, 357094-1.8.6.txt, 357094-1.8.8.txt, 357094.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:

Description: 
On a small number of test files in a 50k sample of pdfs from govdocs1, it 
appears that some characters are no longer being extracted correctly in 1.8.7 
when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText

{noformat}
764949.pdf
1.8.6: Lang, Astrophysical Data: Planets and Stars
1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
{noformat}

and
{noformat}
312888.pdf
1.8.6: Self-Assessment & Capability Description
1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
{noformat}

  was:
On a small number of test files in a 50k sample of pdfs from govdocs1, it 
appears that some characters are no longer being extracted correctly.  I ran 
pdfbox's app.jar with ExtractText

{noformat}
764949.pdf
1.8.6: Lang, Astrophysical Data: Planets and Stars
1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
{noformat}

and
{noformat}
312888.pdf
1.8.6: Self-Assessment & Capability Description
1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
{noformat}


 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
Reporter: Tim Allison
 Attachments: 312888.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:

Attachment: 312888.pdf
764929.pdf

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
Reporter: Tim Allison
 Attachments: 312888.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly.  I ran 
 pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:

Affects Version/s: 1.8.7

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.7
Reporter: Tim Allison
Priority: Minor
 Attachments: 312888.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:

Priority: Minor  (was: Major)

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
Affects Versions: 1.8.7
Reporter: Tim Allison
Priority: Minor
 Attachments: 312888.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}





[jira] [Updated] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

2014-09-22 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-2377:

Component/s: Text extraction

 Apparent regression in character mapping in a few files from govdocs1
 -

 Key: PDFBOX-2377
 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
 Project: PDFBox
  Issue Type: Bug
  Components: Text extraction
Affects Versions: 1.8.7
Reporter: Tim Allison
Priority: Minor
 Attachments: 312888.pdf, 764929.pdf


 On a small number of test files in a 50k sample of pdfs from govdocs1, it 
 appears that some characters are no longer being extracted correctly in 1.8.7 
 when compared to 1.8.6.  I ran pdfbox's app.jar with ExtractText
 {noformat}
 764949.pdf
 1.8.6: Lang, Astrophysical Data: Planets and Stars
 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
 {noformat}
 and
 {noformat}
 312888.pdf
 1.8.6: Self-Assessment & Capability Description
 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
 {noformat}


