[jira] [Commented] (PDFBOX-3058) Support TIKA Migration to PDFBox 2.0

John Hewson (JIRA) Fri, 30 Oct 2015 12:00:59 -0700

    [ https://issues.apache.org/jira/browse/PDFBOX-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983087#comment-14983087 ]


John Hewson commented on PDFBOX-3058:
-------------------------------------

The results of PDFBox's text extraction shouldn't vary with language (there's 
no language-specific code in PDFTextStripper), so we should be fine just 
testing on English. One exception is RTL scripts, which would be worth testing 
separately. But there should be no need for a per-language breakdown.

> Support TIKA Migration to PDFBox 2.0
> ------------------------------------
>
>                 Key: PDFBOX-3058
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3058
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Maruan Sahyoun
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json, 
> NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json, content_diffs-1.8-to-2.0.xlsx
>
>
> This issue is to track fixing issues which came up as part of TIKA-1285 
> (Upgrade to PDFBox 2.0.0 when available) mainly
> - new exceptions compared to PDFBox 1.8.x
> - regressions in text extraction
> - lower quality text extraction
> There should be individual issues to track tasks/bugs arising from that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

