[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-25 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212052#comment-15212052
 ] 

John Hewson commented on TIKA-1285:
---

It would be better to open JIRA issues for problem PDFs so that we can improve 
the 2.0 parser.

> Upgrade to PDFBox 2.0.0 when available
> --
>
> Key: TIKA-1285
> URL: https://issues.apache.org/jira/browse/TIKA-1285
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.6
> Reporter: Jeremy Anderson
> Fix For: 1.13
>
> Attachments: TIKA-1285.patch, TIKA-1285_rev1641423.patch, 
> TIKA-1285v3.patch, pdfbox_reports_2_0_0_20150709.zip, 
> testPDF_childAttachments.pdf
>
>
> This issue is to track fixes required when upgrading the PDFBox dependency to 
> 2.0.0 Final once it's available, and using PDFBox's daily build before then.
> See TIKA-1268 comment.
> Relates to PDFBOX-1893



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-25 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212049#comment-15212049
 ] 

John Hewson edited comment on TIKA-1285 at 3/25/16 4:42 PM:


The parser and the rest of PDFBox are tightly coupled, so it's not possible to 
switch out the 2.0 parser for the 1.8 parser. You'd have to switch out the 
whole of PDFBox, which of course you could do if you wanted.


was (Author: jahewson):
The parser and the rest of PDFBox are tightly coupled, so it's not possible to 
switch out the 2.0 parser for the 1.8 parser. You'd have to switch out the 
whole of PDFBox, which of course you could do.






[jira] [Commented] (TIKA-1285) Upgrade to PDFBox 2.0.0 when available

2016-03-25 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212049#comment-15212049
 ] 

John Hewson commented on TIKA-1285:
---

The parser and the rest of PDFBox are tightly coupled, so it's not possible to 
switch out the 2.0 parser for the 1.8 parser. You'd have to switch out the 
whole of PDFBox, which of course you could do.






[jira] [Commented] (TIKA-1671) Wrapped lines in PDF files not processed correctly

2015-07-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14631745#comment-14631745
 ] 

John Hewson commented on TIKA-1671:
---

Yep, it's possible to "tag" PDFs with accessible text, but this is rarely done. 
Other than that, a PDF contains no newlines, much like SVG.

> Wrapped lines in PDF files not processed correctly
> --
>
> Key: TIKA-1671
> URL: https://issues.apache.org/jira/browse/TIKA-1671
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.9
> Reporter: James Baker
> Labels: pdf, wrapping
> Attachments: Test Document.pdf
>
>
> Text that wraps over multiple lines in PDF documents is not extracted 
> correctly by Tika. The expected behaviour would be for it to be extracted as 
> a single line, but instead a line break is inserted at each wrap point.
> This makes it hard, if not impossible, to reassemble text into its intended 
> form, as it is not known whether a line break in the extracted text is one 
> that appeared in the document or one that was inserted by Tika.





[jira] [Commented] (TIKA-1542) Substitute Apache TTF test file for current non-Apache friendly file

2015-02-06 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14309864#comment-14309864
 ] 

John Hewson commented on TIKA-1542:
---

That's a good choice.

> Substitute Apache TTF test file for current non-Apache friendly file
> 
>
> Key: TIKA-1542
> URL: https://issues.apache.org/jira/browse/TIKA-1542
> Project: Tika
> Issue Type: Task
> Affects Versions: 1.7
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Trivial
> Fix For: 1.8
>
>
> On PDFBOX-2383, copyrighted docs were identified in the test cases.  The ttf 
> test file was one of the offenders, and it came from our test.  Let's remove 
> our current file and substitute Aclonica, licensed as Apache v2.


