[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172978#comment-14172978
 ] 

Tilman Hausherr commented on TIKA-1442:
---------------------------------------

Files that yield only junk text in Adobe Reader (AR):

661/661834.pdf
565/565010.pdf
248/248787.pdf
979/979474.pdf
831/831528.pdf
638/638488.pdf
878/878499.pdf
503/503035.pdf
289/289669.pdf

File that may contain a virus:
345/345947.pdf (wasn't in the last test set)

Files that produce an error when opened with AR (although they can be displayed):
092/092919.pdf
435/435321.pdf
995/995773.pdf
078/078278.pdf
210/210260.pdf
219/219789.pdf
230/230877.pdf
268/268554.pdf
367/367594.pdf
392/392154.pdf
475/475121.pdf
477/477047.pdf
551/551464.pdf
615/615614.pdf
707/707505.pdf
714/714002.pdf
738/738627.pdf
819/819127.pdf
101/101819.pdf
359/359872.pdf
523/523690.pdf

Surprisingly, some files with LZW errors display in AR without an error 
message. Either AR keeps quiet about the problem, or there is still a bug in 
the PDFBox LZW decoder. Both are possible: AR doesn't show every error, and 
the PDFBox LZW decoder is 
[tricky|https://issues.apache.org/jira/issues/?jql=labels%20%3D%20LZW].

> Upgrade to PDFBox 1.8.8
> -----------------------
>
>                 Key: TIKA-1442
>                 URL: https://issues.apache.org/jira/browse/TIKA-1442
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>             Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
