[jira] [Commented] (TIKA-1548) System property added while catching exception on parsing PDF encrypted doc

2015-02-11 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316723#comment-14316723
 ] 

Tilman Hausherr commented on TIKA-1548:
---

Sorry, no. We're not setting that one. It isn't in our code.
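
To narrow down which call introduces a property like this, one option is to snapshot the property names before and after the suspect call and print the difference. This is only a minimal sketch, assuming Java 8 or later for the lambda (the report below used 1.7); the class and method names are made up for illustration and are not part of Tika or PDFBox.

```java
import java.util.Set;
import java.util.TreeSet;

class SystemPropertyDiffSketch {

    /** Runs the action and returns the names of system properties that exist afterwards but not before. */
    static Set<String> propertiesAddedBy(Runnable action) {
        Set<String> before = new TreeSet<>(System.getProperties().stringPropertyNames());
        try {
            action.run();
        } catch (Throwable ignored) {
            // A failing parse is expected here; only the side effect on the properties matters.
        }
        Set<String> after = new TreeSet<>(System.getProperties().stringPropertyNames());
        after.removeAll(before);
        return after;
    }

    public static void main(String[] args) {
        // Stand-in for the real parse call: this action simply sets a made-up property.
        Set<String> added = propertiesAddedBy(
                () -> System.setProperty("example.sketch.property", "x"));
        System.out.println("Added: " + added);
        System.clearProperty("example.sketch.property");
    }
}
```

Wrapping successively smaller pieces of the parse in propertiesAddedBy would show whether the property appears inside the library call or in something it triggers, such as JDK font initialisation.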

> System property added while catching exception on parsing PDF encrypted doc
> ---
>
> Key: TIKA-1548
> URL: https://issues.apache.org/jira/browse/TIKA-1548
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
> Environment: Mac OS 10.10.2
> java version "1.7.0_60"
>Reporter: David Pilato
>
> I'm using Tika 1.7. I'm parsing an encrypted PDF document, which raises an 
> exception. So far, so good.
> My concern is that afterwards a new System property, 
> {{sun.font.fontmanager}}, is set to {{sun.font.CFontManager}}. 
> Code to reproduce the error:
> {code:java}
> @Test
> public void testSystem() {
>     Properties props = System.getProperties();
>     assertThat(props.get("sun.font.fontmanager"), nullValue());
>     try {
>         tika().parseToString(new URL("https://github.com/elasticsearch/elasticsearch-mapper-attachments/raw/master/src/test/resources/org/elasticsearch/index/mapper/xcontent/encrypted.pdf"));
>     } catch (Throwable e) {
>         // The parse failure itself is expected; only the property is under test.
>     }
>     assertThat(props.get("sun.font.fontmanager"), nullValue());
> }
> {code}
> With Tika 1.7:
> {code}
> [2015-02-11 16:43:36,166][INFO ][org.apache.pdfbox.pdfparser.PDFParser] 
> Document is encrypted
> [2015-02-11 16:43:36,837][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,837][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,838][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,838][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,839][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,840][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,840][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,841][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,841][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> [2015-02-11 16:43:36,842][ERROR][org.apache.pdfbox.filter.FlateFilter] 
> FlateFilter: stop reading corrupt stream due to a DataFormatException
> java.lang.AssertionError: 
> Expected: null
>  but: was "sun.font.CFontManager"
>  
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
>   at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
>   at 
> org.elasticsearch.plugin.mapper.attachments.test.TikaSystemTest.testSystem(TikaSystemTest.java:41)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:157)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:74)
>   at 
> com.int

[jira] [Commented] (TIKA-1038) Parsing PDF with StackOverlowError

2015-03-04 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347377#comment-14347377
 ] 

Tilman Hausherr commented on TIKA-1038:
---

[~talli...@mitre.org]are you watching this one? I made a (hopefully useful) 
response in PDFBOX-1835.
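
The stack trace quoted below repeats COSDictionary.toString at the same line, which is the pattern one would expect if a dictionary reaches itself again through one of its entries. That is an assumption about these files, not something confirmed here. The following is a minimal sketch of guarding a recursive toString with an identity set; Dict is a hypothetical stand-in, not PDFBox's COSDictionary, and this is not necessarily how PDFBox addresses it.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class CycleSafeToStringSketch {

    /** Hypothetical dictionary-like object whose values may point back to an ancestor. */
    static final class Dict {
        final Map<String, Object> items = new LinkedHashMap<>();

        @Override
        public String toString() {
            // Identity-based set of the dictionaries currently being printed.
            Set<Dict> visiting = Collections.newSetFromMap(new IdentityHashMap<Dict, Boolean>());
            StringBuilder sb = new StringBuilder();
            append(sb, visiting);
            return sb.toString();
        }

        private void append(StringBuilder sb, Set<Dict> visiting) {
            if (!visiting.add(this)) {
                // Already on the current path: print a marker instead of recursing forever.
                sb.append("{...}");
                return;
            }
            sb.append('{');
            for (Map.Entry<String, Object> entry : items.entrySet()) {
                sb.append(entry.getKey()).append(':');
                Object value = entry.getValue();
                if (value instanceof Dict) {
                    ((Dict) value).append(sb, visiting);
                } else {
                    sb.append(value);
                }
                sb.append(';');
            }
            sb.append('}');
            visiting.remove(this);
        }
    }

    public static void main(String[] args) {
        Dict parent = new Dict();
        Dict child = new Dict();
        parent.items.put("Kids", child);
        child.items.put("Parent", parent); // back reference creates the cycle
        System.out.println(parent);
    }
}
```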

> Parsing PDF with StackOverlowError 
> ---
>
> Key: TIKA-1038
> URL: https://issues.apache.org/jira/browse/TIKA-1038
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
>Reporter: Konstantin Privezentsev
>
> Tika crashes with a StackOverflowError on some PDF documents:
> http://www.ellipse-labo.com/fiches/1303214351.pdf
> http://downloads.joomlacode.org/frsrelease/5/4/0/54089/handbuch_ckforms-DE-1.3.2.pdf
> Code:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser(
> new TypeDetector(),
> new PDFParser(),
> new OfficeParser(),
> new HtmlParser(),
> new RTFParser(),
> new OOXMLParser());
> WriteOutContentHandler contentHandler = new WriteOutContentHandler();
> Metadata metadata = new Metadata();
> parser.parse(contentStream, new BodyContentHandler(contentHandler), metadata, 
> new ParseContext());
> {code}
> Stack trace:
> {code}
> java.lang.StackOverflowError
>   at java.util.LinkedHashMap$LinkedHashIterator.<init>(LinkedHashMap.java:345)
>   at java.util.LinkedHashMap$LinkedHashIterator.<init>(LinkedHashMap.java:345)
>   at java.util.LinkedHashMap$KeyIterator.<init>(LinkedHashMap.java:383)
>   at java.util.LinkedHashMap$KeyIterator.<init>(LinkedHashMap.java:383)
>   at java.util.LinkedHashMap.newKeyIterator(LinkedHashMap.java:396)
>   at java.util.HashMap$KeySet.iterator(HashMap.java:874)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1416)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1038) Parsing PDF with StackOverlowError

2015-03-04 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14347377#comment-14347377
 ] 

Tilman Hausherr edited comment on TIKA-1038 at 3/4/15 6:59 PM:
---

[~talli...@mitre.org]  are you watching this one? I made a (hopefully useful) 
response in PDFBOX-1835.


was (Author: tilman):
[~talli...@mitre.org]are you watching this one? I made a (hopefully useful) 
response in PDFBOX-1835.

> Parsing PDF with StackOverlowError 
> ---
>
> Key: TIKA-1038
> URL: https://issues.apache.org/jira/browse/TIKA-1038
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
>Reporter: Konstantin Privezentsev
>
> Tika crashes with a StackOverflowError on some PDF documents:
> http://www.ellipse-labo.com/fiches/1303214351.pdf
> http://downloads.joomlacode.org/frsrelease/5/4/0/54089/handbuch_ckforms-DE-1.3.2.pdf
> Code:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser(
> new TypeDetector(),
> new PDFParser(),
> new OfficeParser(),
> new HtmlParser(),
> new RTFParser(),
> new OOXMLParser());
> WriteOutContentHandler contentHandler = new WriteOutContentHandler();
> Metadata metadata = new Metadata();
> parser.parse(contentStream, new BodyContentHandler(contentHandler), metadata, 
> new ParseContext());
> {code}
> Stack trace:
> {code}
> java.lang.StackOverflowError
>   at java.util.LinkedHashMap$LinkedHashIterator.<init>(LinkedHashMap.java:345)
>   at java.util.LinkedHashMap$LinkedHashIterator.<init>(LinkedHashMap.java:345)
>   at java.util.LinkedHashMap$KeyIterator.<init>(LinkedHashMap.java:383)
>   at java.util.LinkedHashMap$KeyIterator.<init>(LinkedHashMap.java:383)
>   at java.util.LinkedHashMap.newKeyIterator(LinkedHashMap.java:396)
>   at java.util.HashMap$KeySet.iterator(HashMap.java:874)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1416)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
>   at org.apache.pdfbox.cos.COSDictionary.toString(COSDictionary.java:1421)
> ...
> {code}
>  





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-15 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362365#comment-14362365
 ] 

Tilman Hausherr commented on TIKA-1575:
---

{code}
b) might be actual modest regressions with
147/147012.pdf
223/223704.pdf
{code}
No difference with extractText. I've opened PDFBOX-2710 about the missing form 
fields.

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 10-814_Appendix B_v3.pdf, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-15 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362406#comment-14362406
 ] 

Tilman Hausherr commented on TIKA-1575:
---

[~talli...@apache.org] please repeat the whole test - Maruan fixed the bug. 
(Which wouldn't have been discovered so fast without the test!)

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 10-814_Appendix B_v3.pdf, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1174) Invalid characters in filtered PDF output

2015-03-15 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362552#comment-14362552
 ] 

Tilman Hausherr commented on TIKA-1174:
---

Can't comment, I'm not that good with font issues, and I don't get the warning 
when running ExtractText.

> Invalid characters in filtered PDF output
> -
>
> Key: TIKA-1174
> URL: https://issues.apache.org/jira/browse/TIKA-1174
> Project: Tika
>  Issue Type: Bug
>  Components: parser
> Environment: Mac OS X 10.8.5, Java 1.7u40 (but also seen on CentOS5)
>Reporter: Matt Sheppard
>Priority: Minor
> Attachments: map_sp_1c_a4.pdf
>
>
> The PDF document at 
> http://www.logan.qld.gov.au/__data/assets/pdf_file/0010/9496/map_sp_1a_a4.pdf 
> produces invalid characters in the output when filtered by Tika 1.4.
> {noformat}
> > /opt/funnelback/mbin/java/bin/java -jar tika-app-1.4.jar map_sp_1c_a4.pdf | head -n 40
> ERROR - Error: Could not parse predefined CMAP file for 'nullžf 
> °-ˇžl,¡ì$1-UCS2'
> <html xmlns="http://www.w3.org/1999/xhtml">
> 
> [snip]
> Cycle network
> 
> 
> 
> HILEY
> 
> {noformat}
> Is there any proper way to avoid this, or is the best approach to strip such 
> characters from Tika's output?
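
If stripping turns out to be the pragmatic route, a filter along these lines could be applied to the extracted text. This is only a sketch: it assumes that "invalid" means code points outside the XML 1.0 Char production (plus unpaired surrogates), which may or may not match the characters seen in this report, and the class and method names are made up for illustration.

```java
class InvalidCharFilterSketch {

    /** True if the code point is allowed by the XML 1.0 Char production. */
    static boolean isAllowedInXml10(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /** Returns the text with every disallowed code point dropped. */
    static String stripInvalid(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (isAllowedInXml10(codePoint)) {
                sb.appendCodePoint(codePoint);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // NUL and BEL are control characters that XML 1.0 does not allow.
        System.out.println(stripInvalid("Cycle\u0000 net\u0007work"));
    }
}
```

Filtering after the fact hides the symptom only; it does not explain the CMAP error above.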





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363061#comment-14363061
 ] 

Tilman Hausherr commented on TIKA-1575:
---

Yes!

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 10-814_Appendix B_v3.pdf, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364710#comment-14364710
 ] 

Tilman Hausherr commented on TIKA-1575:
---

Could you attach the TIKA output you get with 1.8.8 for 005937.pdf? For 
example, I don't get the word "monitoring" anywhere in that PDF.
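
A quick way to check for words that show up in one extraction but not the other is a set difference over the tokens. The sketch below assumes both outputs are already available as plain strings and uses a deliberately crude tokenizer; the names are hypothetical and this is not how the comparison reports attached to this issue were produced.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

class ExtractionDiffSketch {

    /** Splits text into a sorted set of lower-cased words (runs of letters and digits). */
    static Set<String> words(String text) {
        Set<String> result = new TreeSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    /** Words that occur in the first text but nowhere in the second. */
    static Set<String> onlyInFirst(String first, String second) {
        Set<String> difference = words(first);
        difference.removeAll(words(second));
        return difference;
    }

    public static void main(String[] args) {
        // Made-up sample strings standing in for two extractions of the same PDF.
        String viaTika = "Remote monitoring report";
        String viaExtractText = "remote report";
        System.out.println(onlyInFirst(viaTika, viaExtractText));
    }
}
```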

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 10-814_Appendix B_v3.pdf, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365524#comment-14365524
 ] 

Tilman Hausherr commented on TIKA-1575:
---

I can't understand how you get the extracted text for p14. I don't get any with 
PDFBox versions 1.8.6, 1.8.7, 1.8.8, or 1.8.9?! Is TIKA using some OCR 
software in addition to PDF extraction?

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365807#comment-14365807
 ] 

Tilman Hausherr commented on TIKA-1575:
---

Can't tell, I don't know much about the structure of 1.8.*. Years ago I started 
to use the unreleased 2.0 version and I was hooked :-)

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14365829#comment-14365829
 ] 

Tilman Hausherr commented on TIKA-1575:
---

Thanks. Re: OCR, you should know that there was a GSoC2014 project by PDFBox 
committer John Hewson to combine tesseract and PDFBox (PDFBOX-1912). I don't 
know the details, but maybe it could be useful for TIKA.

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368686#comment-14368686
 ] 

Tilman Hausherr commented on TIKA-1575:
---

With the pure ExtractText, all is identical. Could you attach the files you get 
for 524276.pdf and 719128.pdf?

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368687#comment-14368687
 ] 

Tilman Hausherr commented on TIKA-1575:
---

With the pure ExtractText, all is identical. Could you attach the files you get 
for 524276.pdf and 719128.pdf?

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-19 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1575:
--
Comment: was deleted

(was: With the pure ExtractText, all is identical. Could you attach the files 
you get for 524276.pdf and 719128.pdf?)

> Upgrade to PDFBox 1.8.9 when available
> --
>
> Key: TIKA-1575
> URL: https://issues.apache.org/jira/browse/TIKA-1575
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
> 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
> PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
> PDFBox_1_8_8Vs1_8_9_20150316.zip, content_diffs_20150316.xlsx
>
>
> The PDFBox community is about to release 1.8.9.  Let's use this issue to 
> track discussions before the release and to track Tika's upgrade to PDFBox 
> 1.8.9





[jira] [Commented] (TIKA-1588) Upgrade to PDFBox 1.8.10 when available

2015-07-15 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628890#comment-14628890
 ] 

Tilman Hausherr commented on TIKA-1588:
---

The weird thing is that I can't find any differences with ExtractText and 
default settings. "respondæ" appears in both extractions. "æ" is an arrow in 
the PDF.

> Upgrade to PDFBox 1.8.10 when available
> ---
>
> Key: TIKA-1588
> URL: https://issues.apache.org/jira/browse/TIKA-1588
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: reports_1_8_9_vs_1_8_10.zip
>
>
> Let's use this ticket to discuss/prepare for the release and integration of 
> PDFBox 1.8.10 when it is available.





[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-18 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632429#comment-14632429
 ] 

Tilman Hausherr commented on TIKA-1678:
---

I think this is two bytes. I.e. a 0x0 and a 'B'.

> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.9
>Reporter: Andrew Jackson
>Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> < /Subtype/XML/Length 1978>>stream
> 
> 
> 
>  xmlns:iX='http://ns.adobe.com/iX/1.0/'>
>  xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
>  xmlns:xmp='http://ns.adobe.com/xap/1.0/'>2012-07-18T15:38:01+01:00
> 2012-07-18T15:38:01+01:00
> UnknownApplication
>  xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
>  xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x\376\377\000T\000e\000t\000t\000i
> 
> 
> 
> endstream
> endobj
> 2 0 obj
> < \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)





[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-18 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632432#comment-14632432
 ] 

Tilman Hausherr commented on TIKA-1678:
---

I get correct output for the non-XMP stuff with this code:
{code}
PDDocumentInformation info = doc.getDocumentInformation();
System.out.println("Author: " + info.getAuthor());
System.out.println("Producer: " + info.getProducer());
System.out.println("Title: " + info.getTitle());
System.out.println("Title contains 'Microsoft': " + 
info.getTitle().contains("Microsoft"));
{code}
{quote}
Author: Tetti
Producer: Bullzip PDF Printer / www.bullzip.com / Freeware Edition
Title: Microsoft PowerPoint - Introduction to Worklight (SRD).pptx
Title contains 'Microsoft': true
{quote}
So I don't know why TIKA gets this:
{code}
title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
\000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
\000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 \000t\000o\000 
\000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
\000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
{code}
A look at the TIKA source code (PDFParser.extractMultilingualItems()) shows 
that if both have a value, then the XMP value is taken. I don't use tika, so I 
don't know what the difference between "title" and "dc:title" is. I suspect it 
is the same, and taken from XMP.
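
One conceivable guard, given that precedence, would be to fall back to the document information value when the XMP value still looks like an undecoded UTF-16 string. The sketch below is purely illustrative: chooseTitle and looksLikeUndecodedUtf16 are hypothetical names, the check for a literal "\376\377" prefix is an assumption based on the output shown above, and none of this is taken from the TIKA source.

```java
class TitlePrecedenceSketch {

    /** True if the value starts with the escaped UTF-16BE byte order mark as literal text. */
    static boolean looksLikeUndecodedUtf16(String value) {
        return value != null && value.startsWith("\\376\\377");
    }

    /** Prefers the XMP value unless it is missing or looks undecoded; otherwise uses the info value. */
    static String chooseTitle(String xmpTitle, String infoTitle) {
        if (xmpTitle != null && !xmpTitle.isEmpty() && !looksLikeUndecodedUtf16(xmpTitle)) {
            return xmpTitle;
        }
        return infoTitle != null ? infoTitle : xmpTitle;
    }

    public static void main(String[] args) {
        String xmp = "\\376\\377\\000M\\000i\\000c\\000r\\000o";
        String info = "Microsoft PowerPoint - Introduction to Worklight (SRD).pptx";
        System.out.println(chooseTitle(xmp, info));
    }
}
```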

> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.9
>Reporter: Andrew Jackson
>Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> < /Subtype/XML/Length 1978>>stream
> 
> 
> 
>  xmlns:iX='http://ns.adobe.com/iX/1.0/'>
>  xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
>  xmlns:xmp='http://ns.adobe.com/xap/1.0/'>2012-07-18T15:38:01+01:00
> 2012-07-18T15:38:01+01:00
> UnknownApplication
>  xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
>  xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x\376\377\000T\000e\000t\000t\000i
> 
> 
> 
> endstream
> endobj
> 2 0 obj
> < \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000

[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632429#comment-14632429
 ] 

Tilman Hausherr edited comment on TIKA-1678 at 7/19/15 11:21 AM:
-

I think this is two bytes. I.e. a 0x0 and a 'B'. In PDF, octals are written as 
\\ddd, i.e. always three digits.


was (Author: tilman):
I think this is two bytes. I.e. a 0x0 and a 'B'.

> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.9
>Reporter: Andrew Jackson
>Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> < /Subtype/XML/Length 1978>>stream
> 
> 
> 
>  xmlns:iX='http://ns.adobe.com/iX/1.0/'>
>  xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
>  xmlns:xmp='http://ns.adobe.com/xap/1.0/'>2012-07-18T15:38:01+01:00
> 2012-07-18T15:38:01+01:00
> UnknownApplication
>  xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
>  xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x\376\377\000T\000e\000t\000t\000i
> 
> 
> 
> endstream
> endobj
> 2 0 obj
> < \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14632429#comment-14632429
 ] 

Tilman Hausherr edited comment on TIKA-1678 at 7/19/15 11:22 AM:
-

I think this is two bytes. I.e. a 0x0 and a 'B'. In PDF, octals are written as 
{{\ddd}}, i.e. always three digits. The next character you see is just that, a 
character, and not part of an octal sequence.
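
To illustrate the point about {{\000B}} being two bytes, here is a minimal sketch of how such escapes could be turned into bytes. It is not the PDFBox implementation; the class name PdfOctalEscapes and the method unescape are made up for this example. It accepts up to three octal digits after a backslash and treats whatever follows as an ordinary character. Other escapes such as \n or \r are deliberately left out.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, for illustration only (not part of PDFBox or Tika).
public class PdfOctalEscapes
{
    // Converts the escaped body of a PDF literal string into raw bytes.
    public static byte[] unescape(String escaped)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length())
        {
            char c = escaped.charAt(i++);
            if (c != '\\' || i == escaped.length())
            {
                out.write(c); // ordinary character: exactly one byte
                continue;
            }
            char next = escaped.charAt(i);
            if (next >= '0' && next <= '7')
            {
                // Consume at most three octal digits; the character after
                // them is NOT part of the escape.
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < escaped.length()
                        && escaped.charAt(i) >= '0' && escaped.charAt(i) <= '7')
                {
                    value = value * 8 + (escaped.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            }
            else
            {
                // Escaped delimiter or backslash: keep the character itself.
                // (Escapes like \n and \r are not handled in this sketch.)
                out.write(next);
                i++;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args)
    {
        byte[] bytes = unescape("\\000B");
        // prints "2 bytes: 0, 66"
        System.out.println(bytes.length + " bytes: " + bytes[0] + ", " + bytes[1]);
    }
}
```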


was (Author: tilman):
I think this is two bytes. I.e. a 0x0 and a 'B'. In PDF, octals are written as 
\\ddd, i.e. always three digits.



[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633687#comment-14633687
 ] 

Tilman Hausherr commented on TIKA-1678:
---

sure:
{code}
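// Imports assume the PDFBox 2.0 package layout.
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.BaseParser;

public class Tika1678 extends BaseParser
{
    public static void main(String[] args) throws IOException
    {
        Tika1678 tika1678 = new Tika1678();

        // Escaped Producer value as it appears in the PDF (one logical string).
        String s1 = "\\376\\377\\000B\\000u\\000l\\000l\\000z\\000i\\000p\\000 "
                + "\\000P\\000D\\000F\\000 \\000P\\000r\\000i\\000n\\000t\\000e\\000r\\000 "
                + "\\000/\\000 "
                + "\\000w\\000w\\000w\\000.\\000b\\000u\\000l\\000l\\000z\\000i\\000p\\000.\\000c\\000o\\000m\\000"
                + " \\000/\\000 \\000F\\000r\\000e\\000e\\000w\\000a\\000r\\000e\\000 "
                + "\\000E\\000d\\000i\\000t\\000i\\000o\\000n";
        // Wrap in parentheses so that it parses as a PDF literal string.
        String s2 = "(" + s1 + ")";

        ByteArrayInputStream bais = new ByteArrayInputStream(s2.getBytes("ISO-8859-1"));
        tika1678.pdfSource = new RandomAccessBufferedFileInputStream(bais);
        COSString cosString = tika1678.parseCOSString();
        System.out.println("cosString: " + cosString.getString());
    }
}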
{code}
output:
{quote}
cosString: Bullzip PDF Printer / www.bullzip.com / Freeware Edition
{quote}

However, this code will only work for examples like the one mentioned. I expect 
mayhem if it is used on a UTF-8 sequence.
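
One way to read that caveat: the bytes should only be treated as UTF-16 when they really start with the FE FF byte order mark. The following is a minimal sketch under that assumption, not code from PDFBox or Tika. The class name PdfTextStringDecoder is made up, and ISO-8859-1 is only a stand-in for PDFDocEncoding in the fallback branch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not part of PDFBox or Tika).
public class PdfTextStringDecoder
{
    // Decodes already-unescaped string bytes, choosing the charset by BOM.
    public static String decode(byte[] raw)
    {
        boolean hasUtf16BeBom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasUtf16BeBom)
        {
            // Skip the two BOM bytes and decode the rest as UTF-16BE.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding here.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args)
    {
        byte[] author = {(byte) 0xFE, (byte) 0xFF, 0, 'T', 0, 'e', 0, 't', 0, 't', 0, 'i'};
        // prints "Tetti"
        System.out.println(decode(author));
    }
}
```

Bytes without the BOM take the fallback branch unchanged, so a plain single-byte string is not run through the UTF-16 decoder.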



[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633722#comment-14633722
 ] 

Tilman Hausherr commented on TIKA-1678:
---

Yes, such a string check would be useful. Or just check for backslash number 
number number.
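
A check of that kind could look like the sketch below. The comment proposing the string check is not quoted in this message, so the sketch assumes it means a test for the escaped byte order mark prefix \376\377. The class and method names are made up for this example, and "number" is narrowed to the octal digits 0 to 7.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of PDFBox or Tika).
public class OctalEscapeCheck
{
    // A backslash followed by exactly three octal digits.
    private static final Pattern OCTAL_ESCAPE = Pattern.compile("\\\\[0-7]{3}");

    // Escaped form of the UTF-16BE byte order mark (FE FF).
    private static final String ESCAPED_BOM = "\\376\\377";

    public static boolean containsOctalEscape(String value)
    {
        return value != null && OCTAL_ESCAPE.matcher(value).find();
    }

    public static boolean startsWithEscapedBom(String value)
    {
        return value != null && value.startsWith(ESCAPED_BOM);
    }

    public static void main(String[] args)
    {
        String raw = "\\376\\377\\000T\\000e\\000t\\000t\\000i";
        System.out.println(containsOctalEscape(raw));     // prints "true"
        System.out.println(startsWithEscapedBom(raw));    // prints "true"
        System.out.println(containsOctalEscape("Tetti")); // prints "false"
    }
}
```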



[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634045#comment-14634045
 ] 

Tilman Hausherr commented on TIKA-1678:
---

Likely a bug. I tried calling getTitle after setTitle and get an NPE. I 
submitted the PDFA file to 
http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx
and it fails. And I compared a PDFA file from the Bavaria tests to ours and 
there's a difference:
{code}
  

  this is the title

  
{code}
{code}
  

  PDF/A-1b test

  
{code}




[jira] [Comment Edited] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634045#comment-14634045
 ] 

Tilman Hausherr edited comment on TIKA-1678 at 7/20/15 8:41 PM:


Likely a bug. I tried calling getTitle after setTitle and get an NPE. I 
submitted the PDFA file to 
http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx
and it fails. And I compared a PDFA file from the Bavaria tests to ours and 
there's a difference:
{code}
  

  this is the title

  
{code}
{code}
  

  PDF/A-1b test

  
{code}



was (Author: tilman):
Likely a bug. I tried calling getTitele after setTitle and get an NPE. I 
submitted the PDFA file to 
http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx
and it fails. And I compared a PDFA file from the Bavaria tests to ours and 
there's a difference:
{code}
  

  this is the title

  
{code}
{code}
  

  PDF/A-1b test

  
{code}



[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-20 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14634065#comment-14634065
 ] 

Tilman Hausherr commented on TIKA-1678:
---

Yes please do and attach the file. It's late here. It's two bugs, the wrong XML 
is produced, and the bad XML isn't flagged by preflight.



[jira] [Commented] (TIKA-1678) PDF metadata extraction fails to spot UTF-16 encoded title

2015-07-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637232#comment-14637232
 ] 

Tilman Hausherr commented on TIKA-1678:
---

API has changed again. This code works:
{code}
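// Imports assume the PDFBox 2.0 package layout.
import java.io.ByteArrayInputStream;
import java.io.IOException;

import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.io.RandomAccessRead;
import org.apache.pdfbox.pdfparser.COSParser;

public class Tika1678 extends COSParser
{
    public static void main(String[] args) throws IOException
    {
        // Producer string from the sample PDF, written as escaped octal UTF-16BE octets.
        String s1 = "\\376\\377\\000B\\000u\\000l\\000l\\000z\\000i\\000p\\000 "
                + "\\000P\\000D\\000F\\000 \\000P\\000r\\000i\\000n\\000t\\000e\\000r\\000 "
                + "\\000/\\000 "
                + "\\000w\\000w\\000w\\000.\\000b\\000u\\000l\\000l\\000z\\000i\\000p\\000.\\000c\\000o\\000m\\000 "
                + "\\000/\\000 \\000F\\000r\\000e\\000e\\000w\\000a\\000r\\000e\\000 "
                + "\\000E\\000d\\000i\\000t\\000i\\000o\\000n";
        // Wrap in parentheses so that it parses as a PDF literal string.
        String s2 = "(" + s1 + ")";

        ByteArrayInputStream bais = new ByteArrayInputStream(s2.getBytes("ISO-8859-1"));
        Tika1678 tika1678 = new Tika1678(new RandomAccessBufferedFileInputStream(bais));
        COSString cosString = tika1678.parseCOSString();
        System.out.println("cosString: " + cosString.getString());
    }

    public Tika1678(RandomAccessRead source)
    {
        super(source);
    }
}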
{code}
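
For readers without PDFBox at hand, the decoding step itself can be sketched in plain Java. This is only an illustration of the idea (octal escapes become raw bytes, and a leading FE FF byte order mark signals UTF-16BE), not what PDFBox or Tika actually does internally. The class and method names are made up for this sketch, and the fallback branch uses ISO-8859-1 as a rough stand-in for PDFDocEncoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox or Tika.
class PdfTextStringSketch {

    /** Turns the escaped body of a PDF literal string into raw bytes. */
    public static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i++);
            if (c != '\\' || i >= literal.length()) {
                out.write(c); // ordinary character, written as one byte
                continue;
            }
            char next = literal.charAt(i);
            if (next >= '0' && next <= '7') {
                // Octal escape: one to three octal digits.
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < literal.length()
                        && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                // Escaped delimiter such as \( or \) or \\.
                // Other escapes (\n, \r, ...) are not handled in this sketch.
                out.write(next);
                i++;
            }
        }
        return out.toByteArray();
    }

    /** Decodes as UTF-16BE when the bytes start with the FE FF byte order mark. */
    public static String decode(byte[] raw) {
        if (raw.length >= 2 && (raw[0] & 0xFF) == 0xFE && (raw[1] & 0xFF) == 0xFF) {
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 as an approximation of PDFDocEncoding.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The /Author value quoted in the issue description.
        String author = "\\376\\377\\000T\\000e\\000t\\000t\\000i";
        System.out.println(decode(unescape(author)));
    }
}
```

Run against the /Author value from the issue, this yields the same "Tetti" that Tika reports for the correctly decoded field.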


> PDF metadata extraction fails to spot UTF-16 encoded title
> --
>
> Key: TIKA-1678
> URL: https://issues.apache.org/jira/browse/TIKA-1678
> Project: Tika
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.9
>Reporter: Andrew Jackson
>Priority: Minor
>
> When extracting metadata from PDFs, we see some odd behaviour for a minority 
> of the documents. The PDF metadata can be encoded as UTF-16 octets, but is 
> not always being decoded as such.
> A specific example is here: 
> http://mqug.org.uk/downloads/201207/201207%20-%20TEC02%20-%20Introduction%20to%20Worklight.pdf
> Which contains this (literal file content):
> {noformat}
> 443 0 obj
> < /Subtype/XML/Length 1978>>stream
> 
> 
> 
>  xmlns:iX='http://ns.adobe.com/iX/1.0/'>
>  xmlns:pdf='http://ns.adobe.com/pdf/1.3/' 
> pdf:Producer='\376\377\000B\000u\000l\000l\000z\000i\000p\000 
> \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n'/>
>  xmlns:xmp='http://ns.adobe.com/xap/1.0/'>2012-07-18T15:38:01+01:00
> 2012-07-18T15:38:01+01:00
> UnknownApplication
>  xmlns:xapMM='http://ns.adobe.com/xap/1.0/mm/' 
> xapMM:DocumentID='ac9f232e-d341-11e1--ba905bfc4694'/>
>  xmlns:dc='http://purl.org/dc/elements/1.1/' 
> dc:format='application/pdf'> xml:lang='x-default'>\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000
>  \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x\376\377\000T\000e\000t\000t\000i
> 
> 
> 
> endstream
> endobj
> 2 0 obj
> < \000P\000D\000F\000 \000P\000r\000i\000n\000t\000e\000r\000 \000/\000 
> \000w\000w\000w\000.\000b\000u\000l\000l\000z\000i\000p\000.\000c\000o\000m\000
>  \000/\000 \000F\000r\000e\000e\000w\000a\000r\000e\000 
> \000E\000d\000i\000t\000i\000o\000n)
> /CreationDate(D:20120718153801+01'00')
> /ModDate(D:20120718153801+01'00')
> /Title(\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x)
> /Author(\376\377\000T\000e\000t\000t\000i)>>endobj
> {noformat} 
> Presumably, embedding these UTF-16 octet sequences in the XMP RDF is an 
> error, but the ones encoded in the actual PDF metadata fields should be 
> extracted accurately.
> When extracted, we get:
> {noformat}
> ...
> dc:title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> title: \376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000 
> \000P\000o\000w\000e\000r\000P\000o\000i\000n\000t\000 \000-\000 
> \000I\000n\000t\000r\000o\000d\000u\000c\000t\000i\000o\000n\000 
> \000t\000o\000 \000W\000o\000r\000k\000l\000i\000g\000h\000t\000 
> \000\(\000S\000R\000D\000\)\000.\000p\000p\000t\000x
> meta:author: \376\377\000T\000e\000t\000t\000i
> meta:author: Tetti
> ...
> {noformat}
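> So, the author appears to be decoded correctly once, but the title is not. Is 
> the XML dc:title being used to override the PDF title field? Or is one of the 
> title fields being decoded incorrectly?
> (I accept that although this is a real PDF document from the web, it is also 
> a malformed one, so maybe there is not much to be done here.)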

[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901042#comment-14901042
 ] 

Tilman Hausherr commented on TIKA-1737:
---

Some of the exceptions (the ClassCastExceptions in the 
org.apache.pdfbox.util.operator package) have an obvious cause that would be 
easy to prevent. For others I would need to get the PDF files, and I'm not sure 
that these can be fixed in the 1.8 version.

The best approach would be to create an issue in PDFBox for each class of 
errors, and then track whether the number of unchecked exceptions goes down.
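
A minimal sketch of how such tracking could look on the caller's side, assuming each document is processed by some job that may throw. The `ExceptionTally` class is hypothetical and the jobs in `main` are stand-ins, not real PDFBox calls.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;

// Hypothetical helper for counting failures per exception class across a corpus run.
class ExceptionTally {
    private final Map<String, Integer> counts = new TreeMap<>();

    /** Runs one job and records the class name of any exception it throws. */
    public void run(Callable<?> job) {
        try {
            job.call();
        } catch (Exception e) {
            // Errors such as OutOfMemoryError are deliberately not caught here.
            counts.merge(e.getClass().getName(), 1, Integer::sum);
        }
    }

    public Map<String, Integer> counts() {
        return counts;
    }

    public static void main(String[] args) {
        ExceptionTally tally = new ExceptionTally();
        // Stand-ins for "parse one PDF"; each one fails in a different way.
        tally.run(() -> { Object o = "text"; return (Integer) o; });
        tally.run(() -> { int[] a = new int[1]; return a[2]; });
        tally.run(() -> "parsed fine");
        tally.counts().forEach((name, n) -> System.out.println(n + "  " + name));
    }
}
```

Comparing the resulting counts between two library versions on the same corpus gives one number per class of error, which maps naturally onto one issue per class.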

> PDFBox 1.8.10 is still a basket case
> 
>
> Key: TIKA-1737
> URL: https://issues.apache.org/jira/browse/TIKA-1737
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.10
> Environment: Linux, Solaris
>Reporter: Alan Burlison
> Attachments: pdfbox.txt
>
>
> In TIKA-1471 I reported OOM errors when parsing PDF files. According to that 
> bug the issues were fixed in 1.7. I've just updated to Tika 1.10 and rather 
> than PDFBox being better it's actually far, far worse. With the same corpus, 
> Tika 1.5 (PDFBox 1.8.6) has 13 exceptions thrown by PDFBox, Tika 1.10 (PDFBox 
> 1.8.10) has *453* exceptions thrown by PDFBox. Not only that, but as far as I 
> can tell, the memory leaks are even worse in 1.8.10 as well.
> I've had to resort to destroying the Tika instances and starting over each 
> time there's an error indexing a PDF file. It's so bad I'm going to switch to 
> running pdftotext (part of Xpdf) as an external process. Note that many of 
> the errors in PDFBox are clearly caused by programming errors, e.g. 
> ArrayIndexOutOfBoundsException, ClassCastException, NullPointerException and 
> EOFException.
> I strongly recommend that Tika either reverts back to PDFBox 1.8.6 or finds a 
> replacement for PDFBox as 1.8.10 just isn't fit for purpose.





[jira] [Comment Edited] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901042#comment-14901042
 ] 

Tilman Hausherr edited comment on TIKA-1737 at 9/21/15 8:49 PM:


Some of the exceptions (the ClassCastExceptions in the 
org.apache.pdfbox.util.operator package) have an obvious cause that I have 
fixed in PDFBOX-2982. For others I would need to get the PDF files, and I'm not 
sure that these can be fixed in the 1.8 version.

The best approach would be to create an issue in PDFBox for each class of 
errors, and then track whether the number of unchecked exceptions goes down.


was (Author: tilman):
Some of the exceptions (the ClassCastExceptions in the 
org.apache.pdfbox.util.operator package) have an obvious cause that would be 
easy to prevent. For others I would need to get the PDF files, and I'm not sure 
that these can be fixed in the 1.8 version.

The best approach would be to create an issue in PDFBox for each class of 
errors, and then track whether the number of unchecked exceptions goes down.



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-21 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901502#comment-14901502
 ] 

Tilman Hausherr commented on TIKA-1737:
---

We will definitely not be able to find the cause of the memory leaks without 
the files. You'll have to do that yourself, e.g. by running PDFTextStripper 
with the current version and with an older version, profiling both, and then 
stepping through different revisions to find out when it started to go bad.
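
Before reaching for a full profiler, a rough first comparison can be made with a small harness like the one below. It is only a sketch: `HeapGrowthProbe` is a made-up name, the job in `main` is a placeholder for the real text extraction call, and `System.gc()` is merely a hint to the JVM, so the numbers are indicative at best.

```java
// Hypothetical harness: run the same job repeatedly and watch the used heap.
class HeapGrowthProbe {

    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a hint; the JVM may ignore it
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Returns the used heap (in bytes) measured after each round of the job. */
    public static long[] sample(Runnable job, int rounds) {
        long[] used = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            job.run();
            used[i] = usedHeapAfterGc();
        }
        return used;
    }

    public static void main(String[] args) {
        // Placeholder job; substitute the real extraction of one test file here.
        Runnable job = () -> {
            byte[] scratch = new byte[1 << 20];
            scratch[0] = 1;
        };
        long[] used = sample(job, 10);
        for (int i = 0; i < used.length; i++) {
            System.out.println("round " + i + ": " + (used[i] / 1024) + " KiB used");
        }
    }
}
```

A figure that keeps climbing round after round with one library version but stays flat with another is a hint worth following up with a real profiler and a revision-by-revision search.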



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903259#comment-14903259
 ] 

Tilman Hausherr commented on TIKA-1737:
---

Re the ArrayIndexOutOfBoundsException - are you using multithreading? I wonder 
if it is possibly related to PDFBOX-2824. That was fixed in the 2.0 version 
only.

Re the NPE in PDFStreamEngine.java:355 - this is possibly solved in 1.8.11.
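
One way to rule threading in or out on the caller's side is to funnel every parse through a single worker thread and see whether the exception still occurs. The sketch below assumes nothing about PDFBox itself: `SingleThreadParsing` and its `parse` method are hypothetical stand-ins for the real call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical test setup: all parsing happens on one dedicated thread.
class SingleThreadParsing {

    /** Stand-in for the real parse call; returns the thread name and the input. */
    static String parse(String input) {
        return Thread.currentThread().getName() + ":" + input;
    }

    public static List<String> runSerially(List<String> inputs) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            List<Future<String>> pending = new ArrayList<>();
            for (String input : inputs) {
                Callable<String> job = () -> parse(input);
                pending.add(worker.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : pending) {
                results.add(f.get()); // waits for the job to finish
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("parse job failed", e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        for (String line : runSerially(Arrays.asList("a.pdf", "b.pdf", "c.pdf"))) {
            System.out.println(line);
        }
    }
}
```

If the ArrayIndexOutOfBoundsException disappears under this setup, concurrent access becomes the likely suspect.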



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-09-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903537#comment-14903537
 ] 

Tilman Hausherr commented on TIKA-1737:
---

No, PDFBOX-2987 is another one I fixed for you. The NPE in 
PDFStreamEngine.java:355 was (hopefully) fixed in PDFBOX-2935. To test this, 
you'd need to use a 1.8.11 snapshot version.



[jira] [Commented] (TIKA-1759) Extract contributor metadata from supporting file formats

2015-10-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940086#comment-14940086
 ] 

Tilman Hausherr commented on TIKA-1759:
---

But you already have the author from /Info and from the XMP metadata, isn't 
that enough?

> Extract contributor metadata from supporting file formats
> -
>
> Key: TIKA-1759
> URL: https://issues.apache.org/jira/browse/TIKA-1759
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Minor
> Attachments: contributors.zip
>
>
> Many common file formats store information about contributors (broadly 
> speaking) to a document.  We are currently extracting author/creator and 
> modifier/last author.  Let's add extraction for:
> # comment authors
> # revisers (authors who make changes with track changes on)
> # signers





[jira] [Commented] (TIKA-1759) Extract contributor metadata from supporting file formats

2015-10-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940158#comment-14940158
 ] 

Tilman Hausherr commented on TIKA-1759:
---

Sorry, I can't help you with that one, because I haven't worked with that part 
of the PDF spec. You'll have to ask the whole gang or look for yourself with 
PDFDebugger.



[jira] [Commented] (TIKA-1737) PDFBox 1.8.10 is still a basket case

2015-10-05 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14944574#comment-14944574
 ] 

Tilman Hausherr commented on TIKA-1737:
---

And I'd be interested to hear whether the situation described at the beginning 
has improved or not.



[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-13 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096866#comment-15096866
 ] 

Tilman Hausherr commented on TIKA-1830:
---

I can't reproduce the difference for the file 074531.pdf. ExtractText returns 
identical results, which makes me doubt the entire test :-(

I can reproduce the difference for 290377.pdf; this is because of a change in 
decompression (rev 1709182) that tries to squeeze as much as possible from 
corrupt streams.

There may be some differences due to a bugfix related to "article beads". This 
will mean improved results for files with correct beads, but worse results for 
files where bead rectangles are incorrect.
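
To illustrate what "squeezing as much as possible" from a damaged stream can mean, here is a sketch using only `java.util.zip`. It is not the code from rev 1709182; `LenientInflate` is a made-up name, and the approach shown (keep whatever was decoded before the failure) is just one plausible reading of the idea.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical sketch, not the PDFBox implementation.
class LenientInflate {

    /** Inflates as far as possible and returns whatever was decoded before a failure. */
    public static byte[] inflateLeniently(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()) {
                    break; // input exhausted (truncated stream): nothing more to decode
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Corrupt data: stop here and keep what was already written to out.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    /** Plain zlib compression, used here only to build test input. */
    public static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("line ").append(i).append(" of some page content\n");
        }
        byte[] original = sb.toString().getBytes(StandardCharsets.ISO_8859_1);
        byte[] compressed = deflate(original);
        byte[] truncated = Arrays.copyOf(compressed, compressed.length / 2);
        byte[] recovered = inflateLeniently(truncated);
        System.out.println("recovered " + recovered.length + " of " + original.length + " bytes");
    }
}
```

A stricter decoder would return nothing for such a stream, which is one way two versions can legitimately produce different text for the same corrupt file.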

> Upgrade to PDFBox 1.8.11 when available
> ---
>
> Key: TIKA-1830
> URL: https://issues.apache.org/jira/browse/TIKA-1830
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
> Attachments: reports_pdfbox_1_8_11-rc1.zip
>
>






[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098401#comment-15098401
 ] 

Tilman Hausherr commented on TIKA-1830:
---

{quote}
On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11. Are you sure 
that that affects 1.8.10? The discovery of that wouldn't have happened unless I 
was actually running 1.8.11. 
{quote}
Indeed, sorry.



[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098401#comment-15098401
 ] 

Tilman Hausherr edited comment on TIKA-1830 at 1/14/16 5:02 PM:


{quote}
On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11. Are you sure 
that that affects 1.8.10? The discovery of that wouldn't have happened unless I 
was actually running 1.8.11. 
{quote}
Indeed, sorry. Fixed.


was (Author: tilman):
{quote}
On PDFBOX-3193, you've set affected versions to 1.8.10 and 1.8.11. Are you sure 
that that affects 1.8.10? The discovery of that wouldn't have happened unless I 
was actually running 1.8.11. 
{quote}
Indeed, sorry.



[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098412#comment-15098412
 ] 

Tilman Hausherr commented on TIKA-1830:
---

Another possibility is that the change I mentioned has different implications 
depending on which JDK is used. By the way, these files don't have errors with 
the non-sequential parser.



[jira] [Comment Edited] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096866#comment-15096866
 ] 

Tilman Hausherr edited comment on TIKA-1830 at 1/14/16 5:05 PM:


I can't reproduce the difference for the file 074531.pdf. ExtractText returns 
identical results, which makes me doubt the entire test :-(

(edit: also 362980.pdf, 058103.pdf, and 760707.pdf)

I can reproduce the difference for 290377.pdf; this is because of a change in 
decompression (rev 1709182) that tries to squeeze as much as possible from 
corrupt streams.

There may be some differences due to a bugfix related to "article beads". This 
will mean improved results for files with correct beads, but worse results for 
files where bead rectangles are incorrect.


was (Author: tilman):
I can't reproduce the difference for the file 074531.pdf. ExtractText returns 
identical results, which makes me doubt the entire test :-(

I can reproduce the difference for 290377.pdf; this is because of a change in 
decompression (rev 1709182) that tries to squeeze as much as possible from 
corrupt streams.

There may be some differences due to a bugfix related to "article beads". This 
will mean improved results for files with correct beads, but worse results for 
files where bead rectangles are incorrect.



[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098418#comment-15098418
 ] 

Tilman Hausherr commented on TIKA-1830:
---

The line at {{BaseParser.java:1077}} is
{code}
COSInteger number = (COSInteger)po.remove( po.size() -1 );
{code}
po is never null, it is created earlier. Or would there be an NPE if 
{{po.remove}} returns null?
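
On the language side of that question: in Java, casting a null reference to a class type succeeds and simply yields null, so the cast alone cannot raise an NPE. The sketch below shows the three outcomes for a line of that shape. `Integer` stands in for `COSInteger`, and `NullCastDemo` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo; Integer plays the role of COSInteger.
class NullCastDemo {

    /** Mirrors the shape of the quoted line and reports what happened. */
    public static String outcome(List<Object> po) {
        try {
            Integer number = (Integer) po.remove(po.size() - 1);
            return number == null ? "null, no exception" : "value " + number;
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        List<Object> withNull = new ArrayList<>();
        withNull.add(null);
        System.out.println(outcome(withNull)); // the cast of null succeeds

        List<Object> wrongType = new ArrayList<>();
        wrongType.add("not a number");
        System.out.println(outcome(wrongType)); // ClassCastException

        // Empty list: remove(-1) fails with IndexOutOfBoundsException or a
        // subclass of it, depending on the JDK version.
        System.out.println(outcome(new ArrayList<>()));
    }
}
```

An NPE could only come from a later use of the resulting null reference, for example calling a method on `number`.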



[jira] [Commented] (TIKA-1830) Upgrade to PDFBox 1.8.11 when available

2016-01-14 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098503#comment-15098503
 ] 

Tilman Hausherr commented on TIKA-1830:
---

Not that, but the change I mentioned
https://svn.apache.org/viewvc?view=revision&sortby=date&revision=1709182
may play a role.



[jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms

2016-02-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149256#comment-15149256
 ] 

Tilman Hausherr commented on TIKA-1857:
---

Sorry, I have no experience with XFA. [~msahyoun] might know more.

> Enhance PDFParser to extract text from XFA forms
> 
>
> Key: TIKA-1857
> URL: https://issues.apache.org/jira/browse/TIKA-1857
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Pascal Essiembre
>Priority: Trivial
>  Labels: patch
> Fix For: 1.13
>
> Attachments: 041617_filled_out.pdf, xfa_in_govdocs1.txt
>
>
> Extract text from PDF Forms (XFA).  Information about XFA: 
> https://en.wikipedia.org/wiki/XFA





[jira] [Created] (TIKA-1989) Weird sentence in website

2016-05-28 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created TIKA-1989:
-

 Summary: Weird sentence in website
 Key: TIKA-1989
 URL: https://issues.apache.org/jira/browse/TIKA-1989
 Project: Tika
  Issue Type: Bug
  Components: documentation
Reporter: Tilman Hausherr


https://tika.apache.org/1.13/configuring.html
{quote}
To override some parser certain default behaviours, include the in your 
configuration, with excludes, then add other parser definitions in. To prevent 
the (with its auto-discovery) being used, simply omit it from your config, and 
list all other parsers you want instead.
{quote}
The sentence doesn't really make sense to me. In what? The what?





[jira] [Updated] (TIKA-1989) Weird sentence in website

2016-05-28 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1989:
--
Description: 
https://tika.apache.org/1.13/configuring.html
{quote}
To override some parser certain default behaviours, include the in your 
configuration, with excludes, then add other parser definitions in. To prevent 
the (with its auto-discovery) being used, simply omit it from your config, and 
list all other parsers you want instead.
{quote}
The sentence doesn't really make sense to me. In what? The what?

(I was trying to help this person
https://stackoverflow.com/questions/37476055/getting-classnotfound-exception-while-running-my-program
and look at the website)

  was:
https://tika.apache.org/1.13/configuring.html
{quote}
To override some parser certain default behaviours, include the in your 
configuration, with excludes, then add other parser definitions in. To prevent 
the (with its auto-discovery) being used, simply omit it from your config, and 
list all other parsers you want instead.
{quote}
The sentence doesn't really make sense to me. In what? The what?


> Weird sentence in website
> -
>
> Key: TIKA-1989
> URL: https://issues.apache.org/jira/browse/TIKA-1989
> Project: Tika
>  Issue Type: Bug
>  Components: documentation
>Reporter: Tilman Hausherr
>
> https://tika.apache.org/1.13/configuring.html
> {quote}
> To override some parser certain default behaviours, include the in your 
> configuration, with excludes, then add other parser definitions in. To 
> prevent the (with its auto-discovery) being used, simply omit it from your 
> config, and list all other parsers you want instead.
> {quote}
> The sentence doesn't really make sense to me. In what? The what?
> (I was trying to help this person
> https://stackoverflow.com/questions/37476055/getting-classnotfound-exception-while-running-my-program
> and look at the website)





[jira] [Updated] (TIKA-1989) Weird sentence in website

2016-05-28 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1989:
--
Description: 
https://tika.apache.org/1.13/configuring.html
{quote}
To override some parser certain default behaviours, include the in your 
configuration, with excludes, then add other parser definitions in. To prevent 
the (with its auto-discovery) being used, simply omit it from your config, and 
list all other parsers you want instead.
{quote}
The sentence doesn't really make sense to me. In what? The what?

(I was trying to help this person
https://stackoverflow.com/questions/37476055/getting-classnotfound-exception-while-running-my-program
and looked at the website)

  was:
https://tika.apache.org/1.13/configuring.html
{quote}
To override some parser certain default behaviours, include the in your 
configuration, with excludes, then add other parser definitions in. To prevent 
the (with its auto-discovery) being used, simply omit it from your config, and 
list all other parsers you want instead.
{quote}
The sentence doesn't really make sense to me. In what? The what?

(I was trying to help this person
https://stackoverflow.com/questions/37476055/getting-classnotfound-exception-while-running-my-program
and look at the website)


> Weird sentence in website
> -
>
> Key: TIKA-1989
> URL: https://issues.apache.org/jira/browse/TIKA-1989
> Project: Tika
>  Issue Type: Bug
>  Components: documentation
>Reporter: Tilman Hausherr
>
> https://tika.apache.org/1.13/configuring.html
> {quote}
> To override some parser certain default behaviours, include the in your 
> configuration, with excludes, then add other parser definitions in. To 
> prevent the (with its auto-discovery) being used, simply omit it from your 
> config, and list all other parsers you want instead.
> {quote}
> The sentence doesn't really make sense to me. In what? The what?
> (I was trying to help this person
> https://stackoverflow.com/questions/37476055/getting-classnotfound-exception-while-running-my-program
> and looked at the website)





[jira] [Commented] (TIKA-1298) testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6

2014-05-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1475#comment-1475
 ] 

Tilman Hausherr commented on TIKA-1298:
---

I strongly recommend shipping Tika with the non-sequential parser as the default.

> testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6
> 
>
> Key: TIKA-1298
> URL: https://issues.apache.org/jira/browse/TIKA-1298
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> Not sure why this is happening.  Test works with PDFBox 1.8.5 and Java 1.7; 
> and it works with PDFBox 1.8.4 and either Java 1.6 or Java 1.7.  I'll look 
> into this now.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1298) testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6

2014-05-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000823#comment-14000823
 ] 

Tilman Hausherr commented on TIKA-1298:
---

Yeah, blackmail and tit-for-tat deals! Not sure if _that_ is the Apache way, 
but I started looking into it immediately. I will write in that issue when I'm 
done with the tests.

> testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6
> 
>
> Key: TIKA-1298
> URL: https://issues.apache.org/jira/browse/TIKA-1298
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> Not sure why this is happening.  Test works with PDFBox 1.8.5 and Java 1.7; 
> and it works with PDFBox 1.8.4 and either Java 1.6 or Java 1.7.  I'll look 
> into this now.





[jira] [Commented] (TIKA-1298) testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6

2014-05-19 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001901#comment-14001901
 ] 

Tilman Hausherr commented on TIKA-1298:
---

No problem, I know it was a joke and thought it was pretty funny :-)

> testEmbeddedPDFEmbeddingAnotherDocument fails with PDFBox 1.8.5 and java 1.6
> 
>
> Key: TIKA-1298
> URL: https://issues.apache.org/jira/browse/TIKA-1298
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
>
> Not sure why this is happening.  Test works with PDFBox 1.8.5 and Java 1.7; 
> and it works with PDFBox 1.8.4 and either Java 1.6 or Java 1.7.  I'll look 
> into this now.





[jira] [Commented] (TIKA-1325) Move the font metadata definitions to properties

2014-06-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025205#comment-14025205
 ] 

Tilman Hausherr commented on TIKA-1325:
---

It is PDFBOX-2122, and I can do the change described there anytime, but I need 
a test. I'll be awake for at least 6 hours from now on and be at or near my 
computer. I'd do the change only if one of you people can test this quickly so 
I can revert this quickly if I'm wrong.

> Move the font metadata definitions to properties
> 
>
> Key: TIKA-1325
> URL: https://issues.apache.org/jira/browse/TIKA-1325
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 1.5, 1.6
>Reporter: Nick Burch
> Attachments: TIKA-1325_TimeZone.patch
>
>
> As noticed while working on TIKA-1182, the AFM font parser has a bunch of 
> hard coded strings it uses as metadata keys, while the TTF font parser 
> doesn't have many
> We should switch these to being proper Properties, with definitions from a 
> well known standard (+ compatibility fallbacks), and have both use largely 
> the same set





[jira] [Commented] (TIKA-1325) Move the font metadata definitions to properties

2014-06-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14025324#comment-14025324
 ] 

Tilman Hausherr commented on TIKA-1325:
---

PDFBOX-2122 has been fixed.

> Move the font metadata definitions to properties
> 
>
> Key: TIKA-1325
> URL: https://issues.apache.org/jira/browse/TIKA-1325
> Project: Tika
>  Issue Type: Improvement
>  Components: metadata, parser
>Affects Versions: 1.5, 1.6
>Reporter: Nick Burch
> Attachments: TIKA-1325_TimeZone.patch
>
>
> As noticed while working on TIKA-1182, the AFM font parser has a bunch of 
> hard coded strings it uses as metadata keys, while the TTF font parser 
> doesn't have many
> We should switch these to being proper Properties, with definitions from a 
> well known standard (+ compatibility fallbacks), and have both use largely 
> the same set





[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119
 ] 

Tilman Hausherr commented on TIKA-1300:
---

My impression was that the NSP had better results for good PDF files. I'm 
surprised that the old parser has fewer problems - but then, the first two files 
of the list had incorrect xref tables. The old parser just reads through the 
stuff even if the xref table is crap. I wonder if both parsers should work as a 
team, i.e. try the first one, and if there is an exception, try the 2nd one.

Anyway, when I'm bored, I'll have a look at the files in the list.

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Comment Edited] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119
 ] 

Tilman Hausherr edited comment on TIKA-1300 at 6/26/14 9:08 PM:


My impression was that the NSP had better results for good PDF files. I'm 
surprised that the old parser has fewer problems - but then, the first two files 
of the list had incorrect xref tables. The old parser just reads through the 
stuff even if the xref table is crap. I wonder if both parsers should work as a 
team, i.e. try the first one, and if there is an exception, try the 2nd one.

Anyway, when I'm bored, I'll have a look at the files in the list. Here's a 
first result: PDFBOX-2163. However this is independent of the parser that is 
used.


was (Author: tilman):
My impression was that the NSP had better results for good PDF files. I'm 
surprised that the old parser has less problems - but then, the first two files 
of the list had incorrect Xref tables. The old parser just reads through the 
stuff even if the xref table is crap. I wonder if both parsers should in a 
team, i.e. try the first one, and if there is an exception, try the 2nd one.

Anyway, when I'm bored, I'll have a look at the files in the list.

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Comment Edited] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-26 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045119#comment-14045119
 ] 

Tilman Hausherr edited comment on TIKA-1300 at 6/27/14 6:18 AM:


My impression was that the NSP had better results for good PDF files. I'm 
surprised that the old parser has fewer problems - but then, the first two files 
of the list had incorrect xref tables. The old parser just reads through the 
stuff even if the xref table is crap. I wonder if both parsers should work as 
a team, i.e. try the nonSequential one, and if there is an exception, try the 
old one.
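
The "team" idea can be sketched in a few lines. This is only an illustration, not PDFBox or Tika code: the class FallbackParse and the method firstSuccessful are made-up names, and each parser is represented by a plain java.util.concurrent.Callable so that nothing about the real parser APIs is assumed.

```java
import java.util.List;
import java.util.concurrent.Callable;

/** Hypothetical helper: tries several parse attempts in order. */
final class FallbackParse {

    /**
     * Runs each attempt in order and returns the first result that does not
     * throw. If all attempts fail, the first failure is rethrown and the
     * later failures are attached to it as suppressed exceptions.
     */
    static <T> T firstSuccessful(List<Callable<T>> attempts) throws Exception {
        if (attempts.isEmpty()) {
            throw new IllegalArgumentException("no attempts given");
        }
        Exception failure = null;
        for (Callable<T> attempt : attempts) {
            try {
                return attempt.call();
            } catch (Exception e) {
                if (failure == null) {
                    failure = e;               // remember the first failure
                } else if (e != failure) {
                    failure.addSuppressed(e);  // keep later failures for diagnosis
                }
            }
        }
        throw failure;
    }
}
```

The caller would pass the non-sequential attempt first and the old parser second. One caveat of such a scheme is that the input has to be re-readable (a file or a buffered stream), because the first attempt may already have consumed part of it.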

Anyway, when I'm bored, I'll have a look at the files in the list. Here's a 
first result: PDFBOX-2163. However this is independent of the parser that is 
used.


was (Author: tilman):
My impression was that the NSP had better results for good PDF files. I'm 
surprised that the old parser has less problems - but then, the first two files 
of the list had incorrect Xref tables. The old parser just reads through the 
stuff even if the xref table is crap. I wonder if both parsers should in a 
team, i.e. try the first one, and if there is an exception, try the 2nd one.

Anyway, when I'm bored, I'll have a look at the files in the list. Here's a 
first result: PDFBOX-2163. However this is independent of the parser that is 
used.

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-27 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046288#comment-14046288
 ] 

Tilman Hausherr commented on TIKA-1300:
---

I'm not doing much with text extraction, but what we could use (and sorry if 
that is what you already do) is a diff between versions, i.e. the 
extraction results are compared with a "current gold standard". And this could 
be done _with the snapshot versions_ of PDFBox and the other components you 
use. This way you would quickly notice if you get worse or better results, and 
wouldn't have to wait for a release to discover a regression.
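
A rough sketch of such a comparison, under the assumption that the extracted text per file is already available as strings keyed by file name. The class name ExtractionDiff and the whitespace-based token count are illustrative choices, not what the Tika tests actually do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical comparison of extraction results against a gold standard. */
final class ExtractionDiff {

    /** Counts whitespace-separated tokens; blank text counts as zero tokens. */
    static int countTokens(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Returns the names (sorted) of all files for which the candidate
     * extraction has fewer tokens than the gold-standard extraction.
     * A file missing from the candidate map is treated as empty text.
     */
    static List<String> regressions(Map<String, String> gold,
                                    Map<String, String> candidate) {
        List<String> worse = new ArrayList<>();
        for (Map.Entry<String, String> entry : new TreeMap<>(gold).entrySet()) {
            String now = candidate.getOrDefault(entry.getKey(), "");
            if (countTokens(now) < countTokens(entry.getValue())) {
                worse.add(entry.getKey());
            }
        }
        return worse;
    }
}
```

Running this against a snapshot build would give a short list of files to inspect manually. A lower token count is only a hint, of course; more tokens can also be a regression.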

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-27 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046728#comment-14046728
 ] 

Tilman Hausherr commented on TIKA-1300:
---

I had a look at most of the files. This resulted in PDFBOX-2163 (7 files, will 
be fixed in 1.8 this weekend) and PDFBOX-2167 (1 file). The rest are really 
broken, some of them so badly that even Acrobat can't open them. Many have 
incorrect xref tables. One has a broken LZW stream so that even Acrobat 
displays just a part of the text. One I believe I've seen before (I think 
brought up by William Palmer); it has a PDF stream that had two threads writing 
to it at the same time.

Yes, TIKA-1205 should be done.

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-28 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14046896#comment-14046896
 ] 

Tilman Hausherr commented on TIKA-1300:
---

[~talli...@mitre.org] are there any "rules" about the amount of downloading from 
that digitalcorpora site? I don't see any.

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Commented] (TIKA-1300) Switch default PDFBox parser to NonSequentialParser

2014-06-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047095#comment-14047095
 ] 

Tilman Hausherr commented on TIKA-1300:
---

{quote}
Make sure to delete handful of infected files
{quote}
I hope that current antivirus software detects these files. Is this on purpose 
from that "digitalcorpora" site, or were these (government) files already 
infected at the time they were collected?

> Switch default PDFBox parser to NonSequentialParser
> ---
>
> Key: TIKA-1300
> URL: https://issues.apache.org/jira/browse/TIKA-1300
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Fix For: 1.7
>
> Attachments: tika_1_6_ClassicsVsNonSeq.zip
>
>
> On TIKA-1298, [~tilman] recommended switching Tika's default to the 
> NonSequentialParser. We added a parameter to use the NonSequentialParser in 
> TIKA-1201, and there's some good discussion there about the benefits.
> Is the community in favor of switching the default now?





[jira] [Created] (TIKA-1372) PDCheckbox NPE

2014-07-22 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created TIKA-1372:
-

 Summary: PDCheckbox NPE
 Key: TIKA-1372
 URL: https://issues.apache.org/jira/browse/TIKA-1372
 Project: Tika
  Issue Type: Bug
Reporter: Tilman Hausherr


One of your users, [~mdhussain], opened PDFBOX-2218:

PDF parsing fails for attached PDF.
Stack trace of failure:
{code}
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1747c
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at com.sabax.extraction.FileExtractionHandler.getFileData(FileExtractionHandler.java:145)
    at GenerateIndex.main(GenerateIndex.java:59)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
Caused by: java.lang.NullPointerException
    at org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.getOnValue(PDCheckbox.java:141)
    at org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.isChecked(PDCheckbox.java:79)
    at org.apache.pdfbox.pdmodel.interactive.form.PDRadioCollection.getValue(PDRadioCollection.java:128)
    at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:507)
    at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:461)
    at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:479)
    at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:447)
    at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:195)
    at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:341)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    ... 9 more
{code}

Sample code used to parse:
{code}
TikaInputStream tikaStream = TikaInputStream.get(stream);
TikaResultWrapper result;
try {
  long streamSize = tikaStream.getLength();
  Metadata metadata =
      constructMetadata(fileName, mimeType, streamSize);
  if (streamSize < maxFileSize) {
    SamplingSaxHandler handler =
        new SamplingSaxHandler(samplingSize, metadata);
    handler.setBufferLimit(bufferSize);
    parser.parse(tikaStream, handler, metadata, new ParseContext());
    result = handler.getResult();
  } else {
    result = new TikaResultWrapper(null, metadata);
  }
} finally {
  tikaStream.close();
}
return result;
{code}






[jira] [Commented] (TIKA-1372) PDCheckbox NPE

2014-07-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070847#comment-14070847
 ] 

Tilman Hausherr commented on TIKA-1372:
---

IMHO the cause is TIKA not doing some null checks for fields without a name and 
without a value. See the attached files in PDFBOX-2218, especially the 
printfields output.
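
To illustrate the kind of null checks meant here, a minimal sketch follows. It deliberately does not use the PDFBox or Tika classes: FormField is a hypothetical stand-in interface, and FieldText.describe is a made-up method showing how a field without a name and/or without a value could be rendered without a NullPointerException.

```java
/** Hypothetical sketch of defensive handling for form fields. */
final class FieldText {

    /** Stand-in for a PDF form field; this is NOT the PDFBox API. */
    interface FormField {
        String getName();   // may return null
        String getValue();  // may return null
    }

    /**
     * Renders a field as "name: value", leaving out whatever is missing.
     * Returns an empty string for a null field or a field with neither
     * a name nor a value.
     */
    static String describe(FormField field) {
        if (field == null) {
            return "";
        }
        String name = field.getName();
        String value = field.getValue();
        StringBuilder sb = new StringBuilder();
        if (name != null) {
            sb.append(name);
        }
        if (name != null && value != null) {
            sb.append(": ");
        }
        if (value != null) {
            sb.append(value);
        }
        return sb.toString();
    }
}
```

Note that in the stack trace above the exception is thrown inside the getValue() call itself, so checks in the caller alone may not be enough for that particular file.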

> PDCheckbox NPE
> --
>
> Key: TIKA-1372
> URL: https://issues.apache.org/jira/browse/TIKA-1372
> Project: Tika
>  Issue Type: Bug
>Reporter: Tilman Hausherr
>
> One of your users, [~mdhussain], opened PDFBOX-2218:
> PDF parsing fails for attached PDF.
> Stack trace of failure:
> {code}
> Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@1747c
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>   at com.sabax.extraction.FileExtractionHandler.getFileData(FileExtractionHandler.java:145)
>   at GenerateIndex.main(GenerateIndex.java:59)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>   at java.lang.reflect.Method.invoke(Unknown Source)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> Caused by: java.lang.NullPointerException
>   at org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.getOnValue(PDCheckbox.java:141)
>   at org.apache.pdfbox.pdmodel.interactive.form.PDCheckbox.isChecked(PDCheckbox.java:79)
>   at org.apache.pdfbox.pdmodel.interactive.form.PDRadioCollection.getValue(PDRadioCollection.java:128)
>   at org.apache.tika.parser.pdf.PDF2XHTML.addFieldString(PDF2XHTML.java:507)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:461)
>   at org.apache.tika.parser.pdf.PDF2XHTML.processAcroField(PDF2XHTML.java:479)
>   at org.apache.tika.parser.pdf.PDF2XHTML.extractAcroForm(PDF2XHTML.java:447)
>   at org.apache.tika.parser.pdf.PDF2XHTML.endDocument(PDF2XHTML.java:195)
>   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:341)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   ... 9 more
> {code}
> Sample code used to parse:
> {code}
> TikaInputStream tikaStream = TikaInputStream.get(stream);
> TikaResultWrapper result;
> try {
>   long streamSize = tikaStream.getLength();
>   Metadata metadata =
>       constructMetadata(fileName, mimeType, streamSize);
>   if (streamSize < maxFileSize) {
>     SamplingSaxHandler handler =
>         new SamplingSaxHandler(samplingSize, metadata);
>     handler.setBufferLimit(bufferSize);
>     parser.parse(tikaStream, handler, metadata, new ParseContext());
>     result = handler.getResult();
>   } else {
>     result = new TikaResultWrapper(null, metadata);
>   }
> } finally {
>   tikaStream.close();
> }
> return result;
> {code}





[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145061#comment-14145061
 ] 

Tilman Hausherr commented on TIKA-1419:
---

Thanks for running these tests. Would it be possible, next time, to run them 
before the release is cut? Andreas usually announces in advance when he's 
planning to make a new release. This would allow us to fix regressions before 
the release.

> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.





[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145399#comment-14145399
 ] 

Tilman Hausherr commented on TIKA-1419:
---

Maybe you could create a project for GSoC2015 about TIKA-1302. My own 
investment as a mentor was about 10-15 hours a week. It could probably be less 
for you; I spent a lot of time creating tests with PostScript (while knowing 
nothing about it, LOL) and keeping a very close eye on the code. But the result 
was worth it: the code is now in PDFBox 1.8.7 (nothing relevant to Tika).

Anyway, what I could do is email you when Andreas seems to be ready, because 
your tests are very valuable for us. There's a lot of traffic on the list 
sometimes.

> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.





[jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-27 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1419:
--
Attachment: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx

Here's an Excel file; in the new column on the right I noted which files 
improved by solving the three related PDFBox issues above. I mostly tested the 
files that had fewer tokens. I tested a few that had more tokens; there the 
results are inconclusive. Some have improved, some had more tokens due to a 
regression that has been solved now.

Would it be possible, next time, to test with the same set of files, and 
test not 1.8.8 against 1.8.7, but rather 1.8.8 against 1.8.6? The reason is 
that if there's an unknown regression in 1.8.7 and it isn't solved, 1.8.8 
would look as if it had the same quality as 1.8.7, although it would still be 
worse than 1.8.6.

> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv, 
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.





[jira] [Commented] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-09-29 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152855#comment-14152855
 ] 

Tilman Hausherr commented on TIKA-1419:
---

Comparing PDFBox's trunk against 1.8.x periodically would make sense, of course. 
There's a comment in PDFBOX-2377, "the current trunk extracts nothing but 
rubbish from 705042.pdf", so this makes me wonder what else has been "lost" in 
the trunk.

Re checking 1.8.8 vs. 1.8.6: if it isn't too much work, as soon as you have the 
time, even if there isn't a new release planned now. The regression you found 
is very embarrassing, and it is the first time I've realized that a wrong 
decision in the recognition of inline images (detecting whether "EI" is within 
an image or is the end of the image) results in cut-off text extraction.

> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv, 
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1427) PDF Images don't appear in structured view

2014-10-09 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165652#comment-14165652
 ] 

Tilman Hausherr commented on TIKA-1427:
---

The first image ("Im1") is painted with "q 433 0 0 324.95 81.1 369.02 cm /Im1 
Do Q". The second "image" is about 99% of that (huge) stream and it is just 
a lot of lines and shapes. There's no way to save it. You answered your 
question yourself :-)

(And there's also no inline image in the stream, and even if there were, we 
don't save these in 1.8)

> PDF Images don't appear in structured view
> --
>
> Key: TIKA-1427
> URL: https://issues.apache.org/jira/browse/TIKA-1427
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
>Reporter: James Baker
>Assignee: Tim Allison
>  Labels: pdf
> Attachments: images_test.pdf
>
>
> When viewing, say, a Word Document, any images appear in the 'structured 
> view' of the document as <img> tags. The same is not true of PDF documents, 
> and we lose both the fact that there is an image present, and where it is in 
> the document.
> Some discussion of this issue in the comments of TIKA-1396.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1419) Upgrade to PDFBox 1.8.7

2014-10-09 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1419:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

Thank you [~talli...@apache.org], here's the result of some manual analysis. 
The good news is that I found a few improvements, only two regressions, and 
no case of "smaller results" like with 1.8.7. Here are some suggestions for how 
the automatic analysis could be improved:

- use a dictionary, or maybe just count a few common English words with at least 
three characters ( https://en.wikipedia.org/wiki/Most_common_words_in_English 
), i.e. to ignore files that are mostly made of trash (although the trash 
changes)
- delete files from the test set that are known to be corrupt, or that won't 
yield any useful text even in Adobe Reader, so that the manual investigation 
isn't repeated each time.

I analysed only cases where there were no exceptions. Within the next few days, 
I'll investigate some of the cases where there are still exceptions; however, 
most of these are corrupt files that even Adobe Reader doesn't display.
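
The common-word idea from the first suggestion could look roughly like this. It is only a sketch: the class name JunkTextCheck, the ten sample words and the threshold passed by the caller are assumptions for illustration, and it only makes sense for English text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical heuristic for spotting "trash" extraction results. */
final class JunkTextCheck {

    // A small illustrative sample of very common English words with at
    // least three characters; a real list would be longer.
    private static final Set<String> COMMON = new HashSet<>(Arrays.asList(
            "the", "and", "that", "have", "for",
            "not", "with", "you", "this", "but"));

    /** Returns the share of tokens (runs of a-z) that are common words. */
    static double commonWordRatio(String text) {
        int total = 0;
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;  // split() can yield an empty leading token
            }
            total++;
            if (COMMON.contains(token)) {
                hits++;
            }
        }
        return total == 0 ? 0.0 : (double) hits / total;
    }

    /** Flags text whose common-word ratio is below the given minimum. */
    static boolean looksLikeJunk(String text, double minRatio) {
        return commonWordRatio(text) < minRatio;
    }
}
```

Files flagged this way in both the old and the new version could be skipped in the comparison, so that changing trash does not show up as a difference.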

> Upgrade to PDFBox 1.8.7
> ---
>
> Key: TIKA-1419
> URL: https://issues.apache.org/jira/browse/TIKA-1419
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Minor
> Attachments: compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.csv, 
> compare_Tika-trunk-1.7_w_PDFBox1.8.6Vs.1.8.7.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOT.zip
>
>
> Will run against govdocs1 early next week and then upgrade if no major 
> regressions are found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-10 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167194#comment-14167194
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Do you want the junk list in some format? Just the six digits, or the directory 
too?
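
Either format carries the same information if, as in the paths quoted elsewhere in this thread (for example 661/661834.pdf), the directory is simply the first three digits of the file name. A minimal sketch under that assumption, with a hypothetical class and method names:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for handling a junk list; not part of Tika or PDFBox.
class JunkList {
    // Builds "661/661834.pdf" from "661834", assuming the directory is the
    // first three digits of the six-digit file name.
    static String toRelativePath(String sixDigits) {
        if (!sixDigits.matches("\\d{6}")) {
            throw new IllegalArgumentException("expected six digits: " + sixDigits);
        }
        return sixDigits.substring(0, 3) + "/" + sixDigits + ".pdf";
    }

    // Reduces either form ("661834" or "661/661834.pdf") to the six digits.
    static String toId(String entry) {
        String name = entry.substring(entry.lastIndexOf('/') + 1);
        return name.endsWith(".pdf") ? name.substring(0, name.length() - 4) : name;
    }

    public static void main(String[] args) {
        Set<String> junk = new HashSet<String>(Arrays.asList("661834", "565010"));
        System.out.println(toRelativePath("661834"));
        System.out.println(junk.contains(toId("565/565010.pdf")));
    }
}
```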

Because this manual checking takes a lot of time, I'm planning to download the 
entire directory and store the PDFs only.

We should agree on criteria for exclusion. Suggestion:
- files that, in some places, don't display in Adobe Reader (this applies to 
most, if not all, of the files that have exceptions with LZW or Flate)
- files that do display, but yield only junk when doing copy & paste in Adobe 
Reader

Re 1.8.8: yes, I'm obviously unhappy with 1.8.7. But the token comparison should 
first be improved so that we're really sure not to have major regressions. 
(Although I'm optimistic, based on your tests that show only one smaller 
regression.)

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-15 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14172978#comment-14172978
 ] 

Tilman Hausherr commented on TIKA-1442:
---

files that have only junk as text with AR:

661/661834.pdf
565/565010.pdf
248/248787.pdf
979/979474.pdf
831/831528.pdf
638/638488.pdf
878/878499.pdf
503/503035.pdf
289/289669.pdf

file that has a possible virus:
345/345947.pdf (wasn't in the last test set)

files that have an error when opening with AR (although they can be displayed):
092/092919.pdf
435/435321.pdf
995/995773.pdf
078/078278.pdf
210/210260.pdf
219/219789.pdf
230/230877.pdf
268/268554.pdf
367/367594.pdf
392/392154.pdf
475/475121.pdf
477/477047.pdf
551/551464.pdf
615/615614.pdf
707/707505.pdf
714/714002.pdf
738/738627.pdf
819/819127.pdf
101/101819.pdf
359/359872.pdf
523/523690.pdf

Surprisingly, some files with LZW errors do display with AR without an error 
message. Either AR keeps quiet about it, or there is still a bug in the LZW 
decoder. Both are possible: AR doesn't show every error, and the PDFBox 
LZW decoder is 
[tricky|https://issues.apache.org/jira/issues/?jql=labels%20%3D%20LZW].

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-15 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx)

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983
 ] 

Tilman Hausherr commented on TIKA-1442:
---

After some more research, I was able to decode 5 more files (the cause was not 
the LZW filter, see ). However 7 other files are really corrupt, portions of 
the files are blank when shown in AR:

115/115269.pdf
211/211876.pdf
268/268346.pdf
389/389474.pdf
443/443752.pdf
698/698813.pdf
846/846759.pdf

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-16 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180302#comment-14180302
 ] 

Tilman Hausherr commented on TIKA-1442:
---

{quote}
and recommend other statistics that would be useful for file comparison
{quote}
I'd like to get the full exception like you had before. This time there's only 
the first line. For example, the file 272372.pdf had a problem with metadata 
(text extracts fine) that I thought I had fixed, and now there's an exception 
again and I wonder where.

Did you use the latest 1.8.8 version or the same version as last time but with 
the new statistics?

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180440#comment-14180440
 ] 

Tilman Hausherr commented on TIKA-1442:
---

What's also missing this time is the token count, which was there last time.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180302#comment-14180302
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 10/22/14 8:06 PM:
-

{quote}
and recommend other statistics that would be useful for file comparison
{quote}
-I'd like to get the full exception like you had before. This time there's only 
the first line. For example, the file 272372.pdf had a problem with meta data 
(text extracts fine) that I thought I had fixed, and not there's again an 
exception and I wonder where.-

Did you use the latest 1.8.8 version or the same version as last time but with 
the new statistics?


was (Author: tilman):
{quote}
and recommend other statistics that would be useful for file comparison
{quote}
I'd like to get the full exception like you had before. This time there's only 
the first line. For example, the file 272372.pdf had a problem with meta data 
(text extracts fine) that I thought I had fixed, and not there's again an 
exception and I wonder where.

Did you use the latest 1.8.8 version or the same version as last time but with 
the new statistics?

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180446#comment-14180446
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Sorry, ignore my text re: 1st line only. It's all there.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180469#comment-14180469
 ] 

Tilman Hausherr commented on TIKA-1442:
---

{quote}
Should I add token count? 
{quote}
Yes please... in theory, a "missing" page could have tokens similar to those 
in a non-missing part, so the unique token count would not change.
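
A small sketch of why both numbers are useful. The whitespace tokenizer, the class name, and the sample text are assumptions for illustration only and are not the tokenization used in the actual comparison:

```java
import java.util.Arrays;
import java.util.HashSet;

// Hypothetical illustration; not the tokenizer used for the spreadsheets.
class TokenCounts {
    // Splits on whitespace; an empty or blank text yields no tokens.
    static String[] tokenize(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Total number of tokens, counting repeats.
    static int totalTokens(String text) {
        return tokenize(text).length;
    }

    // Number of distinct tokens.
    static int uniqueTokens(String text) {
        return new HashSet<String>(Arrays.asList(tokenize(text))).size();
    }

    public static void main(String[] args) {
        String bothPages = "annual report page annual report page";
        String onePageMissing = "annual report page";
        // The unique count is identical, only the total count reveals the loss.
        System.out.println(uniqueTokens(bothPages) + " vs " + uniqueTokens(onePageMissing));
        System.out.println(totalTokens(bothPages) + " vs " + totalTokens(onePageMissing));
    }
}
```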

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180636#comment-14180636
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Which are the top10words? I ask because 554/554384.pdf has only five of them.

I've now found a strategy... First, I've added a new column that divides the 
new word count by the old word count. If the result is smaller than 1, treat 
it as suspicious, but not if both have zero top10words. The file I mention has 
5 (0 before), so the file has improved and is not a regression.

Another strategy would be to look for files with fewer top10words; this would 
likely be a regression. I will probably add a column with a formula for that one.
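
Expressed as code rather than spreadsheet formulas, the two checks might look like the sketch below. The class and method names are hypothetical, and treating an old word count of zero as "not suspicious" is an assumption added to avoid dividing by zero:

```java
// Hypothetical sketch of the two spreadsheet checks; not part of Tika or PDFBox.
class RegressionCheck {
    // First strategy: new word count divided by old word count is below 1,
    // unless both versions have zero top10words.
    static boolean isSuspicious(int oldWords, int newWords, int oldTop10, int newTop10) {
        if (oldTop10 == 0 && newTop10 == 0) {
            return false;
        }
        if (oldWords == 0) {
            // Assumption: nothing was extracted before, so nothing can be lost.
            return false;
        }
        return (double) newWords / oldWords < 1.0;
    }

    // Second strategy: the new version has fewer top10words than the old one.
    static boolean hasFewerTop10Words(int oldTop10, int newTop10) {
        return newTop10 < oldTop10;
    }

    public static void main(String[] args) {
        System.out.println(isSuspicious(100, 80, 3, 3));
        System.out.println(isSuspicious(100, 80, 0, 0));
        System.out.println(hasFewerTop10Words(0, 5));
    }
}
```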

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-22 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14180687#comment-14180687
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Or does the top10words mean how many stop words are in the top 10 list of 
words? (that are in another column)

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extract permission, but its line in the Excel file 
does have tokens, which is, uh, surprising.

With the "old" parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181779#comment-14181779
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 10/23/14 7:31 PM:
-

Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf. This file is 
encrypted and has no text extract permission, but its line in the Excel file 
does have tokens, which is, uh, surprising.

With the "old" parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.


Same for 892/892859.pdf


was (Author: tilman):
Thanks!

I'm slowly starting, and here's the first thing: 892/892848.pdf, this file is 
encrypted and has no text extract permission. But the line in the excel file 
does have tokens, which is, uh, surprising.

With the "old" parser, use this code, because files are sometimes encrypted 
with the empty password:
{code}
if( document.isEncrypted() )
{
    try
    {
        StandardDecryptionMaterial sdm = new StandardDecryptionMaterial("");
        document.openProtection(sdm);
    }
    catch( InvalidPasswordException e )
    {
        System.err.println( "Error: The document is encrypted." );
    }
}
{code}
The nonSeq parser does this automatically.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181813#comment-14181813
 ] 

Tilman Hausherr commented on TIKA-1442:
---

The directory structure isn't a problem for me; I've downloaded all PDF files 
locally into a flat directory. Currently I'm still checking the files by hand, 
but I'll probably write a small script to extract and render with the different 
versions.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip

I'm done now; the result is two new issues, PDFBOX-2448 and PDFBOX-2449. 
However, PDFBOX-2448 isn't relevant to 1.8.8.

Many changes are positive ones: files that no longer throw an exception, or 
files that have better text extraction.


> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-23 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182047#comment-14182047
 ] 

Tilman Hausherr commented on TIKA-1442:
---

A few files have less metadata than before:
019/019837.pdf
138/138155.pdf
221/221001.pdf
224/224644.pdf
308/308233.pdf
469/469387.pdf
490/490345.pdf
490/490344.pdf
597/597244.pdf
643/643910.pdf

Could you tell me what you get in Tika for the first one?

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-10-24 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14173983#comment-14173983
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 10/24/14 11:02 AM:
--

After some more research, I was able to decode 5 more files (the cause was not 
the LZW filter, see PDFBOX-2296, but I fixed this only in 2.0). However 7 other 
files are really corrupt, portions of the files are blank when shown in AR:

115/115269.pdf
211/211876.pdf
268/268346.pdf
389/389474.pdf
443/443752.pdf
698/698813.pdf
846/846759.pdf


was (Author: tilman):
After some more research, I was able to decode 5 more files (the cause was not 
the LZW filter, see ). However 7 other files are really corrupt, portions of 
the files are blank when shown in AR:

115/115269.pdf
211/211876.pdf
268/268346.pdf
389/389474.pdf
443/443752.pdf
698/698813.pdf
846/846759.pdf

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.7
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1467) pdf:encrypted:false with encrypted pdf

2014-11-07 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202456#comment-14202456
 ] 

Tilman Hausherr commented on TIKA-1467:
---

The old and the new parser have different approaches to decryption. In the old 
one, you have to decrypt yourself with openProtection(). With the new one, you 
pass the password (no password = empty password) to loadNonSeq and it is done 
immediately. So the document is no longer encrypted when loadNonSeq() returns. 
I don't know how to find out whether it was encrypted. [~lehmi] any idea?

> pdf:encrypted:false with encrypted pdf
> --
>
> Key: TIKA-1467
> URL: https://issues.apache.org/jira/browse/TIKA-1467
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: $java -version
> java version "1.6.0_25"
> Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
> Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)
>Reporter: Thomas Ledoux
>
> When extracting metadata from the encryption_noprinting.pdf file found in the 
> pdfCabinetOfHorrors 
> (https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors)
> $java -jar tika-app-1.7-20141105.092424-471.jar -j encryption_noprinting.pdf
> We get a 
> INFO - Document is encrypted
> but the resulting JSON has : "pdf:encrypted":"false"
> Looking at the PDFParser, it seems that the first information comes when 
> reading the PDF, but when the metadata is retrieved the PDF is no longer 
> encrypted... the encryption fact should be retained and added to the metadata.





[jira] [Comment Edited] (TIKA-1467) pdf:encrypted:false with encrypted pdf

2014-11-07 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202456#comment-14202456
 ] 

Tilman Hausherr edited comment on TIKA-1467 at 11/7/14 10:22 PM:
-

The old and the new parser have different approaches to decryption. In the old 
one, you have to decrypt yourself with openProtection(). With the new one, you 
pass the password (no password = empty password) to loadNonSeq and it is done 
immediately. So the document is no longer encrypted when loadNonSeq() returns. 
I don't know how to find out whether it was encrypted. [~lehmi] any idea?


was (Author: tilman):
The old and the new parser have different approaches to decryption. In the old 
one, you have to decrypt yourself with openProtection(). With the new one, you 
pass the password (no password = empty password) to loadNonSeq and it done 
immediately. So the document is no longer encrypted when loadNonSeq() returns. 
I don't know how to find out whether it was encrypted. [~lehmi] any idea?

> pdf:encrypted:false with encrypted pdf
> --
>
> Key: TIKA-1467
> URL: https://issues.apache.org/jira/browse/TIKA-1467
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: $java -version
> java version "1.6.0_25"
> Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
> Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)
>Reporter: Thomas Ledoux
>
> When extracting metadata from the encryption_noprinting.pdf file found in the 
> pdfCabinetOfHorrors 
> (https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors)
> $java -jar tika-app-1.7-20141105.092424-471.jar -j encryption_noprinting.pdf
> We get a 
> INFO - Document is encrypted
> but the resulting JSON has : "pdf:encrypted":"false"
> Looking at the PDFParser, it seems that the first information comes when 
> reading the PDF, but when the metadata is retrieved the PDF is no longer 
> encrypted... the encryption fact should be retained and added to the metadata.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
throws an IOException saying that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

890238.pdf is a regression, but only with the old parser. 
(IllegalBlockSizeException). I think this one was mentioned elsewhere.

474863.pdf is also a regression (IllegalBlockSizeException), with both parsers.

more to come...





> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 11/25/14 8:38 PM:
-

Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
throws an IOException saying that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

890238.pdf is a regression, but only with the old parser. 
(IllegalBlockSizeException). I think this one was mentioned elsewhere.

474863.pdf is also a regression (IllegalBlockSizeException), with both parsers; 
I just created PDFBOX-2522.

That's it... besides that, no surprises.


was (Author: tilman):
Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
brings an IOException that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

890238.pdf is a regression, but only with the old parser. 
(IllegalBlockSizeException). I think this one was mentioned elsewhere.

474863.pdf is also a regression (IllegalBlockSizeException), with both parsers.

more to come...





> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_6VPDFBox_1_8_8-b145.zip

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225283#comment-14225283
 ] 

Tilman Hausherr commented on TIKA-1442:
---

[~talli...@apache.org] I'm really wondering why you'd get any extracted text 
from e.g. 717226.pdf, because it has no extract permission. The permissions in 
PDF files are only enforced by the application (here, PDFBox); the text 
information isn't stored separately in encrypted form.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 11/25/14 10:08 PM:
--

Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
throws an IOException saying that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

890238.pdf is a regression, but only with the old parser. 
(IllegalBlockSizeException). I think this one was mentioned elsewhere.

474863.pdf is also a regression (IllegalBlockSizeException), -with both 
parsers-, I just created PDFBOX-2522.

That's it... besides that, no surprises.


was (Author: tilman):
Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
brings an IOException that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

890238.pdf is a regression, but only with the old parser. 
(IllegalBlockSizeException). I think this one was mentioned elsewhere.

474863.pdf is also a regression (IllegalBlockSizeException), with both parsers, 
I just created PDFBOX-2522.

That's it... besides that, no surprises.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225008#comment-14225008
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 11/25/14 11:08 PM:
--

Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
throws an IOException saying that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

That's it... besides that, no surprises.


was (Author: tilman):
Thanks Tim!

892848.pdf and 892859.pdf should return nothing because they have no extract 
permission, yet they have 1000s of tokens in the table? PDFBox ExtractText 
brings an IOException that there is no text extraction permission.

357567.pdf, 267739.pdf and 686183.pdf are unfixed regressions PDFBOX-2421 and 
PDFBOX-2449.

Not PDFs:
196/196578.pdf
371/371231.pdf
879/879483.pdf
892/892042.pdf

890238.pdf is a regression, but only with the old parser. 
(IllegalBlockSizeException). I think this one was mentioned elsewhere.

474863.pdf is also a regression (IllegalBlockSizeException), -with both 
parsers-, I just created PDFBOX-2522.

That's it... besides that, no surprises.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-25 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225867#comment-14225867
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Ok, will do.
About the seq vs. nonSeq test: this will take some more time to understand. 
I've already opened PDFBOX-2523 for the problem that comes up most often.
However, I've also seen files where the non-sequential parser has one page more, 
e.g. 535691.pdf, 352706.pdf and 212019.pdf.
About "testing the full 250k": Hmmm, not now. Most, if not all, of the 
differences will be similar to the ones found in the current subset, of the 
kind I already have, e.g. pages with trash text extraction where the trash is 
different.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Created] (TIKA-1489) PDF Text extraction without permission

2014-11-25 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created TIKA-1489:
-

 Summary: PDF Text extraction without permission
 Key: TIKA-1489
 URL: https://issues.apache.org/jira/browse/TIKA-1489
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Tilman Hausherr


In TIKA-1442, text extraction works on files like 717226.pdf that don't have 
text extraction permission. The permissions in PDF files are only enforced by 
the application (here, PDFBox); the text information isn't stored 
separately in encrypted form. 

The PDFBox ExtractText command line does throw an exception for such files, 
so I wonder why TIKA is able to extract text. Either TIKA itself or the PDFBox 
call it uses bypasses the permission checking.





[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

2014-11-26 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14226500#comment-14226500
 ] 

Tilman Hausherr commented on TIKA-1489:
---

No, permissions are connected to encryption. Encrypted files have two 
passwords: the user password and the owner password. The user password, if 
correct (it is often empty), allows viewing the file but is subject to 
permission restrictions, which very often forbid extracting the text. The 
owner password allows you to "do everything".

Tika's PDF2XHTML.java doesn't have any check for permissions, and neither does 
its parent class PDFTextStripper. Oh, oh.
{quote}Again, if I understand correctly, Tilman Hausherr's point is that 
applications have a responsibility to respect the document's desired access 
irrespective of encryption.{quote}
That is correct.
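
The check described above (permissions apply only to encrypted files, the owner password lifts all restrictions, the user password is subject to the extract permission) could be sketched as follows. This is a minimal illustration, not the code of Tika or PDFBox: ExtractionGuard, DocumentAccess and the exception message are hypothetical names chosen for the example.

```java
import java.io.IOException;

/** Sketch of a permission check to run before text extraction. All names are hypothetical. */
final class ExtractionGuard {

    /** Hypothetical summary of what a parser knows about an opened PDF. */
    static final class DocumentAccess {
        final boolean encrypted;
        final boolean openedWithOwnerPassword;
        final boolean canExtractContent;

        DocumentAccess(boolean encrypted, boolean openedWithOwnerPassword,
                       boolean canExtractContent) {
            this.encrypted = encrypted;
            this.openedWithOwnerPassword = openedWithOwnerPassword;
            this.canExtractContent = canExtractContent;
        }
    }

    /** Returns true if an application honoring the permissions may extract text. */
    static boolean isExtractionAllowed(DocumentAccess doc) {
        if (!doc.encrypted) {
            return true; // permissions are tied to encryption
        }
        if (doc.openedWithOwnerPassword) {
            return true; // the owner password allows everything
        }
        return doc.canExtractContent; // user password: honor the extract permission
    }

    /** Throws before extraction starts if the document does not permit it. */
    static void checkExtractionAllowed(DocumentAccess doc) throws IOException {
        if (!isExtractionAllowed(doc)) {
            throw new IOException("No permission to extract text from this document");
        }
    }
}
```

A caller would run checkExtractionAllowed once, right after opening the document and before handing it to the text stripper.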


> PDF Text extraction without permission
> --
>
> Key: TIKA-1489
> URL: https://issues.apache.org/jira/browse/TIKA-1489
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Tilman Hausherr
>
> In TIKA-1442 text extraction from files like 717226.pdf that don't have text 
> extraction permission works. The permissions in PDF files are only enforced 
> by the application (i.e. PDFBox), i.e. the text information isn't stored 
> separately in encrypted form. 
> PDFBox ExtractText command line does throw an exception.
> So I wonder why TIKA is able to extract text. Either TIKA or the PDFBox call 
> used bypasses the permission checking.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-29 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

Here's my evaluation of the test. I wasn't finished, but it would be nice to 
use my comments in the next test.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Updated] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr updated TIKA-1442:
--
Attachment: (was: PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx)

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-30 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14228968#comment-14228968
 ] 

Tilman Hausherr edited comment on TIKA-1442 at 11/30/14 10:49 PM:
--

Here's my evaluation of the test. I didn't test all files, but I think the 
issues found apply to several files. It would be nice to use my comments in the 
next test.


was (Author: tilman):
Here's my evaluation of the test. I wasn't finished, but it would be nice to 
use my comments in the next test.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1489) PDF Text extraction without permission

2014-12-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230193#comment-14230193
 ] 

Tilman Hausherr commented on TIKA-1489:
---

[~talli...@mitre.org] I can't tell you what to do. My concern is more about a 
(possible) legal problem, about having a good relationship with Adobe, and 
about not being positioned as a cracking application. But I'm not going to 
snitch :-)

About permissions:
CanAssembleDocument
CanExtractContent
CanExtractForAccessibility
CanFillInForm
CanModify
CanModifyAnnotations
CanPrint
CanPrintDegraded
ReadOnly
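
For reference, most of the entries in this list correspond to bits of the permission flags integer stored in an encrypted PDF. The sketch below decodes such an integer using the 1-based bit positions from the PDF specification. The class and method names are hypothetical, and ReadOnly is left out because, as far as I know, it is not one of the specification's permission bits. Pairing bit 12 (print quality) with the CanPrintDegraded entry is my reading of the list, not something stated in the comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative decoder for PDF permission flags. Names are hypothetical. */
final class PdfPermissionBits {
    // Bit positions are 1-based, following the PDF specification's numbering.
    static final int PRINT = 3;
    static final int MODIFY = 4;
    static final int EXTRACT = 5;
    static final int MODIFY_ANNOTATIONS = 6;
    static final int FILL_IN_FORM = 9;
    static final int EXTRACT_FOR_ACCESSIBILITY = 10;
    static final int ASSEMBLE_DOCUMENT = 11;
    static final int PRINT_QUALITY = 12;

    /** Tests a single 1-based permission bit. */
    static boolean isSet(int flags, int bit) {
        return (flags & (1 << (bit - 1))) != 0;
    }

    /** Decodes the flags into the names used in the list above (ReadOnly omitted). */
    static Map<String, Boolean> decode(int flags) {
        Map<String, Boolean> result = new LinkedHashMap<>();
        result.put("CanAssembleDocument", isSet(flags, ASSEMBLE_DOCUMENT));
        result.put("CanExtractContent", isSet(flags, EXTRACT));
        result.put("CanExtractForAccessibility", isSet(flags, EXTRACT_FOR_ACCESSIBILITY));
        result.put("CanFillInForm", isSet(flags, FILL_IN_FORM));
        result.put("CanModify", isSet(flags, MODIFY));
        result.put("CanModifyAnnotations", isSet(flags, MODIFY_ANNOTATIONS));
        result.put("CanPrint", isSet(flags, PRINT));
        result.put("CanPrintDegraded", isSet(flags, PRINT_QUALITY));
        return result;
    }
}
```

A document like 717226.pdf would, under this reading, have the EXTRACT bit cleared while other bits such as PRINT may still be set.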


> PDF Text extraction without permission
> --
>
> Key: TIKA-1489
> URL: https://issues.apache.org/jira/browse/TIKA-1489
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Tilman Hausherr
>
> In TIKA-1442 text extraction from files like 717226.pdf that don't have text 
> extraction permission works. The permissions in PDF files are only enforced 
> by the application (i.e. PDFBox), i.e. the text information isn't stored 
> separately in encrypted form. 
> PDFBox ExtractText command line does throw an exception.
> So I wonder why TIKA is able to extract text. Either TIKA or the PDFBox call 
> used bypasses the permission checking.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-12-01 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14230589#comment-14230589
 ] 

Tilman Hausherr commented on TIKA-1442:
---

Weird thing in the 1.8.6 vs 1.8.8 test: according to the Excel file, 301125.pdf 
worked in 1.8.6 but not in 1.8.8 (empty top n words). There are no exceptions 
in the exceptions column. When I test it, it fails with an exception in both 
versions.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: PDFBox_1_8_6DVPDFBox_1_8_8-TRAD-b156.xlsx, 
> PDFBox_1_8_6VPDFBox_1_8_8-b145.xlsx, PDFBox_1_8_6VPDFBox_1_8_8-b145.zip, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-ClassicVPDFBox_1_8_8-NonSeq.xlsx, 
> PDFBox_1_8_8-TRADVPDFBox_1_8_8-NONSEQ-b156.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

