[jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519
 ] 

Hong-Thai Nguyen commented on TIKA-1350:


Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1 to 
Maven Central (ref. 
https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254)

Once this is done, we can upgrade the Tika dependency to 0.8.1 to get the fix.

 OutlookPSTParser: Unknown message type: IPM.Note
 

 Key: TIKA-1350
 URL: https://issues.apache.org/jira/browse/TIKA-1350
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Jonathan Evans
  Labels: libpst, parser, pst
 Fix For: 1.7

   Original Estimate: 0.2h
  Remaining Estimate: 0.2h

 When parsing some emails in a PST file I get the error Unknown message type: 
 IPM.Note preventing them from being parsed. This is because of an extra null 
 byte at the end of the message class string.
 This has been fixed in version 0.8.1 of java-libpst so a version bump is all 
 that is required. 
 https://github.com/rjohnsondev/java-libpst/issues/14
 I would attempt to do this myself but I am unsure how to open a pull request 
 with SVN.
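
 The failure mode described above (a message class string carrying an extra
 trailing null byte) can be illustrated with a minimal sketch. This is not the
 actual java-libpst 0.8.1 fix; the class and method names below are
 hypothetical and exist only for illustration.

```java
// Hypothetical helper, not part of java-libpst or Tika.
final class MessageClassNormalizer {

    /** Removes any trailing NUL characters (U+0000) from a message class string. */
    static String stripTrailingNuls(String messageClass) {
        int end = messageClass.length();
        while (end > 0 && messageClass.charAt(end - 1) == '\0') {
            end--;
        }
        return messageClass.substring(0, end);
    }

    /** True if the value is "IPM.Note" once trailing NULs are ignored. */
    static boolean isNote(String messageClass) {
        return stripTrailingNuls(messageClass).equals("IPM.Note");
    }
}
```

 A plain equals() comparison treats "IPM.Note" followed by a NUL as a different
 string, which is consistent with the "Unknown message type" error reported
 here.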



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-06-23 Thread William Palmer (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040623#comment-14040623
 ] 

William Palmer commented on TIKA-1232:
--

I am currently out of the office and will be back on Thursday 26th June 2014.

Any FOI requests should be sent to foi-enquir...@bl.uk.




 Add PDF version to PDFParser output
 ---

 Key: TIKA-1232
 URL: https://issues.apache.org/jira/browse/TIKA-1232
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
 Environment: JDK6
Reporter: William Palmer
Assignee: Tim Allison
Priority: Minor
 Attachments: Sample 10.x.pdf, Sample 11.x PDFA-1b.pdf, Sample 
 4.x.pdf, Sample 5.x.pdf, Sample 6.x.pdf, Sample 7.x.pdf, Sample 8.x.pdf, 
 Sample 9.x.pdf, TIKA-1232v1.patch, TIKA-1232v2.patch, pdfversion.patch, 
 testComment.pdf


 I'd like to identify the PDF version of files, this is not currently reported 
 by the PDFParser although the information is available via PDFBox.  I have 
 attached a patch that adds the format version to the Metadata object.
 However, I am not familiar enough with the Tika source to know if an 
 alternative metadata key should be used, or this new one added.
 Comments welcome.





Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread kevin slote
What tika version will have the pst support?


On Mon, Jun 23, 2014 at 4:23 AM, Hong-Thai Nguyen (JIRA) j...@apache.org
wrote:


 [
 https://issues.apache.org/jira/browse/TIKA-1350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040519#comment-14040519
 ]

 Hong-Thai Nguyen commented on TIKA-1350:
 

 Richard Johnson (author of java-libpst) is trying to deploy the new version 0.8.1
 to Maven Central (ref.
 https://issues.sonatype.org/browse/OSSRH-8965?focusedCommentId=260254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-260254
 )

 Once this is done, we can upgrade the Tika dependency to 0.8.1 to get the fix.

  OutlookPSTParser: Unknown message type: IPM.Note
  
 
  Key: TIKA-1350
  URL: https://issues.apache.org/jira/browse/TIKA-1350
  Project: Tika
   Issue Type: Bug
   Components: parser
 Affects Versions: 1.7
 Reporter: Jonathan Evans
   Labels: libpst, parser, pst
  Fix For: 1.7
 
Original Estimate: 0.2h
   Remaining Estimate: 0.2h
 
  When parsing some emails in a PST file I get the error Unknown message
 type: IPM.Note preventing them from being parsed. This is because of an
 extra null byte at the end of the message class string.
  This has been fixed in version 0.8.1 of java-libpst so a version bump is
 all that is required.
  https://github.com/rjohnsondev/java-libpst/issues/14
  I would attempt to do this myself but I am unsure how to open a pull
 request with SVN.



 --
 This message was sent by Atlassian JIRA
 (v6.2#6252)



[jira] [Commented] (TIKA-1232) Add PDF version to PDFParser output

2014-06-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040816#comment-14040816
 ] 

Tyler Palsulich commented on TIKA-1232:
---

Hey [~talli...@mitre.org]. A couple -- TIKA-758 will need an update (I put up a 
patch a few days ago corresponding to an older version upgrade, too) and the 
workaround in TIKA-1325 can be removed (since PDFBOX-2122 is resolved). 



[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox

2014-06-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040819#comment-14040819
 ] 

Tyler Palsulich commented on TIKA-617:
--

Hi, 

Are you still having this issue? Do you have the/a PDF which caused this 
exception? Thanks!

Tyler

 Series of exceptions from PDFBox
 

 Key: TIKA-617
 URL: https://issues.apache.org/jira/browse/TIKA-617
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Erik Hetzner

 Hi,
 I am getting the following exception from PDFBox. Thank you!
 (If I should file these upstream at PDFBox first, please let me know.)
 {noformat}
 $ java -jar tika-app-1.0-SNAPSHOT.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
 ERROR - Stop reading corrupt stream
 INFO - unsupported/disabled operation: f24.481
 INFO - unsupported/disabled operation: ree)n.
 WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
 INFO - unsupported/disabled operation: i-
 INFO - unsupported/disabled operation: R4%
 INFO - unsupported/disabled operation: )
 INFO - unsupported/disabled operation: Re.8
 INFO - unsupported/disabled operation: e.
 INFO - unsupported/disabled operation: FE)-
 WARN - java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
 java.lang.ClassCastException: org.apache.pdfbox.cos.COSInteger cannot be cast to org.apache.pdfbox.cos.COSArray
   at org.apache.pdfbox.util.operator.ShowTextGlyph.process(ShowTextGlyph.java:44)
   at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:551)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:89)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:107)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:302)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:91)
 INFO - unsupported/disabled operation: R3%
 INFO - unsupported/disabled operation: T
 Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@5809fdee
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
   at 

[jira] [Commented] (TIKA-1187) java.lang.OutOfMemoryError: Java heap space

2014-06-23 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040830#comment-14040830
 ] 

Tyler Palsulich commented on TIKA-1187:
---

Hi [~Guffi],

Did you ever get this issue resolved? Do you still have the problematic file?

Thanks,
Tyler

 java.lang.OutOfMemoryError: Java heap space
 ---

 Key: TIKA-1187
 URL: https://issues.apache.org/jira/browse/TIKA-1187
 Project: Tika
  Issue Type: Bug
  Components: general
Affects Versions: 1.3
 Environment: Ubuntu 
Reporter: GURFAN
Priority: Critical
   Original Estimate: 612h
  Remaining Estimate: 612h

 Hi,
 While parsing the content we are getting below exception in parse method.
 The file which we are parsing is 1 mb.
 TIKA JAR:  tika-core-1.3.jar
 File size: 1 MB.
 Parser parser = new AutoDetectParser();
 parser.parse(is, handler, metaData, new ParseContext());
 java.lang.OutOfMemoryError: Java heap space
   at java.util.Arrays.copyOf(Arrays.java:2734)
   at java.util.ArrayList.ensureCapacity(ArrayList.java:167)
   at java.util.ArrayList.add(ArrayList.java:351)
   at org.apache.fontbox.ttf.GlyfCompositeDescript.<init>(GlyfCompositeDescript.java:60)
   at org.apache.fontbox.ttf.GlyphData.initData(GlyphData.java:63)
   at org.apache.fontbox.ttf.GlyphTable.initData(GlyphTable.java:71)
   at org.apache.fontbox.ttf.AbstractTTFParser.parseTables(AbstractTTFParser.java:163)
   at org.apache.fontbox.ttf.TTFParser.parseTables(TTFParser.java:61)
   at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:90)
   at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26)
   at org.apache.fontbox.ttf.AbstractTTFParser.parseTTF(AbstractTTFParser.java:66)
   at org.apache.fontbox.ttf.TTFParser.parseTTF(TTFParser.java:26)
   at org.apache.tika.parser.font.TrueTypeParser.parse(TrueTypeParser.java:65)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at com.impetus.vajra.parser.tika.TikaParser.processContent(TikaParser.java:96)
   at com.impetus.vajra.storm.helper.TextAnalyserBoltHelper.execute(TextAnalyserBoltHelper.java:283)
   at com.impetus.vajra.storm.TextAnalyserBolt.execute(TextAnalyserBolt.java:182)
   at backtype.storm.daemon.executor$fn__4050$tuple_action_fn__4052.invoke(executor.clj:566)
   at backtype.storm.daemon.executor$mk_task_receiver$fn__3976.invoke(executor.clj:345)
   at backtype.storm.disruptor$clojure_handler$reify__1606.onEvent(disruptor.clj:43)
   at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:84)
   at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:58)
   at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62)
   at backtype.storm.daemon.executor$fn__4050$fn__4059$fn__4106.invoke(executor.clj:658)
   at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
   at clojure.lang.AFn.run(AFn.java:24)
   at java.lang.Thread.run(Thread.java:662)





[jira] [Created] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1352:
-

 Summary: Upgrade to PDFBox 1.8.6
 Key: TIKA-1352
 URL: https://issues.apache.org/jira/browse/TIKA-1352
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor








[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1352:
--

Attachment: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip

Diffs btwn PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 
workarounds removed) on a random selection of 10k pdf files in govdocs1.

Both runs used the older sequential parser.

The table file is a tab-delimited UTF-16LE file.
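
Because the table is tab-delimited UTF-16LE rather than UTF-8, it has to be decoded explicitly. The following is a minimal sketch using only the Java standard library; the class and method names are hypothetical, and nothing is assumed about the column layout of the attached file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika.
final class Utf16TableReader {

    /** Decodes UTF-16LE bytes and splits them into rows of tab-separated cells. */
    static List<String[]> parseRows(byte[] utf16leBytes) {
        String text = new String(utf16leBytes, StandardCharsets.UTF_16LE);
        // The UTF_16LE charset does not strip a byte order mark, so drop it if present.
        if (text.startsWith("\uFEFF")) {
            text = text.substring(1);
        }
        List<String[]> rows = new ArrayList<>();
        if (text.isEmpty()) {
            return rows;
        }
        for (String line : text.split("\r?\n")) {
            // Limit -1 keeps trailing empty cells.
            rows.add(line.split("\t", -1));
        }
        return rows;
    }
}
```

The bytes can be obtained with java.nio.file.Files.readAllBytes for a file of modest size.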

This is a first go at the initial/raw output of comparison code for TIKA-1302.  
Much more work remains.

The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds. 

Other than that, we should probably look at the few hundred files that have 
token overlap of < 98%.

To view the original files from gov docs (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf
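
The comment does not say how token overlap is computed in the TIKA-1302 comparison code. As a rough sketch under one possible definition (distinct whitespace-separated tokens shared by both extractions, divided by distinct tokens in either), the check could look like the following. The class name and the definition itself are assumptions.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; the real comparison code may define overlap differently.
final class TokenOverlap {

    /** Lower-cases the text and returns its distinct whitespace-separated tokens. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Shared distinct tokens over distinct tokens in either text, as a percentage. */
    static double overlapPercent(String a, String b) {
        Set<String> tokensA = tokens(a);
        Set<String> tokensB = tokens(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) {
            return 100.0; // two empty extractions are treated as identical
        }
        Set<String> shared = new HashSet<>(tokensA);
        shared.retainAll(tokensB);
        Set<String> union = new HashSet<>(tokensA);
        union.addAll(tokensB);
        return 100.0 * shared.size() / union.size();
    }
}
```

Under this definition, a file would be flagged for review when overlapPercent of the two extractions falls below 98.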

 Upgrade to PDFBox 1.8.6
 ---

 Key: TIKA-1352
 URL: https://issues.apache.org/jira/browse/TIKA-1352
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip








[jira] [Updated] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1352:
--

Description: This is to track moving to PDFBox 1.8.6.

 Upgrade to PDFBox 1.8.6
 ---

 Key: TIKA-1352
 URL: https://issues.apache.org/jira/browse/TIKA-1352
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip


 This is to track moving to PDFBox 1.8.6.





Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread Nick Burch

On Mon, 23 Jun 2014, kevin slote wrote:

What tika version will have the pst support?


See TIKA-623 - PST support is already in trunk, and will be included in 
Tika 1.6 when that gets released


Nick


[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox

2014-06-23 Thread Erik Hetzner (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040968#comment-14040968
 ] 

Erik Hetzner commented on TIKA-617:
---

The URL containing the PDF is listed in the above comment. Trying it with 1.5 
gives different errors and generates an incomplete XML file:

{noformat}
java -jar tika-app-1.5.jar http://www.arb.ca.gov/research/apr/past/01-340.pdf > /dev/null
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
ERROR - FlateFilter: stop reading corrupt stream due to a DataFormatException
Exception in thread "main" org.apache.tika.exception.TikaException: Unable to extract PDF content
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:122)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:143)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
Caused by: java.io.IOException
   at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:138)
   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:336)
   at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:248)
   at org.apache.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:183)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.init(PDFStreamParser.java:107)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:106)
   ... 7 more
Caused by: java.util.zip.DataFormatException: invalid distance too far back
   at java.util.zip.Inflater.inflateBytes(Native Method)
   at java.util.zip.Inflater.inflate(Inflater.java:259)
   at java.util.zip.Inflater.inflate(Inflater.java:280)
   at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:169)
   at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:98)
   ... 18 more
{noformat}


Re: [jira] [Commented] (TIKA-1350) OutlookPSTParser: Unknown message type: IPM.Note

2014-06-23 Thread kevin slote
Do you already have this issue fixed?  I encountered something similar to
this and already worked it out.


On Mon, Jun 23, 2014 at 1:03 PM, Nick Burch apa...@gagravarr.org wrote:

 On Mon, 23 Jun 2014, kevin slote wrote:

 What tika version will have the pst support?


 See TIKA-623 - PST support is already in trunk, and will be included in
 Tika 1.6 when that gets released

 Nick



[jira] [Commented] (TIKA-617) Series of exceptions from PDFBox

2014-06-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041025#comment-14041025
 ] 

Tim Allison commented on TIKA-617:
--

Confirmed still a problem with both classic (sequential) and newer 
NonSequentialParser in Tika trunk with PDFBox 1.8.6.  Please open an issue in 
PDFBox if you haven't done so already.  Thank you!

Found same issue here (although Adobe couldn't read this one either without 
serious problems):
http://digitalcorpora.org/corp/nps/files/govdocs1/898/898385.pdf


[jira] [Comment Edited] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14040934#comment-14040934
 ] 

Tim Allison edited comment on TIKA-1352 at 6/23/14 5:39 PM:


Diffs btwn PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 
workarounds removed) on a random selection of 10k pdf files in govdocs1.

Both runs used the older sequential parser.

The table file is a tab-delimited UTF-16LE file.

This is a first go at the initial/raw output of comparison code for TIKA-1302.  
Much more work remains.

The ZipBomb exceptions are caused by my incorrect first attempt to remove 
PDFBox-1130 workarounds. These will go away.

Other than that, we should probably look at the few hundred files that have 
token overlap of < 98%.

To view the original files from gov docs (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf


was (Author: talli...@mitre.org):
Diffs btwn PDFBox 1.8.5 and 1.8.6 on Tika 1.6 SNAPSHOT (with PDFBox-1130 
workarounds removed) on a random selection of 10k pdf files in govdocs1.

Both runs used the older sequential parser.

The table file is a tab-delimited UTF-16LE file.

This is a first go at the initial/raw output of comparison code for TIKA-1302.  
Much more work remains.

The ZipBomb exceptions show that we cannot yet remove PDFBox-1130 workarounds. 

Other than that, we should probably look at the few hundred files that have 
token overlap of < 98%.

To view the original files from gov docs (e.g. 765470), navigate to:

http://digitalcorpora.org/corp/nps/files/govdocs1/765/765470.pdf

 Upgrade to PDFBox 1.8.6
 ---

 Key: TIKA-1352
 URL: https://issues.apache.org/jira/browse/TIKA-1352
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip


 This is to track moving to PDFBox 1.8.6.





[jira] [Created] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata

2014-06-23 Thread Steve R (JIRA)
Steve R created TIKA-1353:
-

 Summary: OpenDocumentParser doesn't correctly process metadata
 Key: TIKA-1353
 URL: https://issues.apache.org/jira/browse/TIKA-1353
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.5
Reporter: Steve R


When using OpenDocumentParser, the metadata isn't set correctly. When using it 
to write an HTML file, the only metadata that it knows about is the content 
type, because that is set ahead of time.

The problem is that when iterating over the zip contents, meta.xml isn't 
processed before content.xml. The metadata set on the parse object is correct 
after parse() returns; however, the contents of the resulting HTML file are 
missing all of the metadata.

Changing the code to be 

boolean parsedMetaData = false;
boolean delayLoadContent = false;
while (entry != null) {
    ...
    } else if (entry.getName().equals("meta.xml")) {
        meta.parse(zip, new DefaultHandler(), metadata, context);
        parsedMetaData = true;

        if (delayLoadContent) {
            if (content instanceof OpenDocumentContentParser) {
                ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
            } else {
                // Foreign content parser was set:
                content.parse(zip, handler, metadata, context);
            }
        }
    } else if (entry.getName().endsWith("content.xml")) {
        if (!parsedMetaData) {
            delayLoadContent = true;
        } else {
            if (content instanceof OpenDocumentContentParser) {
                ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context);
            } else {
                // Foreign content parser was set:
                content.parse(zip, handler, metadata, context);
            }
        }
    }

works as expected.
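
The ordering problem can be reproduced without Tika. The sketch below uses only java.util.zip: it builds an in-memory archive with content.xml stored before meta.xml, then walks the entries while buffering the content bytes until the metadata entry has been handled. The class and method names are hypothetical, and buffering the bytes is one possible way to defer the content step, not necessarily what the patch above does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demonstration class; not part of Tika.
final class MetaFirstZipReader {

    /** Builds a small in-memory ZIP whose entries appear in the given order. */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("<" + name + "/>").getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Returns the order in which entries are handled; meta.xml precedes content.xml. */
    static List<String> processingOrder(byte[] zipBytes) {
        List<String> handled = new ArrayList<>();
        String deferredName = null;
        byte[] deferredBytes = null; // content buffered until metadata is seen
        boolean metaSeen = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                if (name.equals("meta.xml")) {
                    handled.add(name);
                    metaSeen = true;
                    if (deferredBytes != null) {
                        handled.add(deferredName); // now handle the buffered content
                        deferredBytes = null;
                    }
                } else if (name.endsWith("content.xml")) {
                    if (metaSeen) {
                        handled.add(name);
                    } else {
                        deferredName = name;
                        deferredBytes = readCurrentEntry(zip);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (deferredBytes != null) {
            handled.add(deferredName); // archive had no meta.xml at all
        }
        return handled;
    }

    /** Reads the remaining bytes of the entry the stream is positioned on. */
    private static byte[] readCurrentEntry(ZipInputStream zip) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = zip.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Buffering matters because a ZipInputStream only moves forward: once the loop has advanced to meta.xml, the bytes of an earlier content.xml can no longer be read from the same stream.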





Review Request 22892: New parser for ENVI header files

2014-06-23 Thread Ann Burgess

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22892/
---

Review request for tika.


Bugs: TIKA-1274
https://issues.apache.org/jira/browse/TIKA-1274


Repository: tika


Description
---

New parser for ENVI header files.  Note, this is a parser for header files that 
will have an associated, separate data file.  This parser will not extract 
content from the data file. 


Diffs
-

  
  trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION 
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION 
  trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION 

Diff: https://reviews.apache.org/r/22892/diff/


Testing
---

Text parsing test completed with file envi_test_header.hdr. 


Thanks,

Ann Burgess



Re: Review Request 22892: New parser for ENVI header files

2014-06-23 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22892/#review46459
---


Looks great, Annie. With the package updates, I think I can commit this.

- Chris Mattmann


On June 23, 2014, 9:43 p.m., Ann Burgess wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/22892/
 ---
 
 (Updated June 23, 2014, 9:43 p.m.)
 
 




Re: Review Request 22892: New parser for ENVI header files

2014-06-23 Thread Chris Mattmann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22892/#review46457
---



trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java
https://reviews.apache.org/r/22892/#comment81848

org.apache.tika.parser.envi



trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java
https://reviews.apache.org/r/22892/#comment81849

org.apache.tika.parser.envi


- Chris Mattmann


On June 23, 2014, 9:43 p.m., Ann Burgess wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/22892/
 ---
 
 (Updated June 23, 2014, 9:43 p.m.)
 
 




Re: Review Request 22892: New parser for ENVI header files

2014-06-23 Thread Ann Burgess

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22892/
---

(Updated June 23, 2014, 10:01 p.m.)


Review request for tika.


Bugs: TIKA-1274
https://issues.apache.org/jira/browse/TIKA-1274


Repository: tika


Description
---

New parser for ENVI header files.  Note, this is a parser for header files that 
will have an associated, separate data file.  This parser will not extract 
content from the data file. 


Diffs (updated)
-

  
  trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION 
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION 
  trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION 

Diff: https://reviews.apache.org/r/22892/diff/


Testing
---

Text parsing test completed with file envi_test_header.hdr. 


Thanks,

Ann Burgess



Re: Review Request 22892: New parser for ENVI header files

2014-06-23 Thread Ann Burgess

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22892/
---

(Updated June 23, 2014, 11:14 p.m.)


Review request for tika.


Bugs: TIKA-1274
https://issues.apache.org/jira/browse/TIKA-1274


Repository: tika


Description
---

New parser for ENVI header files.  Note, this is a parser for header files that 
will have an associated, separate data file.  This parser will not extract 
content from the data file. 


Diffs (updated)
-

  
  trunk/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java PRE-CREATION 
  trunk/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java PRE-CREATION 
  trunk/tika-parsers/src/test/resources/test-documents/envi_test_header.hdr PRE-CREATION 

Diff: https://reviews.apache.org/r/22892/diff/


Testing
---

Text parsing test completed with file envi_test_header.hdr. 


Thanks,

Ann Burgess



[jira] [Resolved] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1352.
---

   Resolution: Fixed
Fix Version/s: 1.6

It turns out that we still need the PDFBOX-1130 workarounds because of the 
newly discovered PDFBOX-2160.

Other than that, the results are quite similar with 1.8.5 and 1.8.6.  We can 
remove quite a bit of clutter in the test cases because of improvements in 
1.8.6, and we can unignore another test.

Many thanks to our colleagues at PDFBox for all of these mods, especially 
[~tilman], [~jahewson] and [~lehmi]!

I included some of the code cleanups recommended by [~tpalsulich] over on 
TIKA-758.  Thank you, Tyler!

Finally, [~lfcnassif], if you have a chance to run trunk against your test set, 
please let us know if there are any surprises.  I'd prefer not to have a rerun 
of TIKA-1233.

Fixed r1604989.

 Upgrade to PDFBox 1.8.6
 ---

 Key: TIKA-1352
 URL: https://issues.apache.org/jira/browse/TIKA-1352
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA_1_6_SNAPSHOT_PDFBOX_1.8.5_vsPDFBOX_1.8.6.zip


 This is to track moving to PDFBox 1.8.6.





[jira] [Commented] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2014-06-23 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041569#comment-14041569
 ] 

Tim Allison commented on TIKA-758:
--

[~tpalsulich], thank you for the patch.  It turns out that we need to retain 
this work-around until PDFBOX-2160's fix makes it into the next release of 
PDFBox (1.8.7?).  I included your code cleanups in the mods for TIKA-1352.  
Thank you!

 Address TODOs when we upgrade to next PDFBox release
 

 Key: TIKA-758
 URL: https://issues.apache.org/jira/browse/TIKA-758
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Michael McCandless
 Attachments: TIKA-758.Palsulich.061714.patch


 Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
 the code when we next upgrade PDFBox.





[jira] [Commented] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041615#comment-14041615
 ] 

Hudson commented on TIKA-1352:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #62 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/62/])
TIKA-1352 update CHANGES.txt (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604995)
* /tika/trunk/CHANGES.txt
TIKA-1352 upgrade to PDFBox 1.8.6 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604989)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java




[jira] [Commented] (TIKA-1352) Upgrade to PDFBox 1.8.6

2014-06-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041632#comment-14041632
 ] 

Hudson commented on TIKA-1352:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #62 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/62/])
TIKA-1352 update CHANGES.txt (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604995)
* /tika/trunk/CHANGES.txt
TIKA-1352 upgrade to PDFBox 1.8.6 (tallison: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1604989)
* /tika/trunk/tika-parsers/pom.xml
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java




[jira] [Commented] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata

2014-06-23 Thread Steve R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639
 ] 

Steve R commented on TIKA-1353:
---

Ignore my suggested code example; it clearly doesn't work.

My question now is this: why is the following code commented out? It seems to 
work.

/*
 * ZipFile zipFile;
 * if (stream instanceof TikaInputStream) {
 *     TikaInputStream tis = (TikaInputStream) stream;
 *     Object container = ((TikaInputStream) stream).getOpenContainer();
 *     if (container instanceof ZipFile) {
 *         zipFile = (ZipFile) container;
 *     } else if (tis.hasFile()) {
 *         zipFile = new ZipFile(tis.getFile());
 *     }
 * }
 */

// TODO: if incoming IS is a TIS with a file
// associated, we should open ZipFile so we can
// visit metadata, mimetype first; today we lose
// all the metadata if meta.xml is hit after
// content.xml in the stream. Then we can still
// read-once for the content.xml.
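
The idea in the TODO above, opening the archive as a ZipFile so that meta.xml can be visited first regardless of where it sits in the archive, can be sketched with java.util.zip alone. This is only an illustration under stated assumptions: the class ZipFileMetaFirstSketch, its methods, and the sample archive are hypothetical and not part of Tika, and the sketch assumes a file on disk is available (the case the commented-out code checks with tis.hasFile()).

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of Tika.
class ZipFileMetaFirstSketch {

    // Writes a temporary archive in which content.xml precedes meta.xml.
    static File writeSampleZip() throws IOException {
        File file = File.createTempFile("sketch", ".odt");
        file.deleteOnExit();
        try (OutputStream out = Files.newOutputStream(file.toPath());
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : new String[] {"content.xml", "meta.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                String body = "<" + name.replace(".xml", "") + "/>";
                zip.write(body.getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return file;
    }

    // Looks an entry up by name and returns its text, or null if absent.
    static String readEntry(ZipFile zipFile, String name) throws IOException {
        ZipEntry entry = zipFile.getEntry(name);
        if (entry == null) {
            return null;
        }
        try (InputStream in = zipFile.getInputStream(entry)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Reads meta.xml first, then content.xml, independent of archive order.
    static String[] readMetaThenContent(File file) throws IOException {
        try (ZipFile zipFile = new ZipFile(file)) {
            String meta = readEntry(zipFile, "meta.xml");
            String content = readEntry(zipFile, "content.xml");
            return new String[] {meta, content};
        }
    }

    public static void main(String[] args) throws IOException {
        String[] parts = readMetaThenContent(writeSampleZip());
        System.out.println(parts[0]);
        System.out.println(parts[1]);
    }
}
```

Because ZipFile reads the central directory, entries can be fetched by name in any order; a plain stream without a backing file would still need the sequential path.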

 OpenDocumentParser doesn't correctly process metadata
 -

 Key: TIKA-1353
 URL: https://issues.apache.org/jira/browse/TIKA-1353
 Project: Tika
  Issue Type: Bug
  Components: metadata, parser
Affects Versions: 1.5
Reporter: Steve R
   Original Estimate: 24h
  Remaining Estimate: 24h



[jira] [Comment Edited] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata

2014-06-23 Thread Steve R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639
 ] 

Steve R edited comment on TIKA-1353 at 6/24/14 3:09 AM:


Ignore my suggested code example; it clearly doesn't work.

My question now is this: why is the following code commented out? It seems to 
work.

/*
 ZipFile zipFile;
 if (stream instanceof TikaInputStream) {
     TikaInputStream tis = (TikaInputStream) stream;
     Object container = ((TikaInputStream) stream).getOpenContainer();
     if (container instanceof ZipFile) {
         zipFile = (ZipFile) container;
     } else if (tis.hasFile()) {
         zipFile = new ZipFile(tis.getFile());
     }
 }
 */

// TODO: if incoming IS is a TIS with a file
// associated, we should open ZipFile so we can
// visit metadata, mimetype first; today we lose
// all the metadata if meta.xml is hit after
// content.xml in the stream. Then we can still
// read-once for the content.xml.




[jira] [Comment Edited] (TIKA-1353) OpenDocumentParser doesn't correctly process metadata

2014-06-23 Thread Steve R (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041639#comment-14041639
 ] 

Steve R edited comment on TIKA-1353 at 6/24/14 3:09 AM:


Ignore my suggested code example; it clearly doesn't work.

My question now is this: why is the following code commented out? It seems to 
work.

/*
 ZipFile zipFile;
 if (stream instanceof TikaInputStream) {
     TikaInputStream tis = (TikaInputStream) stream;
     Object container = ((TikaInputStream) stream).getOpenContainer();
     if (container instanceof ZipFile) {
         zipFile = (ZipFile) container;
     } else if (tis.hasFile()) {
         zipFile = new ZipFile(tis.getFile());
     }
 }
 */

// TODO: if incoming IS is a TIS with a file
// associated, we should open ZipFile so we can
// visit metadata, mimetype first; today we lose
// all the metadata if meta.xml is hit after
// content.xml in the stream. Then we can still
// read-once for the content.xml.

