[jira] [Commented] (TIKA-1865) Save sender email address in Outlook MSG metadata

2016-02-26 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15169968#comment-15169968
 ] 

Jeremy B. Merrill commented on TIKA-1865:
-

My heart wants to say yes, but my calendar says no. :) Or at least not any 
time super soon.

You're right that this is a ticket that's interesting to me, though. I did just 
get my own dump of real-life .msg files (not shareable, unfortunately) and I've 
noticed how senders' email addresses seem to get lost, which is a pain... Is 
this just a feature that is not yet implemented? Or is there an underlying 
reason why?

(Funnily enough, it matches the behavior of Outlook printouts, which give you 
only the sender's alias, not their address -- including, most annoyingly for 
me, in the dumps of Hillary Clinton's emails that the State Dept. has been 
releasing.) 

Do we know if all the various email formats include the sender's email address, 
so it'd be theoretically accessible to Tika somehow? What even are all the 
formats for emails that Tika handles? Outlook (PST/MSG), .eml/rfc822, mbox, 
anything else?

> Save sender email address in Outlook MSG metadata
> -
>
> Key: TIKA-1865
> URL: https://issues.apache.org/jira/browse/TIKA-1865
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
> Environment: Windows 7 x64, jre 1.8.0_60 x64
>Reporter: Luis Filipe Nassif
>
> Sender email address is lost when extracting metadata from Outlook msg files. 
> Currently only the sender name is extracted. The address is important 
> information to extract for search engines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (TIKA-1771) lower magic priority xhtml magic priority to ensure emails detected as message/rfc822

2015-10-15 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1771:

Description: 
Emails I have (happy to share if you want) contain XHTML, as one part of a 
multipart email. Prior to this pull request, the priority on the 
application/xhtml+xml magic detector was 50, equal to the priority on the 
message/rfc822 detector. Because of the relative position of the two detectors 
in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.

With this PR, by downgrading the priority of application/xhtml+xml to 40, the 
more-sensitive email magic detectors take precedence, causing the emails to be 
properly detected as message/rfc822.

I have not run this thru the govdocs tester or anything other than my own 
documents, so, full disclosure, this could cause false negative 
xhtml-detections elsewhere.

I should note this occurs on trunk, from Github, up-to-date as of Tuesday-ish.
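
The tie-breaking behavior described above can be illustrated with a minimal sketch. This is not Tika's implementation: the `MagicRule` type, the match functions, and the rule that ties go to whichever entry is listed first are all assumptions made for illustration, and the sample message is made up. Only the MIME types and the priorities 50 and 40 come from the description.

```python
from typing import Callable, List, NamedTuple


class MagicRule(NamedTuple):
    """Hypothetical stand-in for one magic entry in a MIME type registry."""
    mime: str
    priority: int
    matches: Callable[[bytes], bool]


def detect(rules: List[MagicRule], data: bytes) -> str:
    """Return the MIME type of the highest-priority matching rule."""
    hits = [rule for rule in rules if rule.matches(data)]
    if not hits:
        return "application/octet-stream"
    # max() returns the first maximal element, so on a priority tie the
    # rule listed earlier wins (an assumption about the tie-break).
    return max(hits, key=lambda rule: rule.priority).mime


def rules_with(xhtml_priority: int) -> List[MagicRule]:
    # Simplified, hypothetical patterns; the real ones live in tika-mimetypes.xml.
    return [
        MagicRule("application/xhtml+xml", xhtml_priority,
                  lambda d: b"<html xmlns=" in d[:8192]),
        MagicRule("message/rfc822", 50,
                  lambda d: d.startswith(b"From: ")),
    ]


# A made-up multipart email with an XHTML part.
SAMPLE = (
    b"From: a@example.com\r\n"
    b"Content-Type: multipart/alternative; boundary=x\r\n\r\n"
    b"--x\r\nContent-Type: application/xhtml+xml\r\n\r\n"
    b'<html xmlns="http://www.w3.org/1999/xhtml"><body>hi</body></html>\r\n'
    b"--x--\r\n"
)

before = detect(rules_with(50), SAMPLE)  # both rules match at priority 50
after = detect(rules_with(40), SAMPLE)   # XHTML rule lowered to 40
```

With equal priorities the outcome depends only on list order, which is why lowering one priority is enough to change the result without touching the patterns.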








[jira] [Created] (TIKA-1771) lower magic priority xhtml magic priority to ensure emails detected as message/rfc822

2015-10-15 Thread Jeremy B. Merrill (JIRA)
Jeremy B. Merrill created TIKA-1771:
---

 Summary: lower magic priority xhtml magic priority to ensure 
emails detected as message/rfc822
 Key: TIKA-1771
 URL: https://issues.apache.org/jira/browse/TIKA-1771
 Project: Tika
  Issue Type: Improvement
  Components: detector
Reporter: Jeremy B. Merrill
Priority: Critical


Emails I have (happy to share if you want) contain XHTML, as one part of a 
multipart email. Prior to this pull request, the priority on the 
application/xhtml+xml magic detector was 50, equal to the priority on the 
message/rfc822 detector. Because of the relative position of the two detectors 
in tika-mimetypes.xml, the emails were incorrectly detected as XHTML documents.

With this PR, by downgrading the priority of application/xhtml+xml to 40, the 
more-sensitive email magic detectors take precedence, causing the emails to be 
properly detected as message/rfc822.

I have not run this thru the govdocs tester or anything other than my own 
documents, so, full disclosure, this could cause false negative 
xhtml-detections elsewhere.





[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

2015-07-08 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14619117#comment-14619117
 ] 

Jeremy B. Merrill commented on TIKA-1602:
-

Looks like the possible values are: 
```
Status:  O
Status:
Status:  U
Status: O
Status: R 
Status: RO
Status: U
Status: U 
```
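
A tally like the one above could be produced along these lines. This is a minimal sketch, not the script actually used: the function name and the sample messages are hypothetical. Whitespace variants are deliberately kept distinct, since the values listed differ only in spacing.

```python
from collections import Counter
from typing import Iterable


def first_line_status_values(messages: Iterable[bytes]) -> Counter:
    """Count distinct first lines that begin with a `Status:` header."""
    counts: Counter = Counter()
    for raw in messages:
        # Take the first line only; drop a trailing CR but keep other
        # whitespace so that "Status:  O" and "Status: O" stay separate.
        first_line = raw.split(b"\n", 1)[0].rstrip(b"\r")
        if first_line.startswith(b"Status:"):
            counts[first_line.decode("ascii", errors="replace")] += 1
    return counts


# Made-up messages for illustration.
samples = [
    b"Status: RO\r\nFrom: a@example.com\r\n\r\nhello",
    b"Status:  O\nFrom: b@example.com\n\nhello",
    b"Status: RO\nFrom: c@example.com\n\nhello",
    b"From: d@example.com\nStatus: U\n\nnot counted: Status is not first",
]

counts = first_line_status_values(samples)
```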

 Detecting standards-non-compliant emails as message/rfc822
 --

 Key: TIKA-1602
 URL: https://issues.apache.org/jira/browse/TIKA-1602
 Project: Tika
  Issue Type: New Feature
  Components: mime
Reporter: Jeremy B. Merrill
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.10

 Attachments: 036491.txt.zip

   Original Estimate: 1h
  Remaining Estimate: 1h

 Tika does not properly detect certain emails as `message/rfc822` if they're 
 slightly standards-non-compliant and begin with `Status: ` as the first 
 header. I've added `Status: ` as a magic detection line in 
 tika-mimetypes.xml. 
 This solves my problem and does not appear to cause unit test failures. I 
 have not yet run the tika-batch tests.
 As further information, the emails that are processed incorrectly come from 
 dumps taken directly from various US public officials' mailservers. I believe 
 that, since the dumps are not intended to be transmitted over the wire, they 
 are sometimes slightly non-compliant. 
 It's important to note that Tika (and the underlying library, James Mime4J) 
 do properly *parse* these emails, despite the non-compliant header. The 
 problem is getting Tika to *detect* the file as an email so that Mime4J gets 
 chosen to parse it.
 Pull request on Github at https://github.com/apache/tika/pull/40
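
The difference between detecting and parsing described above can be sketched as follows. This is only an illustration under stated assumptions: the prefix lists are hypothetical and are not the contents of tika-mimetypes.xml, and Python's standard `email` package stands in for Mime4J.

```python
from email import message_from_bytes
from typing import Tuple

# Hypothetical first-header prefixes that a detector might treat as email magic.
BASE_PREFIXES: Tuple[bytes, ...] = (b"From: ", b"Received: ", b"Return-Path: ")
PATCHED_PREFIXES: Tuple[bytes, ...] = BASE_PREFIXES + (b"Status: ",)


def looks_like_rfc822(data: bytes, prefixes: Tuple[bytes, ...]) -> bool:
    """Detection step: does the file start with a known header prefix?"""
    return data.startswith(prefixes)


# Made-up message whose first header is the mailbox-internal `Status:` line.
raw = (
    b"Status: RO\r\n"
    b"From: a@example.com\r\n"
    b"Subject: hello\r\n"
    b"\r\n"
    b"body\r\n"
)

detected_before = looks_like_rfc822(raw, BASE_PREFIXES)
detected_after = looks_like_rfc822(raw, PATCHED_PREFIXES)

# Parsing step: a tolerant parser copes with the unusual first header
# once it is actually handed the file.
msg = message_from_bytes(raw)
subject = msg["Subject"]
status = msg["Status"]
```

The parser never gets a chance to run unless detection succeeds first, which is why the fix belongs in the detection rules rather than in the parser.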





[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

2015-07-02 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612321#comment-14612321
 ] 

Jeremy B. Merrill commented on TIKA-1602:
-

Thank you, [~chrismattmann], [~talli...@mitre.org]  et al.!

[~talli...@mitre.org] -- got a bunch of normal headers, but also this `Status:` 
one. The only possible value in my dataset (a bunch of publicly-released emails 
from Jeb Bush's tenure as FL Gov) is `RO`, so the first line of every email that 
Tika treated improperly before this patch was uniformly `Status: RO`. 

I'm going to check the whole dataset, once I manage to download it all again 
from storage, to make sure there are no values other than `RO`.

My understanding is that some mail servers use this header internally to keep 
track of read status. When emails are exported, they retain the header, and it 
sometimes appears first -- even though the server would never send this header 
over the wire. 






[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1608:

Attachment: 1534-attachment.doc

document failing under this bug

 RuntimeException on extracting text from Word 97-2004 Document
 --

 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill
 Attachments: 1534-attachment.doc


 Extracting text from the Word 97-2004 document located here 
 (https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails 
 with the following stacktrace:
 $ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
 Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
   at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
   at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
   at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
 Caused by: java.lang.ArrayIndexOutOfBoundsException
   at java.lang.System.arraycopy(Native Method)
   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
   at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
   at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
   at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
   at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
   ... 5 more
 I'm using trunk from Github, which I think is a flavor of 1.9. The document 
 opens properly in Word for Mac '11.
 Happy to answer questions; I'm also on the user mailing list. If it's 
 relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
 that document here in Jira rather than on my own dropbox.)
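
When triaging a trace like the one above, it can help to pull out the innermost `Caused by:` exception and the first frame outside the JDK, since that frame suggests which library to report against. The helper below is a rough sketch for that purpose; the function name and the abridged trace excerpt are illustrative only.

```python
import re
from typing import Optional, Tuple

# Matches "at fully.qualified.Class.method(" and captures the qualified name.
FRAME = re.compile(r"^\s*at\s+([\w.$<>]+)\(")
JDK_PREFIXES = ("java.", "javax.", "sun.")


def root_cause(trace: str) -> Tuple[str, Optional[str]]:
    """Return (innermost exception, first non-JDK frame under it)."""
    lines = trace.splitlines()
    # Index of the last "Caused by:" line, or 0 when there is no nested cause.
    start = max(
        (i for i, line in enumerate(lines)
         if line.lstrip().startswith("Caused by:")),
        default=0,
    )
    exception = lines[start].split("Caused by:", 1)[-1].strip()
    for line in lines[start + 1:]:
        match = FRAME.match(line)
        if match and not match.group(1).startswith(JDK_PREFIXES):
            return exception, match.group(1)
    return exception, None


# Abridged excerpt of the trace in this report.
TRACE = """\
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    ... 5 more
"""

exception, frame = root_cause(TRACE)
```

Here the first non-JDK frame is in `org.apache.poi.hwpf`, which points at POI rather than Tika as the place where the failure originates.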





[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505102#comment-14505102
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

POI bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=57843






[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505093#comment-14505093
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

Hi Tim,

I added the document. I'm totally cool with the document being viewed by the 
public. I can't really grant it to the ASF since I didn't create it. It's an 
attachment from an email in an email dump (http://jebemail.com) posted by 
former Florida governor Jeb Bush. So whether it's usable is probably a question 
for the ASF's lawyers. 

But for the avoidance of doubt, I grant any rights that I might have in the 
document to the ASF.

I'll open a POI bug.






[jira] [Updated] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill updated TIKA-1608:

Description: 
Extracting text from the Word 97-2004 document attached here fails with the 
following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document 
opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's 
relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
that document here in Jira rather than on my own dropbox.)






[jira] [Commented] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-21 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505178#comment-14505178
 ] 

Jeremy B. Merrill commented on TIKA-1608:
-

It's the only one I've found so far out of 300,000ish documents (most of which 
are plain emails, few of which are .docs).






[jira] [Created] (TIKA-1608) RuntimeException on extracting text from Word 97-2004 Document

2015-04-20 Thread Jeremy B. Merrill (JIRA)
Jeremy B. Merrill created TIKA-1608:
---

 Summary: RuntimeException on extracting text from Word 97-2004 
Document
 Key: TIKA-1608
 URL: https://issues.apache.org/jira/browse/TIKA-1608
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.9
Reporter: Jeremy B. Merrill


Extracting text from the Word 97-2004 document located here 
(https://www.dropbox.com/s/oeu3kp2nhk20naw/1534-attachment.doc?dl=0) fails with 
the following stacktrace:

$ java -jar /tika-app/target/tika-app-1.9-SNAPSHOT.jar --text 1534-attachment.doc
Exception in thread "main" org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@69af0db6
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:283)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:180)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:477)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:134)
Caused by: java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.getGrpprl(PAPFormattedDiskPage.java:171)
    at org.apache.poi.hwpf.model.PAPFormattedDiskPage.<init>(PAPFormattedDiskPage.java:101)
    at org.apache.poi.hwpf.model.OldPAPBinTable.<init>(OldPAPBinTable.java:49)
    at org.apache.poi.hwpf.HWPFOldDocument.<init>(HWPFOldDocument.java:109)
    at org.apache.tika.parser.microsoft.WordExtractor.parseWord6(WordExtractor.java:532)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:84)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:281)
    ... 5 more

I'm using trunk from Github, which I think is a flavor of 1.9. The document 
opens properly in Word for Mac '11.

Happy to answer questions; I'm also on the user mailing list. If it's 
relevant, I'm on java 1.7.0_55... (Also let me know if there's a way to put 
that document here in Jira rather than on my own dropbox.)






[jira] [Commented] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

2015-04-13 Thread Jeremy B. Merrill (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492540#comment-14492540
 ] 

Jeremy B. Merrill commented on TIKA-1602:
-

Sounds about right, thanks for finding that for me. I'll go ahead and mark the 
issue a dupe or close it.

Any idea when that patch'll get merged into trunk? (Or -- since I'm an svn n00b 
-- if there's a way for me to download that patched version.)






[jira] [Closed] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

2015-04-13 Thread Jeremy B. Merrill (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeremy B. Merrill closed TIKA-1602.
---
Resolution: Duplicate






[jira] [Created] (TIKA-1602) Detecting standards-non-compliant emails as message/rfc822

2015-04-10 Thread Jeremy B. Merrill (JIRA)
Jeremy B. Merrill created TIKA-1602:
---

 Summary: Detecting standards-non-compliant emails as message/rfc822
 Key: TIKA-1602
 URL: https://issues.apache.org/jira/browse/TIKA-1602
 Project: Tika
  Issue Type: New Feature
Reporter: Jeremy B. Merrill
Priority: Minor


Tika does not properly detect certain emails as `message/rfc822` if they're 
slightly standards-non-compliant and begin with `Status: ` as the first header. 
I've added `Status: ` as a magic detection line in tika-mimetypes.xml. 

This solves my problem and does not appear to cause unit test failures. I have 
not yet run the tika-batch tests.

As further information, the emails that are processed incorrectly come from 
dumps taken directly from various US public officials' mailservers. I believe 
that, since the dumps are not intended to be transmitted over the wire, they 
are sometimes slightly non-compliant. 

It's important to note that Tika (and the underlying library, James Mime4J) do 
properly *parse* these emails, despite the non-compliant header. The problem is 
getting Tika to *detect* the file as an email so that Mime4J gets chosen to 
parse it.

Pull request on Github at https://github.com/apache/tika/pull/40


