Patching fix for Tika-521 on Tika 0.8

2012-11-21 Thread Jana, Kumar Raja
Hi,
Is it possible to patch the fix for Tika-521 to Tika 0.8 without upgrading to 
POI 3.8?

There is TikaExcelEventBasedExtraction.diff attached to the Jira case, can this 
be used to resolve the issue? We are tied down to POI 3.7 and cannot move to 
POI 3.8 due to compatibility issues with other code.

Thank you.

Regards,
Kumar


[jira] [Created] (TIKA-1028) Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.

2012-11-21 Thread Juha Haaga (JIRA)
Juha Haaga created TIKA-1028:


 Summary: Tika-server quits parsing of rfc-822 document prematurely 
when it encounters encrypted zip file as attachment.
 Key: TIKA-1028
 URL: https://issues.apache.org/jira/browse/TIKA-1028
 Project: Tika
  Issue Type: Bug
  Components: mime, parser, server
Affects Versions: 1.2, 1.3
Reporter: Juha Haaga


The Zip parser in tika-server does not allow passing in a password for 
decrypting the zip file, and it doesn't handle the unsupported feature 
gracefully. The problem occurs when an encrypted zip file is attached to an 
email document being parsed: the parser gives up and throws an exception:

WARNING: all: Unpacker failed
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.pkg.PackageParser@10fcc945

Caused by: 
org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: 
unsupported feature encryption used in entry

Instead of returning the successfully parsed components, Tika-server returns 
nothing. 

It would be better to return the rest of the parsed document contents along 
with the untouched offending zip file in the archive that Tika-server returns 
as a result. Until zip file decryption is added, this would always return the 
untouched zip file; after it is implemented, it should return the untouched 
zip file in cases where a wrong password was provided.
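
The requested behaviour could be sketched roughly as follows. This is only an 
illustration in plain Java, not Tika code: the class TolerantUnpackSketch, the 
PartParser interface and the Result holder are hypothetical names made up for 
this example. The idea is that a failure in one attachment is caught, the 
original bytes of that attachment are kept, and the remaining parts are still 
processed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; none of these types exist in Tika.
final class TolerantUnpackSketch {

    // Hypothetical stand-in for a per-attachment parser.
    interface PartParser {
        String parse(String name, byte[] content) throws Exception;
    }

    // Holds the extracted text plus the parts that were passed through untouched.
    static final class Result {
        final Map<String, String> parsedText = new LinkedHashMap<>();
        final Map<String, byte[]> untouched = new LinkedHashMap<>();
    }

    static Result unpack(Map<String, byte[]> parts, PartParser parser) {
        Result result = new Result();
        for (Map.Entry<String, byte[]> part : parts.entrySet()) {
            try {
                result.parsedText.put(part.getKey(),
                        parser.parse(part.getKey(), part.getValue()));
            } catch (Exception e) {
                // Do not abort the whole message: keep the original bytes instead.
                result.untouched.put(part.getKey(), part.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("body.txt", "hello".getBytes(StandardCharsets.UTF_8));
        parts.put("secret.zip", new byte[] {1, 2, 3});

        // The fake parser fails on zip files, imitating the encrypted-entry error.
        Result r = unpack(parts, (name, content) -> {
            if (name.endsWith(".zip")) {
                throw new IOException("unsupported feature encryption used in entry");
            }
            return new String(content, StandardCharsets.UTF_8);
        });

        System.out.println("parsed: " + r.parsedText.keySet());
        System.out.println("untouched: " + r.untouched.keySet());
    }
}
```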

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Patching fix for Tika-521 on Tika 0.8

2012-11-21 Thread Nick Burch

On Wed, 21 Nov 2012, Jana, Kumar Raja wrote:
> Is it possible to patch the fix for Tika-521 to Tika 0.8 without 
> upgrading to POI 3.8?


Tika 0.8 is fairly old, there have been lots of new features and bug fixes 
since then. Ditto POI 3.7


> There is TikaExcelEventBasedExtraction.diff attached to the Jira case, 
> can this be used to resolve the issue?


The patch won't work against 3.7 as it needs new features only available in 
3.8. You might be able to do some work and apply it to 3.7 with a few bits of 
functionality missing (e.g. protection info).


> We are tied down to POI 3.7 and cannot move to POI 3.8 due to 
> compatibility issues with other code.


POI 3.9 should be out shortly, I'd suggest you try with that, report any 
problems / regressions so they can be fixed before the final release, then 
go with that! If you look at the change list 
http://poi.apache.org/changes.html you'll see how many bug fixes there 
have been!


Nick


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2012-11-21 Thread Vladimir L. (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502287#comment-13502287
 ] 

Vladimir L. commented on TIKA-879:
--

I've got the same problem: when trying to parse an Outlook Express file 
(*.eml), I get the default text/plain Content-Type instead of the expected 
message/rfc822.

I'm using <pre>org.apache.tika.parser.AutoDetectParser</pre> with default 
settings, and during debugging I came to the same conclusion as the reporter 
of this issue: MediaTypeRegistry.isSpecializationOf(message/rfc822, 
text/plain) returns false.

If there is a way to vote for this bug to be fixed, or an easy workaround, 
please share it with us!

 Detection problem: message/rfc822 file is detected as text/plain.
 -

 Key: TIKA-879
 URL: https://issues.apache.org/jira/browse/TIKA-879
 Project: Tika
  Issue Type: Bug
  Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
 Environment: linux 3.2.9
 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov

 When using {{DefaultDetector}}, the mime type detected for {{.eml}} files is 
 different from the expected one (you can test it on {{testRFC822}} and 
 {{testRFC822_base64}} in 
 {{tika-parsers/src/test/resources/test-documents/}}).
 The main reason for this behavior is that only the magic detector really 
 works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an 
 {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
 As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, 
 text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} 
 works only by magic.



[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2012-11-21 Thread Vladimir L. (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502287#comment-13502287
 ] 

Vladimir L. edited comment on TIKA-879 at 11/21/12 8:33 PM:


I've got the same problem: when trying to parse an Outlook Express file 
(*.eml), I get the default text/plain Content-Type instead of the expected 
message/rfc822.

I'm using {{org.apache.tika.parser.AutoDetectParser}} with default settings, 
and during debugging I came to the same conclusion as the reporter of this 
issue: MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain) 
returns false.

If there is a way to vote for this bug to be fixed, or an easy workaround, 
please share it with us!

  was (Author: vladimir_l):
I've got the same problem: when trying to parse an Outlook Express file 
(*.eml), I get the default text/plain Content-Type instead of the expected 
message/rfc822.

I'm using <pre>org.apache.tika.parser.AutoDetectParser</pre> with default 
settings, and during debugging I came to the same conclusion as the reporter 
of this issue: MediaTypeRegistry.isSpecializationOf(message/rfc822, 
text/plain) returns false.

If there is a way to vote for this bug to be fixed, or an easy workaround, 
please share it with us!
  


[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2012-11-21 Thread Vladimir L. (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502393#comment-13502393
 ] 

Vladimir L. commented on TIKA-879:
--

I figured out where the root of the problem lies.
The original Tika configuration for message/rfc822 is the following:
{code:xml}
  <mime-type type="message/rfc822">
    <magic priority="50">
      <match value="Relay-Version:" type="string" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="string" offset="0"/>
      <match value="From:" type="string" offset="0"/>
      <match value="Received:" type="string" offset="0"/>
      <match type="string" value="Message-ID:" offset="0"/>
      <match type="string" value="Date:" offset="0"/>
    </magic>
    <glob pattern="*.eml"/>
    <glob pattern="*.mime"/>
    <glob pattern="*.mht"/>
    <glob pattern="*.mhtml"/>
  </mime-type>
{code}

Unfortunately, the e-mail message I'm testing with has the following header:
{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: n:n
X-Message-Delivery: Vj0xLjE7dXM .
X-Message-Info: aKlYzGSc+Ll01bU5 
Received: from mailout- 
Received: (qmail invoked by alias); 21 Nov 2012 20:11:35 -
Received: from mp017. 
X-Authenticated: #2407 
X-Provags-ID: V01U2FsdGVkX 
Received: (qmail 22194 invoked by uid 0); 21 Nov 2012 20:11:34 -
Received: from 
Content-Type: text/plain; charset=utf-8
Date: Wed, 21 Nov 2012 21:11:32 +0100
From: Vladimir L. vladimi...@example.com
Message-ID: 20121121201132.74...@example.com
MIME-Version: 1.0
Subject: JUnit test message
To: vladimi...@something.com
X-Flags: 0001
X-Mailer: WWW-Mail 6100 (Global Message Exchange)
X-Priority: 3
Content-Transfer-Encoding: 8bit
Return-Path: vladimi...@example.com
X-OriginalArrivalTime: 21 Nov 2012 20:11:36.0285 

Dear Vladimir .
{code}

As you can see, none of the listed patterns matches, since they are all 
configured with offset="0" and the message starts with a different header.
However, the e-mail header defines Content-Type: text/plain, which I assume 
influences the initial content type detection.
{{<sub-class-of type="text/plain"/>}} is not defined in the mime-type 
definition, so auto-detection via the *.eml extension fails for the reason 
given in this issue.

The current workaround for me is the following:
1. Create {{custom-mimetypes.xml}} as described here: 
[http://tika.apache.org/1.0/parser_guide.html#Add_your_MIME-Type]
2. Add a redefinition of the message/rfc822 mime-type as follows:
{code:xml}
  <mime-type type="message/rfc822">
    <magic priority="50">
      <match value="Relay-Version:" type="string" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="string" offset="0:2000"/>
      <match value="From:" type="string" offset="0"/>
      <match value="Received:" type="string" offset="0:2000"/>
      <match value="Message-ID:" type="string" offset="0:2000"/>
      <match value="Date:" type="string" offset="0"/>
    </magic>
    <glob pattern="*.eml"/>
    <glob pattern="*.mime"/>
    <glob pattern="*.mht"/>
    <glob pattern="*.mhtml"/>
    <sub-class-of type="text/plain"/>
  </mime-type>
{code}

Note the offset settings for *Message-ID:*, *Return-Path:*, and *Received:*.
I decided to keep the fall-back to extension detection by defining the 
super-class {{text/plain}}.

Hope this helps you work around this issue.
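
To illustrate why widening the offset matters, here is a small stand-alone 
sketch in plain Java. It is a simplified model of a "string" magic rule with 
an offset range, not Tika's actual implementation; the class 
MagicOffsetSketch and its matches method are hypothetical names, and the 
sample header is shortened. It assumes that an offset of 0 means the pattern 
must start at the first byte, while 0:2000 allows it to start anywhere in the 
first 2000 bytes.

```java
import java.nio.charset.StandardCharsets;

// Simplified model of a "string" magic rule; not Tika's implementation.
final class MagicOffsetSketch {

    // Returns true if pattern starts at any byte position in [from, to].
    static boolean matches(byte[] data, String pattern, int from, int to) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        // The pattern cannot start later than this and still fit in the data.
        int last = Math.min(to, data.length - p.length);
        for (int start = from; start <= last; start++) {
            boolean hit = true;
            for (int i = 0; i < p.length; i++) {
                if (data[start + i] != p[i]) {
                    hit = false;
                    break;
                }
            }
            if (hit) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shortened header: "Received:" is present, but not at the very start.
        byte[] mail = ("x-store-info: abc\r\n"
                + "X-SID-Result: Pass\r\n"
                + "Received: from mailout\r\n"
                + "Date: Wed, 21 Nov 2012 21:11:32 +0100\r\n")
                .getBytes(StandardCharsets.US_ASCII);

        System.out.println(matches(mail, "Received:", 0, 0));    // false
        System.out.println(matches(mail, "Received:", 0, 2000)); // true
    }
}
```

With the range 0:0 the rule only fires for messages that begin with the given 
header, which is why a message starting with x-store-info falls through to 
text/plain.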



[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.

2012-11-21 Thread Vladimir L. (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502393#comment-13502393
 ] 

Vladimir L. edited comment on TIKA-879 at 11/21/12 10:57 PM:
-

I figured out where the root of the problem lies.
The original Tika configuration for message/rfc822 is the following:
{code:xml}
  <mime-type type="message/rfc822">
    <magic priority="50">
      <match value="Relay-Version:" type="string" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="string" offset="0"/>
      <match value="From:" type="string" offset="0"/>
      <match value="Received:" type="string" offset="0"/>
      <match type="string" value="Message-ID:" offset="0"/>
      <match type="string" value="Date:" offset="0"/>
    </magic>
    <glob pattern="*.eml"/>
    <glob pattern="*.mime"/>
    <glob pattern="*.mht"/>
    <glob pattern="*.mhtml"/>
  </mime-type>
{code}

Unfortunately, the e-mail message I'm testing with has the following header:
{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: n:n
X-Message-Delivery: Vj0xLjE7dXM .
X-Message-Info: aKlYzGSc+Ll01bU5 
Received: from mailout- 
Received: (qmail invoked by alias); 21 Nov 2012 20:11:35 -
Received: from mp017. 
X-Authenticated: #2407 
X-Provags-ID: V01U2FsdGVkX 
Received: (qmail 22194 invoked by uid 0); 21 Nov 2012 20:11:34 -
Received: from 
Content-Type: text/plain; charset=utf-8
Date: Wed, 21 Nov 2012 21:11:32 +0100
From: Vladimir L. vladimi...@example.com
Message-ID: 20121121201132.74...@example.com
MIME-Version: 1.0
Subject: JUnit test message
To: vladimi...@something.com
X-Flags: 0001
X-Mailer: WWW-Mail 6100 (Global Message Exchange)
X-Priority: 3
Content-Transfer-Encoding: 8bit
Return-Path: vladimi...@example.com
X-OriginalArrivalTime: 21 Nov 2012 20:11:36.0285 

Dear Vladimir .
{code}

As you can see, none of the listed patterns matches, since they are all 
configured with offset="0" and the message starts with a different header.
However, the e-mail header defines Content-Type: text/plain, which I assume 
influences the initial content type detection.
{{<sub-class-of type="text/plain"/>}} is not defined in the mime-type 
definition, so auto-detection via the *.eml extension fails for the reason 
given in this issue.

The current workaround for me is the following:
1. Create {{custom-mimetypes.xml}} as described here: 
[http://tika.apache.org/1.0/parser_guide.html#Add_your_MIME-Type]
2. Add a redefinition of the message/rfc822 mime-type as follows:
{code:xml}
  <mime-type type="message/rfc822">
    <magic priority="50">
      <match value="Relay-Version:" type="string" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="string" offset="0:2000"/>
      <match value="From:" type="string" offset="0"/>
      <match value="Received:" type="string" offset="0:2000"/>
      <match value="Message-ID:" type="string" offset="0:2000"/>
      <match value="Date:" type="string" offset="0"/>
    </magic>
    <glob pattern="*.eml"/>
    <glob pattern="*.mime"/>
    <glob pattern="*.mht"/>
    <glob pattern="*.mhtml"/>
    <sub-class-of type="text/plain"/>
  </mime-type>
{code}

Note the offset settings for *Message-ID:*, *Return-Path:*, and *Received:*.
I decided to keep the fall-back to extension detection by defining the 
super-class {{text/plain}}.

Hope this helps you work around this issue too.

Good luck,
vladimir

  was (Author: vladimir_l):
I figured out where the root of the problem lies.
The original Tika configuration for message/rfc822 is the following:
{code:xml}
  <mime-type type="message/rfc822">
    <magic priority="50">
      <match value="Relay-Version:" type="string" offset="0"/>
      <match value="#!\ rnews" type="string" offset="0"/>
      <match value="N#!\ rnews" type="string" offset="0"/>
      <match value="Forward\ to" type="string" offset="0"/>
      <match value="Pipe\ to" type="string" offset="0"/>
      <match value="Return-Path:" type="string" offset="0"/>
      <match value="From:" type="string" offset="0"/>
      <match value="Received:" type="string" offset="0"/>
      <match type="string" value="Message-ID:" offset="0"/>
      <match type="string" value="Date:" offset="0"/>
    </magic>
    <glob pattern="*.eml"/>
    <glob pattern="*.mime"/>
    <glob pattern="*.mht"/>
    <glob pattern="*.mhtml"/>
  </mime-type>
{code}

Unfortunately, the e-mail message I'm testing with has the following header:
{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: