Patching fix for Tika-521 on Tika 0.8
Hi,

Is it possible to patch the fix for Tika-521 to Tika 0.8 without upgrading to POI 3.8? There is TikaExcelEventBasedExtraction.diff attached to the Jira case, can this be used to resolve the issue? We are tied down to POI 3.7 and cannot move to POI 3.8 due to compatibility issues with other code.

Thank you.

Regards,
Kumar
[jira] [Created] (TIKA-1028) Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.
Juha Haaga created TIKA-1028:

Summary: Tika-server quits parsing of rfc-822 document prematurely when it encounters encrypted zip file as attachment.
Key: TIKA-1028
URL: https://issues.apache.org/jira/browse/TIKA-1028
Project: Tika
Issue Type: Bug
Components: mime, parser, server
Affects Versions: 1.2, 1.3
Reporter: Juha Haaga

The Zip parser in tika-server does not allow passing in a password for decrypting the zip file, and it doesn't handle the unsupported feature gracefully. The problem occurs when a zip file is an attached part of the email document being parsed: the parser gives up and throws an exception:

WARNING: all: Unpacker failed
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pkg.PackageParser@10fcc945
Caused by: org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException: unsupported feature encryption used in entry

Instead of returning the successfully parsed components, Tika-server returns nothing. It would be better to return the rest of the parsed document contents along with the untouched offending zip file in the archive that Tika-server returns as a result. Until zip file decryption is added, this would always return the untouched zip file; after it is implemented, it should return the untouched zip file in the cases where a wrong password was provided.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
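The behaviour requested in this report (keep what parsed, hand back the offending attachment untouched) can be pictured with a small sketch. This is only an illustration under stated assumptions: `PartParser`, `PartResult` and `unpackAll` are hypothetical names invented here, not Tika or tika-server APIs, and the "encrypted" check in `main` is a stand-in for a real parser failure.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class GracefulUnpackSketch {
    /** Hypothetical stand-in for whatever parses one attachment; not a Tika interface. */
    interface PartParser {
        String parse(byte[] part) throws Exception;
    }

    /** Outcome for one attachment: extracted text, or the untouched bytes on failure. */
    static final class PartResult {
        final String text; // null when parsing failed
        final byte[] raw;  // null when parsing succeeded

        PartResult(String text, byte[] raw) {
            this.text = text;
            this.raw = raw;
        }

        boolean parsed() {
            return text != null;
        }
    }

    /** Parses every part; a failure in one part does not discard the others. */
    static List<PartResult> unpackAll(List<byte[]> parts, PartParser parser) {
        List<PartResult> results = new ArrayList<>();
        for (byte[] part : parts) {
            try {
                results.add(new PartResult(parser.parse(part), null));
            } catch (Exception e) {
                // Keep the original bytes so the caller still receives the attachment.
                results.add(new PartResult(null, part));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in parser: pretends that any part starting with 'P' is an encrypted zip.
        PartParser parser = part -> {
            if (part.length > 0 && part[0] == 'P') {
                throw new UnsupportedOperationException("encrypted entry");
            }
            return new String(part, StandardCharsets.US_ASCII);
        };
        List<byte[]> parts = new ArrayList<>();
        parts.add("mail body".getBytes(StandardCharsets.US_ASCII));
        parts.add("PK-encrypted".getBytes(StandardCharsets.US_ASCII));
        for (PartResult r : unpackAll(parts, parser)) {
            System.out.println(r.parsed() ? "parsed: " + r.text : "kept raw bytes: " + r.raw.length);
        }
    }
}
```

The design choice is simply to isolate the failure per attachment instead of letting one exception abort the whole message.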
Re: Patching fix for Tika-521 on Tika 0.8
On Wed, 21 Nov 2012, Jana, Kumar Raja wrote:
> Is it possible to patch the fix for Tika-521 to Tika 0.8 without upgrading to POI 3.8?

Tika 0.8 is fairly old; there have been lots of new features and bug fixes since then. Ditto POI 3.7.

> There is TikaExcelEventBasedExtraction.diff attached to the Jira case, can this be used to resolve the issue?

The patch won't work against 3.7, as it needs new features that are only in 3.8. You might be able to do some work and apply it to 3.7 with a few bits of functionality missing (e.g. protection info).

> We are tied down to POI 3.7 and cannot move to POI 3.8 due to compatibility issues with other code.

POI 3.9 should be out shortly. I'd suggest you try with that and report any problems / regressions so they can be fixed before the final release, then go with that! If you look at the change list http://poi.apache.org/changes.html you'll see how many bug fixes there have been!

Nick
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502287#comment-13502287 ]

Vladimir L. commented on TIKA-879:
----------------------------------

I've got the same problem: when trying to parse an Outlook Express file (*.eml), I get the default text/plain Content-Type instead of the expected message/rfc822.

I'm using <pre>org.apache.tika.parser.AutoDetectParser</pre> with default settings, and during debugging I came to the same conclusion as the reporter of this issue: MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain) returns false.

If there is a way to vote for this bug to be fixed, or an easy workaround, please share it with us!

Detection problem: message/rfc822 file is detected as text/plain.
-----------------------------------------------------------------

Key: TIKA-879
URL: https://issues.apache.org/jira/browse/TIKA-879
Project: Tika
Issue Type: Bug
Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
Environment: linux 3.2.9 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov

When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}). The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}. As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.
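Why a specialization check like the one mentioned above can return false may be easier to see with a toy model. The sketch below is an assumption-laden illustration, not Tika's `MediaTypeRegistry`: the class `TypeHierarchySketch` and its methods are invented here, and the real registry has additional rules. It only shows that, when the parent relation is a declared one, an undeclared relation is never found.

```java
import java.util.HashMap;
import java.util.Map;

class TypeHierarchySketch {
    // child type -> declared parent type (what a sub-class-of declaration would record)
    private final Map<String, String> parents = new HashMap<>();

    void declareSubClassOf(String child, String parent) {
        parents.put(child, parent);
    }

    /** True if 'type' reaches 'ancestor' by following declared parents. */
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parents.get(type);
        int steps = 0;
        while (current != null && steps++ < 100) { // step limit guards against accidental cycles
            if (current.equals(ancestor)) {
                return true;
            }
            current = parents.get(current);
        }
        return false;
    }

    public static void main(String[] args) {
        TypeHierarchySketch registry = new TypeHierarchySketch();
        // No relation declared yet, so the check fails.
        System.out.println(registry.isSpecializationOf("message/rfc822", "text/plain")); // prints false
        registry.declareSubClassOf("message/rfc822", "text/plain");
        System.out.println(registry.isSpecializationOf("message/rfc822", "text/plain")); // prints true
    }
}
```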
[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502287#comment-13502287 ]

Vladimir L. edited comment on TIKA-879 at 11/21/12 8:33 PM:
------------------------------------------------------------

I've got the same problem: when trying to parse an Outlook Express file (*.eml), I get the default text/plain Content-Type instead of the expected message/rfc822.

I'm using {{org.apache.tika.parser.AutoDetectParser}} with default settings, and during debugging I came to the same conclusion as the reporter of this issue: MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain) returns false.

If there is a way to vote for this bug to be fixed, or an easy workaround, please share it with us!

was (Author: vladimir_l):

I've got the same problem: when trying to parse an Outlook Express file (*.eml), I get the default text/plain Content-Type instead of the expected message/rfc822.

I'm using <pre>org.apache.tika.parser.AutoDetectParser</pre> with default settings, and during debugging I came to the same conclusion as the reporter of this issue: MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain) returns false.

If there is a way to vote for this bug to be fixed, or an easy workaround, please share it with us!

Detection problem: message/rfc822 file is detected as text/plain.
-----------------------------------------------------------------

Key: TIKA-879
URL: https://issues.apache.org/jira/browse/TIKA-879
Project: Tika
Issue Type: Bug
Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
Environment: linux 3.2.9 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov

When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}). The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}. As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502393#comment-13502393 ]

Vladimir L. commented on TIKA-879:
----------------------------------

I figured out where the root of the problem lies. The original Tika configuration for message/rfc822 is the following:

{code:xml}
<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Relay-Version:" type="string" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="string" offset="0"/>
    <match value="From:" type="string" offset="0"/>
    <match value="Received:" type="string" offset="0"/>
    <match type="string" value="Message-ID:" offset="0"/>
    <match type="string" value="Date:" offset="0"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <glob pattern="*.mht"/>
  <glob pattern="*.mhtml"/>
</mime-type>
{code}

Unfortunately, the e-mail message I'm testing with has the following header:

{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: n:n
X-Message-Delivery: Vj0xLjE7dXM .
X-Message-Info: aKlYzGSc+Ll01bU5
Received: from mailout-
Received: (qmail invoked by alias); 21 Nov 2012 20:11:35 -
Received: from mp017.
X-Authenticated: #2407
X-Provags-ID: V01U2FsdGVkX
Received: (qmail 22194 invoked by uid 0); 21 Nov 2012 20:11:34 -
Received: from
Content-Type: text/plain; charset=utf-8
Date: Wed, 21 Nov 2012 21:11:32 +0100
From: Vladimir L. vladimi...@example.com
Message-ID: 20121121201132.74...@example.com
MIME-Version: 1.0
Subject: JUnit test message
To: vladimi...@something.com
X-Flags: 0001
X-Mailer: WWW-Mail 6100 (Global Message Exchange)
X-Priority: 3
Content-Transfer-Encoding: 8bit
Return-Path: vladimi...@example.com
X-OriginalArrivalTime: 21 Nov 2012 20:11:36.0285

Dear Vladimir .
{code}

As you can see, none of the listed patterns matches, since they are all configured with offset=0 and the message starts with a different header. However, the e-mail header defines Content-Type: text/plain, which I assume influences the initial content type detection. {{<sub-class-of type="text/plain"/>}} is not defined in the mime-type definition, so auto-detection via the *.eml extension fails for the reason given in this issue.

My current workaround is the following:

1. Create {{custom-mimetypes.xml}} as described here: [http://tika.apache.org/1.0/parser_guide.html#Add_your_MIME-Type]
2. Add a redefinition of the message/rfc822 mime-type as follows:

{code:xml}
<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Relay-Version:" type="string" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="string" offset="0:2000"/>
    <match value="From:" type="string" offset="0"/>
    <match value="Received:" type="string" offset="0:2000"/>
    <match value="Message-ID:" type="string" offset="0:2000"/>
    <match value="Date:" type="string" offset="0"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <glob pattern="*.mht"/>
  <glob pattern="*.mhtml"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}

Note the offset settings for *Message-ID:*, *Return-Path:*, and *Received:*. I decided to keep the fall-back to extension detection by defining the super-class {{text/plain}}.

Hope this helps to work around this issue.

Detection problem: message/rfc822 file is detected as text/plain.
-----------------------------------------------------------------

Key: TIKA-879
URL: https://issues.apache.org/jira/browse/TIKA-879
Project: Tika
Issue Type: Bug
Components: metadata, mime
Affects Versions: 1.0, 1.1, 1.2
Environment: linux 3.2.9 oracle jdk7, openjdk7, sun jdk6
Reporter: Konstantin Gribov

When using {{DefaultDetector}}, the mime type detected for {{.eml}} files differs (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}). The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}. As I found, {{MediaTypeRegistry.isSpecializationOf(message/rfc822, text/plain)}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.
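The difference between offset=0 and offset=0:2000 in the comment above can be sketched in a few lines of plain Java. This is a simplified model written for illustration, assuming that a range means "the pattern may start anywhere between the two offsets"; `MagicOffsetSketch` and `matches` are names made up here and are not Tika's matching code.

```java
import java.nio.charset.StandardCharsets;

class MagicOffsetSketch {
    /** True if 'pattern' starts at some position in [start, end] (inclusive) of 'data'. */
    static boolean matches(byte[] data, String pattern, int start, int end) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        // Never read past the end of the data.
        int last = Math.min(end, data.length - p.length);
        for (int i = start; i <= last; i++) {
            boolean found = true;
            for (int j = 0; j < p.length; j++) {
                if (data[i + j] != p[j]) {
                    found = false;
                    break;
                }
            }
            if (found) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Shortened stand-in for a message whose first header is not one of the magic strings.
        String message = "x-store-info: abc\r\nX-SID-Result: Pass\r\n"
                + "Received: from mailout\r\n\r\nDear Vladimir";
        byte[] data = message.getBytes(StandardCharsets.US_ASCII);

        System.out.println(matches(data, "Received:", 0, 0));    // prints false
        System.out.println(matches(data, "Received:", 0, 2000)); // prints true
    }
}
```

With a fixed offset of 0 the header must be the very first bytes of the file; widening the range lets a header that appears a few lines down still count as a match.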
[jira] [Comment Edited] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502393#comment-13502393 ]

Vladimir L. edited comment on TIKA-879 at 11/21/12 10:57 PM:
-------------------------------------------------------------

I figured out where the root of the problem lies. The original Tika configuration for message/rfc822 is the following:

{code:xml}
<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Relay-Version:" type="string" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="string" offset="0"/>
    <match value="From:" type="string" offset="0"/>
    <match value="Received:" type="string" offset="0"/>
    <match type="string" value="Message-ID:" offset="0"/>
    <match type="string" value="Date:" offset="0"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <glob pattern="*.mht"/>
  <glob pattern="*.mhtml"/>
</mime-type>
{code}

Unfortunately, the e-mail message I'm testing with has the following header:

{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: n:n
X-Message-Delivery: Vj0xLjE7dXM .
X-Message-Info: aKlYzGSc+Ll01bU5
Received: from mailout-
Received: (qmail invoked by alias); 21 Nov 2012 20:11:35 -
Received: from mp017.
X-Authenticated: #2407
X-Provags-ID: V01U2FsdGVkX
Received: (qmail 22194 invoked by uid 0); 21 Nov 2012 20:11:34 -
Received: from
Content-Type: text/plain; charset=utf-8
Date: Wed, 21 Nov 2012 21:11:32 +0100
From: Vladimir L. vladimi...@example.com
Message-ID: 20121121201132.74...@example.com
MIME-Version: 1.0
Subject: JUnit test message
To: vladimi...@something.com
X-Flags: 0001
X-Mailer: WWW-Mail 6100 (Global Message Exchange)
X-Priority: 3
Content-Transfer-Encoding: 8bit
Return-Path: vladimi...@example.com
X-OriginalArrivalTime: 21 Nov 2012 20:11:36.0285

Dear Vladimir .
{code}

As you can see, none of the listed patterns matches, since they are all configured with offset=0 and the message starts with a different header. However, the e-mail header defines Content-Type: text/plain, which I assume influences the initial content type detection. {{<sub-class-of type="text/plain"/>}} is not defined in the mime-type definition, so auto-detection via the *.eml extension fails for the reason given in this issue.

My current workaround is the following:

1. Create {{custom-mimetypes.xml}} as described here: [http://tika.apache.org/1.0/parser_guide.html#Add_your_MIME-Type]
2. Add a redefinition of the message/rfc822 mime-type as follows:

{code:xml}
<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Relay-Version:" type="string" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="string" offset="0:2000"/>
    <match value="From:" type="string" offset="0"/>
    <match value="Received:" type="string" offset="0:2000"/>
    <match value="Message-ID:" type="string" offset="0:2000"/>
    <match value="Date:" type="string" offset="0"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <glob pattern="*.mht"/>
  <glob pattern="*.mhtml"/>
  <sub-class-of type="text/plain"/>
</mime-type>
{code}

Note the offset settings for *Message-ID:*, *Return-Path:*, and *Received:*. I decided to keep the fall-back to extension detection by defining the super-class {{text/plain}}.

Hope this will help you to work around this issue too.

Good luck,
vladimir

was (Author: vladimir_l):

I figured out where the root of the problem lies.
The original Tika configuration for message/rfc822 is following:

{code:xml}
<mime-type type="message/rfc822">
  <magic priority="50">
    <match value="Relay-Version:" type="string" offset="0"/>
    <match value="#!\ rnews" type="string" offset="0"/>
    <match value="N#!\ rnews" type="string" offset="0"/>
    <match value="Forward\ to" type="string" offset="0"/>
    <match value="Pipe\ to" type="string" offset="0"/>
    <match value="Return-Path:" type="string" offset="0"/>
    <match value="From:" type="string" offset="0"/>
    <match value="Received:" type="string" offset="0"/>
    <match type="string" value="Message-ID:" offset="0"/>
    <match type="string" value="Date:" offset="0"/>
  </magic>
  <glob pattern="*.eml"/>
  <glob pattern="*.mime"/>
  <glob pattern="*.mht"/>
  <glob pattern="*.mhtml"/>
</mime-type>
{code}

Unfortunately the e-mail message I'm testing with has following header:

{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: