[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606335#comment-14606335 ]

Chris A. Mattmann commented on TIKA-879:

This is still open, and we have a patch ready and available for TIKA-1602. I am going to commit that patch (once it is updated with my comments from GitHub). Waiting for a more general solution is great, but if we have a patch that works at least in limited cases, my preference is to include that contribution and then improve later.

> Detection problem: message/rfc822 file is detected as text/plain.
>
> Key: TIKA-879
> URL: https://issues.apache.org/jira/browse/TIKA-879
> Project: Tika
> Issue Type: Bug
> Components: metadata, mime
> Affects Versions: 1.0, 1.1, 1.2
> Environment: linux 3.2.9
>              oracle jdk7, openjdk7, sun jdk6
> Reporter: Konstantin Gribov
> Labels: new-parser
> Attachments: TIKA-879-thunderbird.eml, mime_diffs_A_to_B.html
>
> When using {{DefaultDetector}}, the mime type for {{.eml}} files is different (you can test it on {{testRFC822}} and {{testRFC822_base64}} in {{tika-parsers/src/test/resources/test-documents/}}).
> The main reason for this behavior is that only the magic detector really works for such files, even if you set {{CONTENT_TYPE}} in the metadata or an {{.eml}} file name in {{RESOURCE_NAME_KEY}}.
> As I found, {{MediaTypeRegistry.isSpecializationOf("message/rfc822", "text/plain")}} returns {{false}}, so detection by {{MimeTypes.detect(...)}} works only by magic.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505367#comment-14505367 ] Luis Filipe Nassif commented on TIKA-879: Yes, thank you very much for testing with govdocs1 ([~gagravarr]'s suggestion)!
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505269#comment-14505269 ] Tim Allison commented on TIKA-879: Y, will do. Results probably tomorrow.
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505132#comment-14505132 ] Luis Filipe Nassif commented on TIKA-879: Maybe we could keep the original magics and ADD the widened versions with a "\n" prefix to decrease the number of false positives (I have got a small number of them)? Could you try the widened magics with govdocs1 [~talli...@mitre.org]?
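The idea in the comment above (keep the offset-0 magics, and add widened ones that only match at the start of a line) can be illustrated with a minimal plain-Java sketch. This is not Tika code: the class name, both method names, and the sample strings are hypothetical, and the 1000-byte limit is simply the range mentioned elsewhere in this thread.

```java
/** Hypothetical illustration only; none of these names exist in Tika. */
final class LineAnchoredMagicSketch {

    /** Widened match without an anchor: the header name may appear anywhere up to maxOffset. */
    static boolean naiveWidened(String text, String header, int maxOffset) {
        int idx = text.indexOf(header);
        return idx >= 0 && idx <= maxOffset;
    }

    /** Original rule (header at offset 0) plus a widened rule that requires a preceding "\n". */
    static boolean looksLikeMail(String text, String header, int maxOffset) {
        boolean atStart = text.startsWith(header);
        int idx = text.indexOf("\n" + header);
        boolean atLineStart = idx >= 0 && idx <= maxOffset;
        return atStart || atLineStart;
    }

    public static void main(String[] args) {
        // Ordinary prose that merely mentions a header name in the middle of a line.
        String prose = "Notes on mail headers. The line Received: shows each hop.\n";
        // A message whose first header is not one of the magic patterns.
        String mail = "X-Flags: 0001\nReceived: from mailout\n";

        System.out.println(naiveWidened(prose, "Received:", 1000));  // prints true (a false positive)
        System.out.println(looksLikeMail(prose, "Received:", 1000)); // prints false
        System.out.println(looksLikeMail(mail, "Received:", 1000));  // prints true
    }
}
```

The newline prefix only helps with mid-line mentions; a text file that quotes a full header block would still match, which is consistent with the small number of false positives reported above.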
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501468#comment-14501468 ] Chris A. Mattmann commented on TIKA-879: So [~jeremybmerrill] proposed https://github.com/apache/tika/pull/40 as a solution to this. Nick, Konstantin - can you guys take a look and see if we can figure out how to get that included, since we now have code that fixes this?
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342738#comment-14342738 ] Nick Burch commented on TIKA-879: It might be good to try the widened versions with Tika Batch, to see whether they cause any noticeable slowdown or false positives on a wide range of files. I still think this isn't a file format that can be fully reliably detected with mime magic alone, and ideally we do need a dedicated detector for it, as mentioned above, to fully solve this and related (e.g. multipart/signed) detection.
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14342670#comment-14342670 ] Tyler Palsulich commented on TIKA-879: [~lfcnassif], that seems like a reasonable solution. [~gagravarr], any objections to widening the range of the offset for magic detection?
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258264#comment-14258264 ] Luis Filipe Nassif commented on TIKA-879: Nick, could we add the extended offsets proposed by Wladimir? I am using a 0:1000 range with the same patterns as him. It helped a lot in detecting mails without an extension, and I am getting no false positives.
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257953#comment-14257953 ] Hudson commented on TIKA-879: SUCCESS: Integrated in tika-trunk-jdk1.7 #388 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/388/]) Missing test file from TIKA-879 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647725) * /tika/trunk/tika-parsers/src/test/resources/test-documents/testThunderbirdEml.eml
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257927#comment-14257927 ] Hudson commented on TIKA-879: SUCCESS: Integrated in tika-trunk-jdk1.6 #372 (See [https://builds.apache.org/job/tika-trunk-jdk1.6/372/]) Missing test file from TIKA-879 (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647725) * /tika/trunk/tika-parsers/src/test/resources/test-documents/testThunderbirdEml.eml TIKA-879 Add a new parent mime type, for the text based message formats, of text/x-tika-text-based-message, which allows Thunderbird messages to be correctly detected as they now show up as being text based not binary based in the hierarchy (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647721) * /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257918#comment-14257918 ] Hudson commented on TIKA-879: UNSTABLE: Integrated in tika-trunk-jdk1.7 #387 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/387/]) TIKA-879 Add a new parent mime type, for the text based message formats, of text/x-tika-text-based-message, which allows Thunderbird messages to be correctly detected as they now show up as being text based not binary based in the hierarchy (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647721) * /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14257888#comment-14257888 ] Nick Burch commented on TIKA-879: I've done something a little different in r1647721 as a partial workaround - I've added a new mimetype of text/x-tika-text-based-message which is a parent of the 3 text-based message/ mimetypes (most of the message/ mimetypes are not text based). With that in place, Vladimir's test email is now correctly detected when it has a .eml extension. (I think that this parent probably is more semantically meaningful than text/plain, which is why I went for it) However, this doesn't solve the "detection without filename" issue, and some other related mail detection problems (eg multipart/signed). We might therefore want to think about adding a new mail detector, along the lines of the one suggested in this Stack Overflow question from a few weeks back - http://stackoverflow.com/questions/27397807/tika-detect-multipart-signed
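Why a new parent type helps can be shown with a simplified model of the behaviour the reporter describes: the name-based guess is only preferred over the magic-based guess when it is a specialization of it. The sketch below is plain Java with hypothetical names, not Tika's implementation. It also assumes that text/x-tika-text-based-message itself descends from text/plain, which the comment above implies but does not state.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of a media type hierarchy; not Tika's MediaTypeRegistry. */
final class TypeHierarchySketch {
    private final Map<String, String> parentOf = new HashMap<>();

    void setParent(String child, String parent) {
        parentOf.put(child, parent);
    }

    /** True if 'ancestor' is reachable by walking up the parent chain from 'type'. */
    boolean isSpecializationOf(String type, String ancestor) {
        String current = parentOf.get(type);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    /** Prefer the name-based guess only when it refines the magic-based guess. */
    String choose(String byMagic, String byName) {
        return isSpecializationOf(byName, byMagic) ? byName : byMagic;
    }

    public static void main(String[] args) {
        // Before the change: message/rfc822 has no text-based ancestor in this model.
        TypeHierarchySketch before = new TypeHierarchySketch();
        System.out.println(before.choose("text/plain", "message/rfc822")); // prints text/plain

        // After the change: message/rfc822 hangs under the new text-based parent.
        TypeHierarchySketch after = new TypeHierarchySketch();
        after.setParent("message/rfc822", "text/x-tika-text-based-message");
        after.setParent("text/x-tika-text-based-message", "text/plain"); // assumption, see lead-in
        System.out.println(after.choose("text/plain", "message/rfc822")); // prints message/rfc822
    }
}
```

In this model the fix only applies when a file name is available to produce the name-based guess, which matches the remaining "detection without filename" gap noted in the comment.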
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14256984#comment-14256984 ] Luis Filipe Nassif commented on TIKA-879: I have had this issue too with a lot of files and currently I am using the same workaround as [~vladimir_l].
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502393#comment-13502393 ]

Vladimir L. commented on TIKA-879:

I figured out where the root of the problem lies. The original Tika configuration for "message/rfc822" is the following:

{code:xml} {code}

Unfortunately, the e-mail message I'm testing with has the following header:

{code}
x-store-info:J++/JTCzmObr++wNraA4 .
Authentication-Results: something.com; sender-id= ..
X-SID-PRA: vladimi...@example.com
X-SID-Result: Pass
X-DKIM-Result: None
X-AUTH-Result: PASS
X-Message-Status: n:n
X-Message-Delivery: Vj0xLjE7dXM .
X-Message-Info: aKlYzGSc+Ll01bU5
Received: from mailout-
Received: (qmail invoked by alias); 21 Nov 2012 20:11:35 -
Received: from mp017.
X-Authenticated: #2407
X-Provags-ID: V01U2FsdGVkX
Received: (qmail 22194 invoked by uid 0); 21 Nov 2012 20:11:34 -
Received: from
Content-Type: text/plain; charset="utf-8"
Date: Wed, 21 Nov 2012 21:11:32 +0100
From: "Vladimir L."
Message-ID: <20121121201132.74...@example.com>
MIME-Version: 1.0
Subject: JUnit test message
To: vladimi...@something.com
X-Flags: 0001
X-Mailer: WWW-Mail 6100 (Global Message Exchange)
X-Priority: 3
Content-Transfer-Encoding: 8bit
Return-Path: vladimi...@example.com
X-OriginalArrivalTime: 21 Nov 2012 20:11:36.0285

Dear Vladimir .
{code}

As you can see, none of the mentioned patterns match, since they are all configured with offset="0". However, the e-mail header defines Content-Type: text/plain, which I assume influences the initial content type detection. The {{}} is not defined in the mime-type definition, so auto-detection via the *.eml extension fails for the reason given in this issue.

My current workaround is the following:

1. Create {{custom-mimetypes.xml}} as described here: [http://tika.apache.org/1.0/parser_guide.html#Add_your_MIME-Type]
2. Add a redefinition for the "message/rfc822" mime-type as follows:

{code:xml} {code}

Note the offset settings for *Message-ID:*, *Return-Path:*, and *Received:*. I decided to keep the fall-back to extension detection by defining the super-class {{text/plain}}.

I hope this helps to work around the issue.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
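The effect of the offset setting described above can be sketched in plain Java. This is an illustration under stated assumptions, not Tika's magic matcher: the class and method names are hypothetical, the sample header is an abridged version of the one quoted in the comment, and the 0:1000 range is the value reported later in this thread rather than the value from the stripped XML.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: a magic pattern with an inclusive offset range. */
final class MagicOffsetSketch {

    /** Returns true if 'pattern' starts at some offset in [from, to] of 'data'. */
    static boolean matches(byte[] data, String pattern, int from, int to) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        // Never read past the end of the data.
        int last = Math.min(to, data.length - p.length);
        for (int start = from; start <= last; start++) {
            int i = 0;
            while (i < p.length && data[start + i] == p[i]) {
                i++;
            }
            if (i == p.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Abridged sample: the first header is not one of the magic patterns.
        byte[] mail = ("x-store-info:J++/JTCzmObr++wNraA4\n"
                + "X-SID-Result: Pass\n"
                + "Received: (qmail invoked by alias); 21 Nov 2012 20:11:35 -\n"
                + "Subject: JUnit test message\n").getBytes(StandardCharsets.US_ASCII);

        System.out.println(matches(mail, "Received:", 0, 0));    // prints false: pinned to offset 0
        System.out.println(matches(mail, "Received:", 0, 1000)); // prints true: widened range
    }
}
```

A pattern pinned to offset 0 only fires when the file begins with that exact header, so any message whose first header is something else falls through to the text/plain result described in this issue.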
[jira] [Commented] (TIKA-879) Detection problem: message/rfc822 file is detected as text/plain.
[ https://issues.apache.org/jira/browse/TIKA-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13502287#comment-13502287 ] Vladimir L. commented on TIKA-879: I've got the same problem: when I try to parse an Outlook Express *.eml file, I get the default "text/plain" Content-Type instead of the expected "message/rfc822". I'm using org.apache.tika.parser.AutoDetectParser with default settings, and during debugging I came to the same conclusion as the reporter of this issue: "MediaTypeRegistry.isSpecializationOf("message/rfc822", "text/plain") returns false". If there is a way to vote for this bug to be fixed, or an easy workaround, please share it with us!