[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683135#comment-17683135 ]

Tim Allison commented on TIKA-2680:
---

I _think_ we just fixed this on TIKA-3962. Please see the attached output and let me know.

> Email attachments to an email are not extracted
> ---
>
>                 Key: TIKA-2680
>                 URL: https://issues.apache.org/jira/browse/TIKA-2680
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>            Reporter: Yury Kats
>            Assignee: Tim Allison
>            Priority: Major
>         Attachments: TIKA-2680-1.eml-2.7.0-prerc1.json, main_email_in_outlook.jpg, nested.eml, pseudo-xml.xml
>
>
> I have a number of email messages that contain other email messages as attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
> * Subject: Test email within email
> ** Subject: Email within email test
> *** Subject: Stand-up today
>
> I would like to see 3 separate emails parsed out (top level, 1st level attached email, 2nd level attached email), but I only get 1 email and 1 unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
>     {
>         "Author": "Smith Van der, H (Henry) ",
>         "Content-Length": "16649",
>         "Content-Type": "message/rfc822",
>         "Creation-Date": "2018-04-25T12:46:41Z",
>         "Message-From": "Smith Van der, H (Henry) ",
>         "Message-To": [
>             "fm.SAN Management Team ",
>             "Smith Van der, H (Henry) "
>         ],
>         "Message:From-Email": "henry.van.der.sm...@bank.com",
>         "Message:From-Name": "Smith Van der, H (Henry)",
>         "Message:Raw-Header:Auto-Submitted": "auto-generated",
>         "Message:Raw-Header:Content-Transfer-Encoding": "binary",
>         "Message:Raw-Header:Keywords": "",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Message-ID": "",
>         "Message:Raw-Header:Return-Path": "<>",
>         "Message:Raw-Header:Sender": "",
>         "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
>         "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
>         "Message:Raw-Header:X-MS-Journal-Report": "",
>         "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:parse_time_millis": "325",
>         "creator": "Smith Van der, H (Henry) ",
>         "dc:creator": "Smith Van der, H (Henry) ",
>         "dc:title": "Test email within email",
>         "dcterms:created": "2018-04-25T12:46:41Z",
>         "meta:author": "Smith Van der, H (Henry) ",
>         "meta:creation-date": "2018-04-25T12:46:41Z",
>         "resourceName": "nested.eml",
>         "subject": "Test email within email"
>     },
>     {
>         "Content-Encoding": "US-ASCII",
>         "Content-Type": "text/plain; charset=US-ASCII",
>         "Multipart-Boundary": "_004_8075737674787666767166806676697476787366657271727266777_",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.txt.TXTParser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "5",
>         "embeddedResourceType": "ATTACHMENT"
>     }
> ]
> {noformat}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
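To make the expected result in the issue description concrete, here is a minimal sketch in Python (standard-library `email` package only, not Tika or mime4j). It builds a synthetic three-level message with the same subjects as the report, since `nested.eml` itself is not reproduced here, and lists every email found together with its nesting depth. The helper names `build_nested` and `list_emails` and the sender address are made up for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_nested(subjects):
    """Build a chain of emails attached to each other, outermost subject first."""
    inner = None
    for subject in reversed(subjects):
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = "sender@example.com"  # placeholder address
        msg.set_content("body of " + subject)
        if inner is not None:
            # Attaching an EmailMessage produces a message/rfc822 part.
            msg.add_attachment(inner)
        inner = msg
    return inner


def _collect(part, depth, found):
    """Scan the parts of one email; recurse into attached emails one level deeper."""
    for child in part.iter_parts():
        if child.get_content_type() == "message/rfc822":
            found.extend(list_emails(child.get_payload(0), depth + 1))
        else:
            _collect(child, depth, found)


def list_emails(msg, depth=0):
    """Return (depth, subject) for msg and for every email attached inside it."""
    found = [(depth, str(msg["Subject"]))]
    _collect(msg, depth, found)
    return found


subjects = ["Test email within email", "Email within email test", "Stand-up today"]
raw = build_nested(subjects).as_bytes()
parsed = message_from_bytes(raw, policy=policy.default)
for depth, subject in list_emails(parsed):
    print("*" * (depth + 1), "Subject:", subject)
```

The point of the sketch is only the shape of the desired output: three emails at depths 0, 1 and 2, rather than one email plus one unnamed text attachment.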
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378092#comment-17378092 ]

Tim Allison commented on TIKA-2680:
---

I hacked out some pseudo XML (e.g., I didn't escape the body content) to show the structure of what we're getting from the latest version of James mime4j. It _looks_ like we should be able to handle this at the Tika level.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377551#comment-17377551 ]

Tim Allison commented on TIKA-2680:
---

I haven't looked at this in 3 years. I think {{james}} has been updated since then. It clearly didn't automatically fix this problem. I can take a look, though, to see if we have better luck with our handling.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376884#comment-17376884 ]

Abha commented on TIKA-2680:
---

Is there any update on this issue? I still have the same issue for version 1.26
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605737#comment-16605737 ]

Tim Allison commented on TIKA-2680:
---

[~grossws], would you be able to check whether your attachment code works with {{nested.eml}}? If it does, any chance you'd be able to share it? I had something working for most of the child attachments, but it failed on the most deeply embedded file.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602361#comment-16602361 ]

Konstantin Gribov commented on TIKA-2680:
---

Just my 2c: I stopped using Tika for RFC822 parsing sometime in 2012-2013. Since then I've been using mime4j directly for RFC822 and delegating attachment parsing to Tika. But in my case I know beforehand what I'll be parsing (normal files, plain emls, emls with external metadata from a DLP system, or MSE journaled emls), so I can parse each with a specific parser.

Of course, I have to track whether I'm parsing an attachment (set/reset a flag in the field handler depending on whether {{Content-Disposition}} is found with or without it, and reset the flag in {{startBodyPart}}) and the current depth in the multipart tree.
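The bookkeeping Konstantin Gribov describes (an "am I in an attachment" flag plus the current multipart depth) can be sketched as follows. This is an illustration in Python using the standard-library `email` package, not mime4j code: the class `AttachmentTracker`, its callback names, and the `drive` function are hypothetical stand-ins loosely modelled on the callbacks he mentions, and reading "with/without it" as "disposition is or is not `attachment`" is an assumption.

```python
from email.message import EmailMessage


class AttachmentTracker:
    """Hypothetical event handler: tracks the attachment flag and multipart depth."""

    def __init__(self):
        self.in_attachment = False
        self.depth = 0
        self.bodies = []  # (depth, is_attachment, content_type) per leaf body

    def start_body_part(self):
        # Reset the flag at the start of every body part.
        self.in_attachment = False

    def field(self, name, value):
        # Set or clear the flag when a Content-Disposition header is seen.
        if name.lower() == "content-disposition":
            self.in_attachment = value.strip().lower().startswith("attachment")

    def start_multipart(self):
        self.depth += 1

    def end_multipart(self):
        self.depth -= 1

    def body(self, content_type):
        self.bodies.append((self.depth, self.in_attachment, content_type))


def drive(part, handler):
    """Replay an already-parsed message as handler events (stand-in for a streaming parser)."""
    for name, value in part.items():
        handler.field(name, str(value))
    if part.get_content_maintype() == "multipart":
        handler.start_multipart()
        for child in part.iter_parts():
            handler.start_body_part()
            drive(child, handler)
        handler.end_multipart()
    else:
        handler.body(part.get_content_type())


msg = EmailMessage()
msg["Subject"] = "report"
msg.set_content("see attached")
msg.add_attachment(b"%PDF-", maintype="application", subtype="pdf", filename="a.pdf")

tracker = AttachmentTracker()
drive(msg, tracker)
print(tracker.bodies)
```

For simplicity the sketch records a `message/rfc822` part as a leaf and does not descend into it; a real handler would also need to decide how depth is counted across embedded messages.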
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595204#comment-16595204 ]

Yury Kats commented on TIKA-2680:
---

Indeed, I don't think this is fixable w/o mime4j changes. See my comment in TIKA-2685 on what I did to make it work.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595108#comment-16595108 ]

Tim Allison commented on TIKA-2680:
---

Asked for help on the mime4j list: https://mail-archives.apache.org/mod_mbox/james-mime4j-dev/201808.mbox/%3CCAC1dCwUhrqM7rb0t7FzS8wm5T%2BX3T_cJSkBrj7gy9ocrydffqw%40mail.gmail.com%3E

I've tried a number of options, and I'm not having luck with mime4j as it is.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536924#comment-16536924 ]

Yury Kats commented on TIKA-2680:
---

I don't think this would be the right thing to do for journaled emails, where the first inner message/rfc822 is the email itself and the wrapper message provides additional headers for it. The top-level message does not have any value on its own.

It would be a very significant functionality change, likely to break clients that use Tika for email extraction. It will definitely break things in our environment.

If this is made conditional on the presence of the X-MS-Journal-Report header, or on a programmatic toggle (e.g., like setExtractAllAlternatives, which was added in 1.17), then it would be OK.
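The condition Yury Kats proposes can be sketched like this, again in Python with the standard-library `email` package rather than Tika code. The function `primary_message` and its `unwrap_journal` parameter are hypothetical names for illustration; they are not an existing Tika option. The wrapper is only unwrapped when the toggle is on and the X-MS-Journal-Report header is present (its value may be empty, as in the output quoted in the issue).

```python
from email.message import EmailMessage


def primary_message(msg, unwrap_journal=True):
    """Return the message to treat as 'the email'.

    For a journal wrapper (X-MS-Journal-Report present) with the hypothetical
    toggle enabled, that is the first inner message/rfc822; otherwise it is
    the top-level message itself.
    """
    if not (unwrap_journal and "X-MS-Journal-Report" in msg):
        return msg
    for part in msg.walk():
        if part is not msg and part.get_content_type() == "message/rfc822":
            return part.get_payload(0)  # the embedded email
    return msg  # journal header present, but nothing is embedded


inner = EmailMessage()
inner["Subject"] = "Stand-up today"
inner.set_content("the real email")

wrapper = EmailMessage()
wrapper["Subject"] = "journal report"
wrapper["X-MS-Journal-Report"] = ""  # empty value; presence is what matters
wrapper.set_content("envelope details")
wrapper.add_attachment(inner)

print(primary_message(wrapper)["Subject"])
print(primary_message(wrapper, unwrap_journal=False)["Subject"])
```

With a switch like this, clients that rely on the current journaled-email behaviour would keep it, while ordinary emails with attached emails could still report the top-level message as the primary one.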
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535412#comment-16535412 ]

Tim Allison commented on TIKA-2680:
---
Given that Outlook appears to treat this as an attachment, are you ok if we do the same?

!main_email_in_outlook.jpg!
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535351#comment-16535351 ]

Yury Kats commented on TIKA-2680:
-
Indeed, the first embedded rfc822 is not an attachment. I believe this is because it's an Exchange journaled email; note the presence of the X-MS-Journal-Report header at the very top.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535339#comment-16535339 ]

Tim Allison commented on TIKA-2680:
---
Something like this?

{noformat}
multipart/mixed (uses _728aa617-16cf-4d95-8bc2-9f1868397202_)
  text/plain (_728aa617-16cf-4d95-8bc2-9f1868397202_)
    (sender and some other headers, no real content "Message-Id: <0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>")
  message/rfc822 (_728aa617-16cf-4d95-8bc2-9f1868397202_) uses (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
    multipart/alternative (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
      text/plain (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
      text/html (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) ("cocacola Henry van der Smith")
    end _000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
    message/rfc822 (content-disposition: attachment) (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_004_8075737674787666767166806676697476787366657271727266777_)
      multipart/alternative (_004_8075737674787666767166806676697476787366657271727266777_) uses (_000_8075737674787666767166806676697476787366657271727266777_)
        text/plain (_000_8075737674787666767166806676697476787366657271727266777_)
        text/html (_000_8075737674787666767166806676697476787366657271727266777_) ("Cocacola test Henry van der Smith")
      end _000_8075737674787666767166806676697476787366657271727266777_
      message/rfc822 (content-disposition: attachment text/plain) (004_8075737674787666767166806676697476787366657271727266777)
        no multipart body, just plain text: ("I won't be able to attend")
  end _004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
end _728aa617-16cf-4d95-8bc2-9f1868397202_
{noformat}

As you point out...it is mildly odd (to me at least) that the first embedded rfc822 (the one that uses _004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom) does not have content-disposition: attachment.
[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted
[ https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528296#comment-16528296 ]

Yury Kats commented on TIKA-2680:
-
This appears to happen because mime4j (correctly) treats an attached email as a new message rather than as a part of the original email. MailContentHandler#body is not called for it; MailContentHandler#startMessage is called instead, so MailContentHandler does no recursive parsing or extraction, and the parts of the nested message are treated as parts of the original.