[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2023-02-01 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683135#comment-17683135
 ] 

Tim Allison commented on TIKA-2680:
---

I _think_ we just fixed this on TIKA-3962.  Please see the attached output and 
let me know.
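
One way to check the attached output is to count how many entries in the recursive JSON (the `-J` option of tika-app used in the issue description) are reported as `message/rfc822`. The following is a minimal sketch, assuming the output is a JSON array of metadata objects whose `Content-Type` is a plain string; the function names and the trimmed sample are illustrative only and not part of Tika.

```python
import json


def summarize_tika_json(text):
    """Return (embedded_path, content_type) pairs from recursive JSON output."""
    entries = json.loads(text)
    return [
        # The container document has no embedded path, so "/" is used as a stand-in.
        (entry.get("X-TIKA:embedded_resource_path", "/"), entry.get("Content-Type", ""))
        for entry in entries
    ]


def count_emails(text):
    """Count the entries that were recognized as emails."""
    return sum(
        1
        for _, content_type in summarize_tika_json(text)
        if content_type.startswith("message/rfc822")
    )


# Hypothetical sample, trimmed down from the two entries of the 1.18 output in the report.
sample = json.dumps([
    {"Content-Type": "message/rfc822", "resourceName": "nested.eml"},
    {
        "Content-Type": "text/plain; charset=US-ASCII",
        "X-TIKA:embedded_resource_path": "/embedded-1",
    },
])

print(summarize_tika_json(sample))
print(count_emails(sample))
```

For the reporter's `nested.eml`, a fixed version would presumably report three `message/rfc822` entries rather than one.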

> Email attachments to an email are not extracted
> ---
>
> Key: TIKA-2680
> URL: https://issues.apache.org/jira/browse/TIKA-2680
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.18
> Reporter: Yury Kats
> Assignee: Tim Allison
> Priority: Major
> Attachments: TIKA-2680-1.eml-2.7.0-prerc1.json, 
> main_email_in_outlook.jpg, nested.eml, pseudo-xml.xml
>
>
> I have a number of email messages that contain other email messages as 
> attachments (with multiple levels of nesting).
> The email attachments are parts with "Content-Type: message/rfc822" but are 
> not being recognized as such.
> Attached is an example email, with the multiple levels of attachments:
>  * Subject: Test email within email
>  ** Subject: Email within email test
>  *** Subject: Stand-up today
>  
> I would like to see 3 separate emails parsed out (top level, 1st level 
> attached email, 2nd level attached email), but I only get 1 email and 1 
> unnamed text attachment:
> {noformat}
> $ java -jar tika-app-1.18.jar -m -J nested.eml | python -m json.tool
> [
> {
> "Author": "Smith Van der, H (Henry) ",
> "Content-Length": "16649",
> "Content-Type": "message/rfc822",
> "Creation-Date": "2018-04-25T12:46:41Z",
> "Message-From": "Smith Van der, H (Henry) ",
> "Message-To": [
> "fm.SAN Management Team ",
> "Smith Van der, H (Henry) "
> ],
> "Message:From-Email": "henry.van.der.sm...@bank.com",
> "Message:From-Name": "Smith Van der, H (Henry)",
> "Message:Raw-Header:Auto-Submitted": "auto-generated",
> "Message:Raw-Header:Content-Transfer-Encoding": "binary",
> "Message:Raw-Header:Keywords": "",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Message-ID": 
> "",
> "Message:Raw-Header:Return-Path": "<>",
> "Message:Raw-Header:Sender": 
> "",
> "Message:Raw-Header:X-MS-Exchange-Generated-Message-Source": "Journal Agent",
> "Message:Raw-Header:X-MS-Exchange-Parent-Message-Id": 
> "<0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>",
> "Message:Raw-Header:X-MS-Journal-Report": "",
> "Multipart-Boundary": "_728aa617-16cf-4d95-8bc2-9f1868397202_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mail.RFC822Parser"
> ],
> "X-TIKA:parse_time_millis": "325",
> "creator": "Smith Van der, H (Henry) ",
> "dc:creator": "Smith Van der, H (Henry) ",
> "dc:title": "Test email within email",
> "dcterms:created": "2018-04-25T12:46:41Z",
> "meta:author": "Smith Van der, H (Henry) ",
> "meta:creation-date": "2018-04-25T12:46:41Z",
> "resourceName": "nested.eml",
> "subject": "Test email within email"
> },
> {
> "Content-Encoding": "US-ASCII",
> "Content-Type": "text/plain; charset=US-ASCII",
> "Multipart-Boundary": 
> "_004_8075737674787666767166806676697476787366657271727266777_",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.txt.TXTParser"
> ],
> "X-TIKA:embedded_resource_path": "/embedded-1",
> "X-TIKA:parse_time_millis": "5",
> "embeddedResourceType": "ATTACHMENT"
> }
> ]
> {noformat}
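
The structure the reporter expects (three emails, each nested inside the previous one as `message/rfc822`) can be illustrated outside of Tika. This is a small sketch using Python's standard `email` package rather than Tika or mime4j; the helper names are hypothetical, and the messages are synthetic stand-ins that only reuse the three subjects from the report.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_nested(subjects):
    """Build a message whose attachment is the next message, and so on."""
    inner = None
    for subject in reversed(subjects):
        msg = EmailMessage()
        msg["Subject"] = subject
        msg.set_content("body of " + subject)
        if inner is not None:
            # Attaching a message object produces a message/rfc822 part.
            msg.add_attachment(inner)
        inner = msg
    return inner


def nested_subjects(msg):
    """Return the subject of msg followed by those of all embedded emails."""
    subjects = [str(msg["Subject"])]
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            embedded = part.get_payload(0)  # the attached message itself
            subjects.append(str(embedded["Subject"]))
    return subjects


raw = build_nested(
    ["Test email within email", "Email within email test", "Stand-up today"]
).as_string()
parsed = message_from_string(raw, policy=policy.default)
print(nested_subjects(parsed))
```

The point of the sketch is only that each `message/rfc822` part carries a complete message with its own headers, which is what the report asks Tika to surface as separate entries.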



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2021-07-09 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17378092#comment-17378092
 ] 

Tim Allison commented on TIKA-2680:
---

I hacked out some pseudo xml (e.g. I didn't escape the body content) to show 
the structure of what we're getting from the latest version of james mime4j.

It _looks_ like we should be able to handle this at the Tika level.



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2021-07-08 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17377551#comment-17377551
 ] 

Tim Allison commented on TIKA-2680:
---

I haven't looked at this in 3 years.  I think {{james}} has been updated since 
then.  It clearly didn't automatically fix this problem.  I can take a look, 
though, to see if we have better luck with our handling.



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2021-07-07 Thread Abha (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376884#comment-17376884
 ] 

Abha commented on TIKA-2680:


Is there any update on this issue? I still see the same problem in version 1.26.



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-09-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605737#comment-16605737
 ] 

Tim Allison commented on TIKA-2680:
---

[~grossws], would you be able to check whether your attachment code works with 
{{nested.eml}}?  If it does, any chance you'd be able to share it?  I had 
something working for most of the child attachments, but it failed on the most 
deeply embedded file.



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-09-03 Thread Konstantin Gribov (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16602361#comment-16602361
 ] 

Konstantin Gribov commented on TIKA-2680:
-

Just my 2c: I stopped using Tika for RFC822 parsing sometime in 2012-2013. I 
now use mime4j directly for RFC822 and delegate attachment parsing to Tika. 
In my case, though, I know beforehand what I'll be parsing (normal files, plain 
emls, emls with external metadata from a DLP system, or MSE journaled emls), so 
I can pick a specific parser. Of course, I have to track whether I'm parsing an 
attachment (set/reset a flag in the field handler if {{Content-Disposition}} is 
found with/without it, and reset the flag in {{startBodyPart}}) and the current 
depth in the multipart tree.
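
The bookkeeping described above (an attachment flag plus the current depth in the multipart tree) can be sketched roughly as follows. This is not mime4j's event-based handler API; it is a recursive walk using Python's standard `email` package, and `RAW` and `describe` are hypothetical names introduced only for illustration.

```python
from email import message_from_string, policy

# Hypothetical two-level message: an outer email with an attached inner email.
RAW = """\
MIME-Version: 1.0
Subject: outer
Content-Type: multipart/mixed; boundary="b1"

--b1
Content-Type: text/plain

outer body
--b1
Content-Type: message/rfc822
Content-Disposition: attachment

Subject: inner
Content-Type: text/plain

inner body
--b1--
"""


def describe(part, depth=0, out=None):
    """Collect (depth, content_type, is_attachment) for every part in the tree."""
    out = [] if out is None else out
    # The flag is derived per part from its own Content-Disposition header.
    is_attachment = part.get_content_disposition() == "attachment"
    out.append((depth, part.get_content_type(), is_attachment))
    # is_multipart() is also true for message/rfc822, whose payload is the
    # embedded message, so the recursion descends into attached emails too.
    if part.is_multipart():
        for child in part.get_payload():
            describe(child, depth + 1, out)
    return out


tree = describe(message_from_string(RAW, policy=policy.default))
for depth, content_type, is_attachment in tree:
    print("  " * depth + content_type + (" [attachment]" if is_attachment else ""))
```

In an event-driven parser the same information has to be kept as mutable state between callbacks, which is why the flag needs to be reset when a new body part starts.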



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-08-28 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595204#comment-16595204
 ] 

Yury Kats commented on TIKA-2680:
-

Indeed, I don't think this is fixable w/o mime4j changes. See my comment in 
TIKA-2685 on what I did to make it work.



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-08-28 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16595108#comment-16595108
 ] 

Tim Allison commented on TIKA-2680:
---

Asked for help on the mime4j list: 
https://mail-archives.apache.org/mod_mbox/james-mime4j-dev/201808.mbox/%3CCAC1dCwUhrqM7rb0t7FzS8wm5T%2BX3T_cJSkBrj7gy9ocrydffqw%40mail.gmail.com%3E

I've tried a number of options, and I'm not having luck with mime4j as it is.



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-09 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16536924#comment-16536924
 ] 

Yury Kats commented on TIKA-2680:
-

I don't think this would be the right thing to do for journaled emails, where 
the first inner message/rfc822 is the email itself and the wrapper message 
provides additional headers for it. The top level message does not have any 
value on its own.

It would be a very significant functionality change, likely to break clients 
that use Tika for email extraction. It will definitely break things in our 
environment. 

If this is made conditional on the presence of the X-MS-Journal-Report header, 
or controlled through a programmatic toggle (e.g. like 
setExtractAllAlternatives, which was added in 1.17), then it would be OK.
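
One possible reading of this proposal, sketched with Python's standard `email` package rather than Tika's parser: when the wrapper carries an `X-MS-Journal-Report` header, the first inner `message/rfc822` is treated as the primary email, and a flag lets callers switch that behavior off. The function name `primary_message` and the `unwrap_journal` parameter are assumptions made for this sketch, not existing Tika options.

```python
from email import message_from_string, policy

# Hypothetical journal wrapper around the real email.
JOURNALED = """\
MIME-Version: 1.0
Subject: wrapper
X-MS-Journal-Report:
Content-Type: multipart/mixed; boundary="j1"

--j1
Content-Type: text/plain

journal report text
--j1
Content-Type: message/rfc822

Subject: real
Content-Type: text/plain

real body
--j1--
"""


def primary_message(msg, unwrap_journal=True):
    """Return the message a client should treat as the main email."""
    # Header lookup is case-insensitive and only checks for presence.
    if not unwrap_journal or "X-MS-Journal-Report" not in msg:
        return msg
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            return part.get_payload(0)  # first inner email
    return msg


parsed = message_from_string(JOURNALED, policy=policy.default)
print(primary_message(parsed)["Subject"])
print(primary_message(parsed, unwrap_journal=False)["Subject"])
```

Keeping the decision behind both a header check and an explicit flag means existing clients would see no change unless they opt in or actually process journaled mail.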



[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535412#comment-16535412
 ] 

Tim Allison commented on TIKA-2680:
---

Given that Outlook appears to treat this as an attachment, are you ok if we do 
the same? !main_email_in_outlook.jpg! 






[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535351#comment-16535351
 ] 

Yury Kats commented on TIKA-2680:
-

Indeed, the first embedded rfc822 is not an attachment. I believe this is 
because it's an Exchange journaled email, see the presence of 
X-MS-Journal-Report header at the very top. 
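
To make the journaling point concrete, here is a minimal sketch in Python's standard `email` package (not Tika or mime4j code). It assumes that a message carrying an X-MS-Journal-Report header is a journal envelope whose first message/rfc822 part is the original email, even without a Content-Disposition header. The helper names `is_journal_report` and `journaled_original` are hypothetical.

```python
from email import message_from_bytes
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def is_journal_report(msg):
    """True if the header is present at all; its value may be empty."""
    return "X-MS-Journal-Report" in msg


def journaled_original(msg):
    """Return the first embedded message of a journal envelope, else None.

    Assumption: the original email is the first message/rfc822 part,
    regardless of whether it is marked as an attachment.
    """
    if not is_journal_report(msg) or not msg.is_multipart():
        return None
    for part in msg.get_payload():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part holds exactly one embedded message.
            return part.get_payload(0)
    return None


def build_sample_envelope():
    """Build a small stand-in for a journaled email (made-up content)."""
    original = MIMEText("see attached")
    original["Subject"] = "Email within email test"
    envelope = MIMEMultipart("mixed")
    envelope["Subject"] = "Test email within email"
    envelope["X-MS-Journal-Report"] = ""
    envelope.attach(MIMEText("Sender and other journal headers"))
    envelope.attach(MIMEMessage(original))  # no Content-Disposition set
    return message_from_bytes(envelope.as_bytes())


if __name__ == "__main__":
    found = journaled_original(build_sample_envelope())
    print(found["Subject"] if found is not None else "not a journal report")
```

Whether a parser should unwrap the envelope this way or report the original as an attachment is a design choice; the sketch only shows that the header is enough to tell the two cases apart.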






[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-07-06 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16535339#comment-16535339
 ] 

Tim Allison commented on TIKA-2680:
---

Something like this?

{noformat}
multipart/mixed (uses _728aa617-16cf-4d95-8bc2-9f1868397202_)
    text/plain (_728aa617-16cf-4d95-8bc2-9f1868397202_) (sender and some other headers, no real content "Message-Id: <0fab98cd190c41f199a25c73f78a2...@bsts124002.eu.banknet.com>")
    message/rfc822 (_728aa617-16cf-4d95-8bc2-9f1868397202_) uses (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
        multipart/alternative (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
            text/plain (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_)
            text/html (_000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) ("cocacola Henry van der Smith")
            end _000_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
        message/rfc822 (content-disposition: attachment) (_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) uses (_004_8075737674787666767166806676697476787366657271727266777_)
            multipart/alternative (_004_8075737674787666767166806676697476787366657271727266777_) uses (_000_8075737674787666767166806676697476787366657271727266777_)
                text/plain (_000_8075737674787666767166806676697476787366657271727266777_)
                text/html (_000_8075737674787666767166806676697476787366657271727266777_) ("Cocacola test Henry van der Smith")
                end _000_8075737674787666767166806676697476787366657271727266777_
            message/rfc822 (content-disposition: attachment text/plain) (004_8075737674787666767166806676697476787366657271727266777)
                no multipart body, just plain text: ("I won't be able to attend")
        end _004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_
    end _728aa617-16cf-4d95-8bc2-9f1868397202_
{noformat}

As you point out, it is mildly odd (to me at least) that the first embedded 
rfc822 (the one that uses 
_004_0fab98cd190c41f199a25c73f78a2070BSTS124002eubanknetcom_) does not have 
content-disposition: attachment.
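
For anyone who wants to reproduce this kind of outline on their own files, here is a rough sketch using Python's standard `email` package rather than Tika or mime4j. The function name `mime_tree` and the small sample message are made up for illustration; the sample only mimics the shape discussed here (a first embedded rfc822 without a disposition, and a second one marked as an attachment).

```python
from email import message_from_bytes
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def mime_tree(part, depth=0):
    """Return one indented line per MIME part, noting Content-Disposition."""
    label = part.get_content_type()
    disposition = part.get_content_disposition()
    if disposition:
        label += f" (content-disposition: {disposition})"
    lines = ["    " * depth + label]
    # is_multipart() is also true for message/rfc822, whose payload is a
    # one-element list holding the embedded message.
    if part.is_multipart():
        for child in part.get_payload():
            lines.extend(mime_tree(child, depth + 1))
    return lines


def build_sample():
    """Build a small nested message with made-up content."""
    innermost = MIMEText("I won't be able to attend")
    middle = MIMEMultipart("mixed")
    middle.attach(MIMEText("body of the middle email"))
    attached = MIMEMessage(innermost)
    attached.add_header("Content-Disposition", "attachment")
    middle.attach(attached)
    outer = MIMEMultipart("mixed")
    outer.attach(MIMEText("journal report text"))
    outer.attach(MIMEMessage(middle))  # deliberately no Content-Disposition
    return message_from_bytes(outer.as_bytes())


if __name__ == "__main__":
    print("\n".join(mime_tree(build_sample())))
```

To inspect a real file, one could pass the bytes of an .eml file to `message_from_bytes` and feed the result to `mime_tree`.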


[jira] [Commented] (TIKA-2680) Email attachments to an email are not extracted

2018-06-29 Thread Yury Kats (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528296#comment-16528296
 ] 

Yury Kats commented on TIKA-2680:
-

This appears to be because mime4j (correctly) treats an attached email as a 
"new message" and not as a "part" of the original email.

MailContentHandler#body is not called for the attached email. Instead, 
MailContentHandler#startMessage is called, so MailContentHandler does not do 
any recursive parsing/extraction, and the parts of the nested message are 
treated as parts of the original.
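
As an illustration of the recursion that is missing, here is a minimal sketch using Python's standard `email` package, not Tika's MailContentHandler or mime4j. It treats each message/rfc822 part as a new message one level deeper instead of folding its parts into the parent. The helper names `make_email` and `nested_subjects` are hypothetical, and the sample bodies are made up; only the three subjects come from the report.

```python
from email import message_from_bytes
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def make_email(subject, body, attached=None):
    """Build a message; if `attached` is given, embed it as message/rfc822."""
    if attached is None:
        msg = MIMEText(body)
    else:
        msg = MIMEMultipart("mixed")
        msg.attach(MIMEText(body))
        msg.attach(MIMEMessage(attached))
    msg["Subject"] = subject
    return msg


def nested_subjects(msg, depth=0):
    """Yield (depth, subject) for msg and every email embedded below it."""
    yield depth, msg["Subject"]
    yield from _scan_parts(msg, depth)


def _scan_parts(part, depth):
    if part.get_content_type() == "message/rfc822":
        # An embedded email starts a new message one level deeper,
        # rather than contributing parts to the enclosing message.
        yield from nested_subjects(part.get_payload(0), depth + 1)
    elif part.is_multipart():
        for child in part.get_payload():
            yield from _scan_parts(child, depth)


if __name__ == "__main__":
    inner = make_email("Stand-up today", "I won't be able to attend")
    middle = make_email("Email within email test", "see attached", inner)
    outer = make_email("Test email within email", "see attached", middle)
    parsed = message_from_bytes(outer.as_bytes())
    for depth, subject in nested_subjects(parsed):
        print("    " * depth + subject)
```

The depth counter plays the role that tracking nested startMessage/endMessage events would play in an event-driven handler; that mapping is an analogy, not a description of how Tika implements it.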



