Gregory Lepore created TIKA-4447:
------------------------------------
Summary: eml attachment duplicate filename on extract
Key: TIKA-4447
URL: https://issues.apache.org/jira/browse/TIKA-4447
Project: Tika
Issue Type: Bug
Affects Versions: 3.2.0
Reporter: Gregory Lepore
Attachments: 12.eml
I'm not sure whether this is a bug in Tika or a problem with the source files.
I'm extracting and analyzing attachments from a large set of eml files
(originally in pst format), but every attachment's filename is doubled on
extraction. For example, for the attached eml file I get:
java -jar /media/lepore/Work/tika/tika.jar --extract 12.eml
Extracting 'rtf-body.rtfrtf-body.rtf' (application/rtf) to ./cc9d8ebd-b93c-4235-b766-79b0aa841ef2-rtf-body.rtfrtf-body.rtf
Extracting '03-005 ACF GA Plan1.doc03-005 ACF GA Plan1.doc' (application/msword) to ./0220432f-6dcc-4beb-b659-66be0fe0f60f-03-005 ACF GA Plan1.doc03-005 ACF GA Plan1.doc
Extracting 'Talking Point1 1-17.docTalking Point1 1-17.doc' (application/msword) to ./24bbaeab-448e-4d47-8b6d-ee9651156f89-Talking Point1 1-17.docTalking Point1 1-17.doc
All of the extracted file names are doubled. In the eml file I see:
Content-Type: application/msword
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
    filename*=utf-8''Talking%20Point1%201-17.doc;
    filename="Talking Point1 1-17.doc"
Perhaps the filename being given twice here (once as filename* and once as
filename) is contributing to the problem?
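For reference, a header like this carries the name twice: once in the RFC 2231 extended form (filename*) and once as a plain quoted fallback (filename). Below is a minimal Java sketch of how a reader could pick a single name from the two, preferring filename* as RFC 6266 recommends for HTTP. This is not Tika's actual code; the class and method names (DispositionFilename, params, decodeExtValue, pick) are made up for illustration, and the parsing is deliberately simplified.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: picks one filename from a
// Content-Disposition value that has both filename* and filename.
public class DispositionFilename {

    // Split the header value into lower-cased parameter names and raw values.
    // Simplification: splits on every ';', so a quoted value that itself
    // contains ';' is not handled.
    static Map<String, String> params(String header) {
        Map<String, String> out = new LinkedHashMap<>();
        String[] parts = header.split(";");
        for (int i = 1; i < parts.length; i++) { // parts[0] is the disposition type
            String part = parts[i].trim();       // also drops folding whitespace
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = part.substring(eq + 1).trim();
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1); // strip quotes
            }
            out.put(name, value);
        }
        return out;
    }

    // Decode an extended value of the form charset'language'percent-encoded.
    static String decodeExtValue(String ext) {
        int first = ext.indexOf('\'');
        int second = ext.indexOf('\'', first + 1);
        if (first < 0 || second < 0) {
            return ext; // not in extended form, return unchanged
        }
        Charset cs = Charset.forName(ext.substring(0, first));
        String encoded = ext.substring(second + 1);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%' && i + 2 < encoded.length()) {
                // Two hex digits follow the percent sign.
                bytes.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                bytes.write(c); // extended values are ASCII outside of %XX escapes
            }
        }
        return new String(bytes.toByteArray(), cs);
    }

    // Prefer filename* when present; otherwise fall back to filename.
    static String pick(String header) {
        Map<String, String> p = params(header);
        if (p.containsKey("filename*")) {
            return decodeExtValue(p.get("filename*"));
        }
        return p.get("filename");
    }
}
```

With the header from 12.eml, DispositionFilename.pick(...) returns the single name Talking Point1 1-17.doc. Concatenating the two parameter values instead would produce exactly the doubled name shown in the extraction output, which may or may not be what is happening here.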
Extracting the files with pffexport doesn't double the filename, but ripmime
and munpack both have trouble with them.