[
https://issues.apache.org/jira/browse/MIME4J-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank Fodera updated MIME4J-281:
--------------------------------
Description:
*Summary*
We use Tika 1.18, which bundles James Mime4j 0.8.1, to parse and extract
emails. One of our customers attempted to parse an mbox file containing an
email with a base64-encoded PDF attachment. On opening the mbox file, we
noticed that the attached PDF was encoded so that each line is 80 characters
long and padded with ==. We can't change how they encoded it, and we don't know
what tool they used. Later, when the extracted PDF is sent on to be parsed,
parsing fails because the PDF was only partially extracted and is not a valid
PDF.
It appears that MimeEntity (decodeStream method) determines that the
InputStream is base64 encoded and wraps the LineReaderInputStreamAdaptor in a
Base64InputStream. When the stream is later read, the read0 method simply
checks for a BASE64_PAD and marks it as EOF, even though there is additional
content to be decoded.
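To make the expected behaviour concrete, here is a minimal sketch of a decoder that treats padding as the end of a line rather than the end of the stream. It uses only java.util.Base64 and is not Mime4j's implementation or a proposed patch. The class name PerLineBase64 and the method decodeLines are hypothetical, and the sketch assumes every line is a complete, separately padded base64 unit.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PerLineBase64 {

    /**
     * Decodes text in which every line is its own base64 unit (assumption:
     * each line may carry its own '=' padding). Padding ends the line only,
     * not the whole input.
     */
    static byte[] decodeLines(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines between units
            }
            byte[] chunk = Base64.getDecoder().decode(trimmed);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // First line ends in "==", yet more content follows on the next line.
        String input = "SGVsbG8sIA==\r\nd29ybGQh\r\n";
        byte[] decoded = decodeLines(input);
        System.out.println(new String(decoded, StandardCharsets.US_ASCII));
    }
}
```

A decoder that stops at the first padding character would return only the bytes of the first line for this input, which matches the truncation described in this report.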
*Code to Help Reproduce:*
{noformat}
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.james.mime4j.codec.Base64InputStream;
import org.apache.james.mime4j.io.LineReaderInputStreamAdaptor;
import org.apache.tika.io.IOUtils;

public class Base64Repro {
    public static void main(String[] args) throws Exception {
        // base64File.txt holds the base64 body of the attachment (see Attached Files)
        try (InputStream in = new FileInputStream("/path/to/file/base64File.txt");
             InputStream base64In = new Base64InputStream(new LineReaderInputStreamAdaptor(in))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            IOUtils.copy(base64In, bos);
            // With Mime4j 0.8.1 we see only the first line's bytes here
            System.out.println("Decoded bytes: " + bos.size());
        }
    }
}
{noformat}
Running the code above shows that only the first line of the encoded PDF
(contained in base64File.txt) is decoded, instead of the entire PDF.
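For reproducing the problem without the customer's file, input of the same shape can be generated. The sketch below rests on an assumption: we do not know how the original file was produced, but encoding 58 bytes at a time is one way to get 80-character lines that each end in ==, since 58 = 19 * 3 + 1 gives 20 base64 groups with the last one padded. The class name PaddedLineSample and the method encodePerLine are hypothetical.

```java
import java.util.Arrays;
import java.util.Base64;

public class PaddedLineSample {

    /**
     * Encodes data 58 bytes at a time, each chunk on its own line.
     * Every full chunk becomes 80 characters ending in "==".
     * This is an assumed way to produce such input, not the customer's tool.
     */
    static String encodePerLine(byte[] data) {
        final int bytesPerLine = 58; // 19 full 3-byte groups plus 1 leftover byte
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += bytesPerLine) {
            int end = Math.min(off + bytesPerLine, data.length);
            byte[] chunk = Arrays.copyOfRange(data, off, end);
            sb.append(Base64.getEncoder().encodeToString(chunk)).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = new byte[116]; // two full lines of zero bytes
        for (String line : encodePerLine(data).split("\r\n")) {
            System.out.println(line.length() + " chars, ends with ==: " + line.endsWith("=="));
        }
    }
}
```

Writing the returned string to a file and pointing the reproduction code at it should exercise the same code path as base64File.txt, provided the assumption about the encoding holds.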
*Extracting the MBOX via Tika 1.18*
{noformat}
[user]$ java -jar tika-app-1.18.jar -m -J ~/Downloads/[email protected] | python -m json.tool
Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3
handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3
handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
[
{
"Content-Encoding": "windows-1252",
"Content-Length": "366503",
"Content-Type": "application/mbox",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.mbox.MboxParser"
],
"X-TIKA:parse_time_millis": "199",
"resourceName": "[email protected]"
},
{
"Content-Disposition": "attachment;
filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
"Content-Type": "application/pdf",
"Multipart-Boundary": "===============6812308677685932777==",
"Multipart-Subtype": "mixed",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.pdf.PDFParser"
],
"X-TIKA:EXCEPTION:embedded_exception":
"org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
org.apache.tika.parser.pdf.PDFParser@45f45fa1\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat
org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat
org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat
org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat
org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
org.apache.tika.parser.mbox.MboxParser.parse(MboxParser.java:135)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat
org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:507)\n\tat
org.apache.tika.cli.TikaCLI.process(TikaCLI.java:481)\n\tat
org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)\nCaused by:
java.io.IOException: Missing root object specification in trailer.\n\tat
org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2727)\n\tat
org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)\n\tat
org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)\n\tat
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144)\n\tat
org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117)\n\tat
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)\n\tat
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\t...
25 more\n",
"X-TIKA:embedded_resource_path":
"/embedded-1/As_Cool_as_I_Am_(film).pdf",
"X-TIKA:parse_time_millis": "52",
"embeddedResourceType": "ATTACHMENT",
"resourceName": "/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf"
},
{
"Author": [
"[email protected]",
"[email protected]"
],
"Content-Type": "message/rfc822",
"Content-Type-Override": "message/rfc822",
"MboxParser-content-disposition": "attachment;",
"MboxParser-content-transfer-encoding": [
"7bit",
"base64"
],
"MboxParser-from": "[email protected] Wed May 16 09:17:10 2018",
"MboxParser-mime-version": [
"1.0",
"1.0"
],
"MboxParser-return-path": "<[email protected]>
filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
"Message-From": "[email protected]",
"Message-Recipient-Address": "[email protected]",
"Message-To": [
"[email protected]",
"[email protected]"
],
"Message:From-Email": "[email protected]",
"Message:Raw-Header:MIME-Version": "1.0",
"Message:Raw-Header:Return-Path": "<[email protected]>",
"Multipart-Boundary": "===============6812308677685932777==",
"Multipart-Subtype": "mixed",
"X-Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.mail.RFC822Parser"
],
"X-TIKA:embedded_resource_path": "/embedded-1",
"X-TIKA:parse_time_millis": "124",
"creator": [
"[email protected]",
"[email protected]"
],
"dc:creator": [
"[email protected]",
"[email protected]"
],
"dc:format": "application/pdf",
"dc:title": "Side question local book claim.",
"format": "application/pdf",
"meta:author": [
"[email protected]",
"[email protected]"
],
"subject": "Side question local book claim."
}
]{noformat}
*Attached Files*
# The customer's original mbox file ([email protected])
# The base64-encoded PDF in its own file (base64File.txt)
# The extracted PDF standalone (As_Cool_as_I_Am_(film).pdf)
> A Base64 stream which contains padding on each line only decodes the first
> line
> -------------------------------------------------------------------------------
>
> Key: MIME4J-281
> URL: https://issues.apache.org/jira/browse/MIME4J-281
> Project: James Mime4j
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Frank Fodera
> Priority: Major
> Attachments: [email protected],
> As_Cool_as_I_Am_(film).pdf, base64File.txt
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)