[
https://issues.apache.org/jira/browse/MIME4J-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Frank Fodera updated MIME4J-281:
--------------------------------
Attachment: As_Cool_as_I_Am_(film).pdf
> A Base64 stream which contains padding on each line only decodes the first
> line
> -------------------------------------------------------------------------------
>
> Key: MIME4J-281
> URL: https://issues.apache.org/jira/browse/MIME4J-281
> Project: James Mime4j
> Issue Type: Bug
> Affects Versions: 0.8.1
> Reporter: Frank Fodera
> Priority: Major
> Attachments: [email protected],
> As_Cool_as_I_Am_(film).pdf, base64File.txt
>
>
> *Summary*
> We use Tika 1.18, which bundles James Mime4j version 0.8.1, to parse and
> extract emails. One of our customers attempted to parse an mbox file
> containing an email with a base64-encoded PDF attachment. On opening the
> mbox file, we noticed that the attached PDF was encoded so that each line is
> 80 characters long and padded with ==. We can't change how it was encoded,
> and we don't know what tool produced it. When the extracted PDF is later
> sent to be parsed, parsing fails because the PDF was only partially
> extracted and is not a valid PDF.
> It appears that MimeEntity (decodeStream method) determines that the
> InputStream is Base64 encoded and wraps the LineReaderInputStreamAdaptor in a
> Base64InputStream. When the stream is later read, the read0 method
> simply checks for a BASE64_PAD and marks it as EOF even though there is
> additional content to be decoded.
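To make the failure mode concrete, here is a minimal sketch that uses only java.util.Base64 from the JDK. It is not Mime4j code: decodeUntilFirstPad is a hypothetical helper that merely imitates the behaviour described above (treating the first padding run as end of stream), and decodeEachLine is one possible tolerant approach that assumes every line is an independently padded Base64 unit. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class PerLinePaddingSketch {

    // Imitates the reported behaviour (NOT Mime4j's actual code):
    // everything after the first run of '=' is ignored.
    static byte[] decodeUntilFirstPad(String text) {
        int pad = text.indexOf('=');
        if (pad < 0) {
            return Base64.getMimeDecoder().decode(text);
        }
        int end = pad;
        while (end < text.length() && text.charAt(end) == '=') {
            end++;
        }
        return Base64.getMimeDecoder().decode(text.substring(0, end));
    }

    // Tolerant alternative: decode each line as its own padded unit
    // and concatenate the results.
    static byte[] decodeEachLine(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            byte[] chunk = Base64.getDecoder().decode(trimmed);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        Base64.Encoder enc = Base64.getEncoder();
        // Two lines, each encoded separately so that each ends in "==".
        String text = enc.encodeToString("abcd".getBytes(StandardCharsets.US_ASCII)) + "\r\n"
                + enc.encodeToString("efgh".getBytes(StandardCharsets.US_ASCII)) + "\r\n";
        System.out.println(new String(decodeUntilFirstPad(text), StandardCharsets.US_ASCII));
        System.out.println(new String(decodeEachLine(text), StandardCharsets.US_ASCII));
    }
}
```

The first call returns only the bytes of the first line, while the second recovers the whole payload. Whether a real fix should decode per line or simply continue past padding is a design decision for the library and is not settled here.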
>
> *Code to Help Reproduce:*
> {noformat}
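> import java.io.*;
> import org.apache.james.mime4j.codec.Base64InputStream;
> import org.apache.james.mime4j.io.LineReaderInputStreamAdaptor;
>
> public class Base64PaddingRepro {
>     public static void main(String[] args) throws Exception {
>         File initialFile = new File("/path/to/file/base64File.txt");
>         ByteArrayOutputStream bos = new ByteArrayOutputStream();
>         // Wrap the file the way MimeEntity.decodeStream appears to.
>         try (InputStream inputStream = new FileInputStream(initialFile);
>              LineReaderInputStreamAdaptor lineReaderInputStream =
>                  new LineReaderInputStreamAdaptor(inputStream);
>              InputStream base64InputStream =
>                  new Base64InputStream(lineReaderInputStream)) {
>             org.apache.tika.io.IOUtils.copy(base64InputStream, bos);
>         }
>         // Compare this with the size of the original PDF.
>         System.out.println("Decoded bytes: " + bos.size());
>     }
> }{noformat}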
> Running the code above shows that only the first line of the encoded PDF
> (contained in base64File.txt) is decoded, not the entire PDF.
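For anyone without access to the attachments, the sketch below shows one way to build a synthetic input with the same shape as the one described: 80-character lines that each end in ==. It relies on Base64 arithmetic only: 58 input bytes encode to 20 groups of 4 characters, and because 58 leaves a remainder of 1 when divided by 3, the last group is padded with ==. The class name, method name, and chunk size are assumptions chosen for illustration; the tool that produced the customer's file is unknown.

```java
import java.util.Arrays;
import java.util.Base64;

class PaddedLineFixture {

    // 58 bytes -> 80 Base64 characters, the last two of which are "==".
    static final int BYTES_PER_LINE = 58;

    // Encode each 58-byte chunk separately so every full line carries its own padding.
    static String encodePerLine(byte[] data) {
        StringBuilder sb = new StringBuilder();
        Base64.Encoder encoder = Base64.getEncoder();
        for (int off = 0; off < data.length; off += BYTES_PER_LINE) {
            int end = Math.min(off + BYTES_PER_LINE, data.length);
            sb.append(encoder.encodeToString(Arrays.copyOfRange(data, off, end)));
            sb.append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = new byte[3 * BYTES_PER_LINE]; // three full lines
        for (int i = 0; i < data.length; i++) {
            data[i] = (byte) i;
        }
        for (String line : encodePerLine(data).split("\r\n")) {
            System.out.println(line.length() + " chars, ends with '==': " + line.endsWith("=="));
        }
    }
}
```

Writing the returned string to a file and pointing the reproduction code at it should exercise the same code path as base64File.txt, assuming the attachment really has this layout.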
>
> *Extracting the MBOX via Tika 1.18*
> {noformat}
> [user]$ java -jar tika-app-1.18.jar -m -J
> ~/Downloads/[email protected] | python -m
> json.tool
> Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
>
> Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> [
> {
> "Content-Encoding": "windows-1252",
> "Content-Length": "366503",
> "Content-Type": "application/mbox",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mbox.MboxParser"
> ],
> "X-TIKA:parse_time_millis": "199",
> "resourceName": "[email protected]"
> },
> {
> "Content-Disposition": "attachment;
> filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
> "Content-Type": "application/pdf",
> "Multipart-Boundary": "===============6812308677685932777==",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.pdf.PDFParser"
> ],
> "X-TIKA:EXCEPTION:embedded_exception":
> "org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from
> org.apache.tika.parser.pdf.PDFParser@45f45fa1\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
>
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
>
> org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat
>
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat
>
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat
> org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
>
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
> org.apache.tika.parser.mbox.MboxParser.parse(MboxParser.java:135)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
>
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat
> org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:507)\n\tat
> org.apache.tika.cli.TikaCLI.process(TikaCLI.java:481)\n\tat
> org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)\nCaused by:
> java.io.IOException: Missing root object specification in trailer.\n\tat
> org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2727)\n\tat
> org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)\n\tat
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)\n\tat
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144)\n\tat
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117)\n\tat
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)\n\tat
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\t...
> 25 more\n",
> "X-TIKA:embedded_resource_path":
> "/embedded-1/As_Cool_as_I_Am_(film).pdf",
> "X-TIKA:parse_time_millis": "52",
> "embeddedResourceType": "ATTACHMENT",
> "resourceName":
> "/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf"
> },
> {
> "Author": [
> "[email protected]",
> "[email protected]"
> ],
> "Content-Type": "message/rfc822",
> "Content-Type-Override": "message/rfc822",
> "MboxParser-content-disposition": "attachment;",
> "MboxParser-content-transfer-encoding": [
> "7bit",
> "base64"
> ],
> "MboxParser-from": "[email protected] Wed May 16 09:17:10 2018",
> "MboxParser-mime-version": [
> "1.0",
> "1.0"
> ],
> "MboxParser-return-path": "<[email protected]>
> filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
> "Message-From": "[email protected]",
> "Message-Recipient-Address": "[email protected]",
> "Message-To": [
> "[email protected]",
> "[email protected]"
> ],
> "Message:From-Email": "[email protected]",
> "Message:Raw-Header:MIME-Version": "1.0",
> "Message:Raw-Header:Return-Path": "<[email protected]>",
> "Multipart-Boundary": "===============6812308677685932777==",
> "Multipart-Subtype": "mixed",
> "X-Parsed-By": [
> "org.apache.tika.parser.DefaultParser",
> "org.apache.tika.parser.mail.RFC822Parser"
> ],
> "X-TIKA:embedded_resource_path": "/embedded-1",
> "X-TIKA:parse_time_millis": "124",
> "creator": [
> "[email protected]",
> "[email protected]"
> ],
> "dc:creator": [
> "[email protected]",
> "[email protected]"
> ],
> "dc:format": "application/pdf",
> "dc:title": "Side question local book claim.",
> "format": "application/pdf",
> "meta:author": [
> "[email protected]",
> "[email protected]"
> ],
> "subject": "Side question local book claim."
> }
> ]{noformat}
>
> *Attached Files*
> # The customer's original mbox file
> ([email protected])
> # The base64 encoded PDF in its own file (base64File.txt)
> # The extracted PDF standalone (As_Cool_as_I_Am_(film).pdf)
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)