[ 
https://issues.apache.org/jira/browse/MIME4J-281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Fodera updated MIME4J-281:
--------------------------------
    Attachment: [email protected]

> A Base64 stream which contains padding on each line only decodes the first 
> line
> -------------------------------------------------------------------------------
>
>                 Key: MIME4J-281
>                 URL: https://issues.apache.org/jira/browse/MIME4J-281
>             Project: James Mime4j
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Frank Fodera
>            Priority: Major
>         Attachments: [email protected], 
> As_Cool_as_I_Am_(film).pdf, base64File.txt
>
>
> *Summary*
>  We use Tika 1.18, which bundles James Mime4j 0.8.1, to parse and extract 
> emails. One of our customers attempted to parse an mbox file containing an 
> email with a Base64-encoded PDF attachment. On opening the mbox file, we 
> noticed that the attachment was encoded so that each line is 80 characters 
> long and padded with ==. We can't change how they encoded it, and we don't 
> know what tool they used. When the extracted PDF is later sent to be parsed, 
> parsing fails because the PDF was only partially extracted and is not a 
> valid file.
> It appears that MimeEntity (decodeStream method) determines that the 
> InputStream is Base64 encoded and wraps the LineReaderInputStreamAdaptor in 
> a Base64InputStream. When the stream is later read, the read0 method simply 
> checks for a BASE64_PAD and marks it as EOF, even though there is more 
> content to decode.
>  
> *Code to Help Reproduce:*
> {noformat}
> import java.io.*;
> import org.apache.james.mime4j.codec.Base64InputStream;
> import org.apache.james.mime4j.io.LineReaderInputStreamAdaptor;
>
> public class Base64Repro {
>     public static void main(String[] args) throws Exception {
>         File initialFile = new File("/path/to/file/base64File.txt");
>         ByteArrayOutputStream bos = new ByteArrayOutputStream();
>         // Wrap the raw file the way the summary describes for Base64 bodies.
>         try (InputStream base64InputStream = new Base64InputStream(
>                 new LineReaderInputStreamAdaptor(new FileInputStream(initialFile)))) {
>             org.apache.tika.io.IOUtils.copy(base64InputStream, bos);
>         }
>         // With the attached file, bos ends up holding only the first decoded line.
>         System.out.println("Decoded bytes: " + bos.size());
>     }
> }{noformat}
> Running the code above shows that only the first line of the encoded PDF 
> (contained in base64File.txt) is decoded, not the entire PDF.
>  
> *Extracting the MBOX via Tika 1.18*
> {noformat}
> [user]$ java -jar tika-app-1.18.jar -m -J  
> ~/Downloads/[email protected] | python -m 
> json.tool
> Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
>  
> Jun 25, 2018 11:55:55 AM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> [
>     {
>         "Content-Encoding": "windows-1252",
>         "Content-Length": "366503",
>         "Content-Type": "application/mbox",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mbox.MboxParser"
>         ],
>         "X-TIKA:parse_time_millis": "199",
>         "resourceName": "[email protected]"
>     },
>     {
>         "Content-Disposition": "attachment; 
> filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
>         "Content-Type": "application/pdf",
>         "Multipart-Boundary": "===============6812308677685932777==",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.pdf.PDFParser"
>         ],
>         "X-TIKA:EXCEPTION:embedded_exception": 
> "org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
> org.apache.tika.parser.pdf.PDFParser@45f45fa1\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:286)\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
>  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
>  
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
>  
> org.apache.tika.parser.mail.MailContentHandler.handleEmbedded(MailContentHandler.java:283)\n\tat
>  
> org.apache.tika.parser.mail.MailContentHandler.body(MailContentHandler.java:228)\n\tat
>  
> org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:133)\n\tat
>  org.apache.tika.parser.mail.RFC822Parser.parse(RFC822Parser.java:100)\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
>  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)\n\tat 
> org.apache.tika.parser.RecursiveParserWrapper$EmbeddedParserDecorator.parse(RecursiveParserWrapper.java:318)\n\tat
>  
> org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)\n\tat 
> org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)\n\tat
>  org.apache.tika.parser.mbox.MboxParser.parse(MboxParser.java:135)\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\tat 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)\n\tat
>  
> org.apache.tika.parser.RecursiveParserWrapper.parse(RecursiveParserWrapper.java:159)\n\tat
>  org.apache.tika.cli.TikaCLI.handleRecursiveJson(TikaCLI.java:507)\n\tat 
> org.apache.tika.cli.TikaCLI.process(TikaCLI.java:481)\n\tat 
> org.apache.tika.cli.TikaCLI.main(TikaCLI.java:145)\nCaused by: 
> java.io.IOException: Missing root object specification in trailer.\n\tat 
> org.apache.pdfbox.pdfparser.COSParser.parseTrailerValuesDynamically(COSParser.java:2727)\n\tat
>  org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:193)\n\tat 
> org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:240)\n\tat 
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1144)\n\tat 
> org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1117)\n\tat 
> org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:153)\n\tat 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)\n\t... 
> 25 more\n",
>         "X-TIKA:embedded_resource_path": 
> "/embedded-1/As_Cool_as_I_Am_(film).pdf",
>         "X-TIKA:parse_time_millis": "52",
>         "embeddedResourceType": "ATTACHMENT",
>         "resourceName": 
> "/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf"
>     },
>     {
>         "Author": [
>             "[email protected]",
>             "[email protected]"
>         ],
>         "Content-Type": "message/rfc822",
>         "Content-Type-Override": "message/rfc822",
>         "MboxParser-content-disposition": "attachment;",
>         "MboxParser-content-transfer-encoding": [
>             "7bit",
>             "base64"
>         ],
>         "MboxParser-from": "[email protected] Wed May 16 09:17:10 2018",
>         "MboxParser-mime-version": [
>             "1.0",
>             "1.0"
>         ],
>         "MboxParser-return-path": "<[email protected]> 
> filename=\"/home/test/test/attachments/As_Cool_as_I_Am_(film).pdf\"",
>         "Message-From": "[email protected]",
>         "Message-Recipient-Address": "[email protected]",
>         "Message-To": [
>             "[email protected]",
>             "[email protected]"
>         ],
>         "Message:From-Email": "[email protected]",
>         "Message:Raw-Header:MIME-Version": "1.0",
>         "Message:Raw-Header:Return-Path": "<[email protected]>",
>         "Multipart-Boundary": "===============6812308677685932777==",
>         "Multipart-Subtype": "mixed",
>         "X-Parsed-By": [
>             "org.apache.tika.parser.DefaultParser",
>             "org.apache.tika.parser.mail.RFC822Parser"
>         ],
>         "X-TIKA:embedded_resource_path": "/embedded-1",
>         "X-TIKA:parse_time_millis": "124",
>         "creator": [
>             "[email protected]",
>             "[email protected]"
>         ],
>         "dc:creator": [
>             "[email protected]",
>             "[email protected]"
>         ],
>         "dc:format": "application/pdf",
>         "dc:title": "Side question local book claim.",
>         "format": "application/pdf",
>         "meta:author": [
>             "[email protected]",
>             "[email protected]"
>         ],
>         "subject": "Side question local book claim."
>     }
> ]{noformat}
>  
> *Attached Files*
>  # The customer's original mbox file 
> ([email protected])
>  # The base64 encoded PDF in its own file (base64File.txt)
>  # The extracted PDF standalone (As_Cool_as_I_Am_(film).pdf)
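
As a stopgap for input shaped like base64File.txt, the behaviour described in
the summary (decoding stops at the first BASE64_PAD) can be sidestepped by
decoding each line on its own and concatenating the results. The following is
a minimal sketch, not a fix for Base64InputStream: it uses only
java.util.Base64 from the JDK, the class and method names
(PerLineBase64Decoder, decodePaddedLines) are hypothetical, and it assumes
every line is a complete, separately padded Base64 unit, as in the reported
file.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of Mime4j or Tika. */
final class PerLineBase64Decoder {

    /**
     * Decodes text in which every line is a complete, separately padded
     * Base64 unit, and returns the concatenated decoded bytes.
     */
    static byte[] decodePaddedLines(String text) {
        Base64.Decoder decoder = Base64.getDecoder();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // \R matches any line break, so CRLF and LF input both work.
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            // Each line is decoded independently, so padding ends only that line.
            byte[] chunk = decoder.decode(trimmed);
            out.write(chunk, 0, chunk.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Two lines, each padded with "==": "YQ==" is "a", "Yg==" is "b".
        byte[] decoded = decodePaddedLines("YQ==\r\nYg==\r\n");
        System.out.println(new String(decoded, StandardCharsets.US_ASCII)); // prints "ab"
    }
}
```

Base64.Decoder.decode throws IllegalArgumentException on a line that is not
valid Base64, so input that mixes in other content would need to be filtered
before it reaches this helper.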



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
