[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955048#comment-13955048 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

Hi Tim,

I have created an RTF with 5 embedded Office docs in it, but at 10 MB it is too large to attach to the ticket. If you're interested, we'll have to find some other way to get it to you ...

Chris

Embedded documents in RTF are not extracted
-------------------------------------------
Key: TIKA-1010
URL: https://issues.apache.org/jira/browse/TIKA-1010
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip

When an RTF doc embeds a doc it looks like this:
{noformat}
{\object\objemb \objw628\objh765{\*\objclass Package}{\*\objdata 0105020008005061636b616765006600 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400 5404bbfaee00080054044505 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
{noformat}
But, unfortunately, the format of those hex bytes is not spelled out in the RTF spec ... the spec merely says the bytes are saved by the OLESaveToStream function ... and I haven't been able to find a description of what the bytes mean. In this case they are a Package object (\objclass Package), which I think is an [old?] way to wrap any non-OLE file (this is just a .txt file).
Here's the hex dump:
{noformat}
0000  01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b  |............Pack|
0010  61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00  |age.........f...|
0020  02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
0030  4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
0040  74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22  |top\HW.txt....."|
0050  00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
0060  67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
0070  2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
0080  6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00  |orld............|
0090  00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
00a0  04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45  |.............T.E|
00b0  05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c  |.........s......|
00c0  00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05  |................|
00d0  00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5  |.....2.)........|
00e0  ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00  |................|
00f0  00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
0100  00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
0110  83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00  |.....-..........|
0120  00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00  |................|
0130  00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00  |................|
0140  00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
0150  00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00  |................|
0160  00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65  |.........."Syste|
0170  6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
0180  00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d  |...............-|
0190  01 01 00 03 00 00 00 00 00                       |.........|
0199
{noformat}
Anyway I have no idea how to decode the bytes at this point ... just opening the issue in case anyone else does!

-- This message was sent by Atlassian JIRA (v6.2#6252)
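For anyone picking this up, here is a rough sketch of how the first 0x86 bytes of that dump seem to break down when read as little-endian 32-bit integers and length-prefixed strings. The field names (version, format_id, label, and so on) are my own labels inferred from the dump, not names taken from the RTF spec, and the 4 bytes I skip as "flags" are unexplained. Treat this as a guess that happens to fit this one sample, not a general decoder.

```python
import struct

# First 0x86 bytes of the hex dump above: header plus the Package payload.
DUMP = bytes.fromhex(
    "0105000002000000080000005061636b"
    "61676500000000000000000066000000"
    "020048772e74787400433a5c444f4355"
    "4d457e315c6967616c73685c4465736b"
    "746f705c48572e747874000000030022"
    "000000433a5c444f43554d457e315c69"
    "67616c73685c4465736b746f705c4857"
    "2e747874000b00000048656c6c6f2057"
    "6f726c640000"
)


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit integer at pos."""
    (value,) = struct.unpack_from("<I", buf, pos)
    return value, pos + 4


def read_lp_string(buf, pos):
    """Read a string preceded by a 32-bit length (length includes the NUL)."""
    length, pos = read_u32(buf, pos)
    raw = buf[pos:pos + length]
    return raw.rstrip(b"\x00").decode("ascii"), pos + length


def read_cstring(buf, pos):
    """Read a NUL-terminated string starting at pos."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("ascii"), end + 1


def parse_package(buf):
    """Assumed layout, inferred only from the sample dump in this issue."""
    pos = 0
    version, pos = read_u32(buf, pos)          # 01 05 00 00
    format_id, pos = read_u32(buf, pos)        # 02 00 00 00
    class_name, pos = read_lp_string(buf, pos)  # "Package"
    _, pos = read_lp_string(buf, pos)          # empty in this sample
    _, pos = read_lp_string(buf, pos)          # empty in this sample
    size, pos = read_u32(buf, pos)             # 66 00 00 00
    native = buf[pos:pos + size]

    p = 2                                      # skip the leading 02 00
    label, p = read_cstring(native, p)         # "Hw.txt"
    source_path, p = read_cstring(native, p)
    p += 4                                     # 00 00 03 00, meaning unknown
    temp_path, p = read_lp_string(native, p)
    data_len, p = read_u32(native, p)
    data = native[p:p + data_len]
    return {
        "version": version,
        "format_id": format_id,
        "class_name": class_name,
        "size": size,
        "label": label,
        "source_path": source_path,
        "temp_path": temp_path,
        "data": data,
    }


result = parse_package(DUMP)
```

On this sample the declared size (0x66) ends exactly where the second `01 05 00 00` block (the METAFILEPICT one) begins at offset 0x86, which is some evidence the length fields are being read correctly.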
[jira] [Created] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
pm created TIKA-1266:
---------------------
Summary: Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
Key: TIKA-1266
URL: https://issues.apache.org/jira/browse/TIKA-1266
Project: Tika
Issue Type: Improvement
Components: packaging
Affects Versions: 1.5, 1.4
Reporter: pm

The tika-bundle currently has the Embed-Dependency header filled with the embedded dependencies. Embed-Dependency is not defined in the OSGi spec; Bundle-ClassPath is. Please add a Bundle-ClassPath header containing "." followed by the list of embedded JAR names.
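To make the request concrete, here is a minimal sketch of the header value being asked for: "." (the bundle root) first, then each embedded JAR, comma-separated. The JAR names in the usage line are hypothetical placeholders, not the actual contents of tika-bundle, and the function name is made up for illustration.

```python
def bundle_classpath(jar_names):
    """Build a Bundle-ClassPath manifest header: '.' first, then the JARs."""
    entries = ["."] + list(jar_names)
    return "Bundle-ClassPath: " + ",".join(entries)


# Hypothetical JAR names, for illustration only.
header = bundle_classpath(["example-dep-a.jar", "example-dep-b.jar"])
```

A real manifest would also need long header lines wrapped, which this sketch does not attempt.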
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955104#comment-13955104 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Thank you, Chris. The test doc that I posted late last week (testRTF_embbededFiles.zip) has examples of embedded (doc|ppt|xls)x? files. The issue is that for those files (and PDFs), I don't think the name of the file is stored in any metadata in the file, nor in the POIFS embobj itself.

I just wanted to send a heads up about this apparent limitation in RTF. There are two other places that I still need to check to confirm it: one is the {\result {\shppict}} that immediately follows the embobj (this is the thumbnail pict), and the other is the \nonshppict that can also follow an embobj. It is entirely possible that the info is stored in POIFS or elsewhere, and I'm just not seeing it.

I did add processing to pull the name of the file from: 1) the embobj header, if it exists (which it seems to for files other than MS/PDF), and 2) the pict's metadata {sp {sn}{sv}}.

I think I've changed my mind on handling thumbnails separately. My current plan is to extract all embedded data. I'll add info to the Metadata obj about whether the file is an embobj or a thumbnail ({\result {\shppict }} or {\result {\nonshppict}}). The client code can then decide what to do with the embedded data.

I'll add binary processing (thank you for your pointer to TIKA-782!) and post a draft of the patch late this afternoon or tomorrow.
[jira] [Resolved] (TIKA-1244) Better parsing of Mbox files
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hong-Thai Nguyen resolved TIKA-1244.
-----------------------------------
Resolution: Fixed
Fix Version/s: 1.6

Committed in r1583305, thanks [~lfcnassif]. I preserved the metadata extraction from the current MboxParser because the message/rfc822 parser does not seem able to extract all header fields.

Better parsing of Mbox files
----------------------------
Key: TIKA-1244
URL: https://issues.apache.org/jira/browse/TIKA-1244
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
Assignee: Hong-Thai Nguyen
Fix For: 1.6
Attachments: MboxParser.java.patch

MboxParser currently loses the metadata of all emails except the first. It does not extract or parse the emails, nor decode their parts. It should handle embedded emails like other container parsers do, so that emails will be automatically parsed by RFC822Parser. I will try to add a patch for this.
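As a rough illustration of the container-style handling described in the issue (this is not the Tika patch, just a Python sketch of the idea): split the mbox on its "From " separator lines, then hand each message to an RFC 822 parser so every email gets its own metadata instead of only the first. The sample mailbox, addresses, and function names are made up, and real mbox files have quoting rules (for example ">From " in bodies) that this ignores.

```python
from email import message_from_bytes, policy

# Made-up two-message mbox, for illustration only.
SAMPLE = (
    b"From a@example.com Mon Mar 31 10:00:00 2014\n"
    b"From: a@example.com\n"
    b"Subject: first\n"
    b"\n"
    b"body one\n"
    b"\n"
    b"From b@example.com Mon Mar 31 11:00:00 2014\n"
    b"From: b@example.com\n"
    b"Subject: second\n"
    b"\n"
    b"body two\n"
)


def split_mbox(raw):
    """Split raw mbox bytes into one bytes object per message."""
    messages, current = [], None
    for line in raw.splitlines(keepends=True):
        if line.startswith(b"From "):
            # A separator line closes the previous message and opens a new one.
            if current is not None:
                messages.append(b"".join(current))
            current = []
        elif current is not None:
            current.append(line)
    if current is not None:
        messages.append(b"".join(current))
    return messages


def per_message_metadata(raw):
    """Parse each embedded message separately and collect its headers."""
    out = []
    for chunk in split_mbox(raw):
        msg = message_from_bytes(chunk, policy=policy.default)
        out.append({
            "from": str(msg.get("From", "")),
            "subject": str(msg.get("Subject", "")),
        })
    return out


metadata = per_message_metadata(SAMPLE)
```

The point of the split is that each message is parsed on its own, so no message's headers overwrite another's.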
Re: How to exclude a mimetype from being indexed in solr using tika?
Thanks for the reply, I'll investigate the EmbeddedDocumentExtractor. The Solr community told me it is a Tika issue and the Tika community told me it's a Solr issue... Oh boy... :(

-- View this message in context: http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-from-being-indexed-in-solr-using-tika-tp4127767p4128188.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
[jira] [Commented] (TIKA-1244) Better parsing of Mbox files
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13955221#comment-13955221 ]

Luis Filipe Nassif commented on TIKA-1244:
------------------------------------------

Thank you, [~thaichat04]. I think the metadata tracking feature is very useful!
[jira] [Created] (TIKA-1267) Improve Mbox file detection
Luis Filipe Nassif created TIKA-1267:
-------------------------------------
Summary: Improve Mbox file detection
Key: TIKA-1267
URL: https://issues.apache.org/jira/browse/TIKA-1267
Project: Tika
Issue Type: Improvement
Components: mime
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
Priority: Minor

Could we add the code below to the application/mbox mime-type definition?
{code}
<magic priority="70">
  <match value="From" type="string" offset="0"/>
</magic>
{code}
Or is it too common out there?
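To illustrate the "too common?" concern, here is a small Python sketch of what a string match at offset 0 amounts to. It is not Tika's detector, and the sample inputs and function name are invented. It shows that the rule accepts a real mbox separator line, but also any other text that happens to begin with the same bytes.

```python
MAGIC = b"From"  # value from the proposed match rule, checked at offset 0


def matches_mbox_magic(data, magic=MAGIC, offset=0):
    """Return True if data contains magic at exactly the given offset."""
    return data[offset:offset + len(magic)] == magic


# Invented sample inputs.
mbox_start = b"From a@example.com Mon Mar 31 10:00:00 2014\n"
plain_letter = b"From the desk of the editor:\n"
other_text = b"Subject: hello\n"

is_mbox = matches_mbox_magic(mbox_start)
is_false_positive = matches_mbox_magic(plain_letter)
is_rejected = not matches_mbox_magic(other_text)
```

The plain-text letter matches as well, which is the kind of false positive the question is worried about; the priority value in the proposal only decides how this rule ranks against other matching rules.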