[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bamford updated TIKA-1010:

Attachment: 114032807362001301.gif
            114032807362001201.gif
            114032807362001101.gif
            114032807362001001.gif
            114032807362000901.gif
            114032807362000801.gif

Hi Tim

I think at least one of my test files uses a Package to wrap the object, so should be useful. I also am continuing to search for one containing a \pict with binary encoding.

Cheers
Chris

Chris Bamford
Senior Developer
m: +44 7860 405292
p: +44 207 847 8700
w: www.mimecast.com
Address click here: www.mimecast.com/About-us/Contact-us/

Embedded documents in RTF are not extracted
---
Key: TIKA-1010
URL: https://issues.apache.org/jira/browse/TIKA-1010
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
Attachments: 114032807362000801.gif, 114032807362000901.gif, 114032807362001001.gif, 114032807362001101.gif, 114032807362001201.gif, 114032807362001301.gif, ExampleRTFs.zip, outer.rtf

When an RTF doc embeds a doc it looks like this:

{noformat}
{\object\objemb \objw628\objh765{\*\objclass Package}{\*\objdata 0105020008005061636b616765006600
020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
5404bbfaee00080054044505
01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
{noformat}

But, unfortunately, the format of those hex bytes is not spelled out in the RTF spec ... the spec merely says the bytes are saved by the OLESaveToStream function ... and I haven't been able to find a description of what the bytes mean.

In this case they are a Package object (\objclass Package), which I think is an [old?] way to wrap any non-OLE file (this is just a .txt file). Here's the hex dump:

{noformat}
0000  01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b  |............Pack|
0010  61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00  |age.........f...|
0020  02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
0030  4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
0040  74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22  |top\HW.txt....."|
0050  00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
0060  67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
0070  2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
0080  6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00  |orld............|
0090  00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
00a0  04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45  |.............T.E|
00b0  05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c  |.........s......|
00c0  00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05  |................|
00d0  00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5  |.....2.)........|
00e0  ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00  |................|
00f0  00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
0100  00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
0110  83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00  |.....-..........|
0120  00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00  |................|
0130  00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00  |................|
0140  00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
0150  00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00  |................|
0160  00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65  |.........."Syste|
0170  6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
0180  00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d  |...............-|
0190  01 01 00 03 00 00 00 00 00                       |.........|
0199
{noformat}

Anyway I have no idea how to decode the bytes at this point ... just opening the issue in case anyone else does!

--
This message was sent by Atlassian JIRA
(v6.2#6252)
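One possible reading of the first 32 bytes of the hex dump, offered as a sketch rather than a confirmed decoding: it assumes the \objdata bytes start with a little-endian header made of a 4-byte version, a 4-byte format ID, three length-prefixed NUL-terminated strings (class name, topic name, item name) and a 4-byte payload size, which I believe is the OLE 1.0 embedded-object layout. The class and field names below are this sketch's own, not Tika's. One point in favour of this reading: the payload size comes out as 0x66, and 0x20 + 0x66 = 0x86 is exactly where the next "01 05 00 00" header (the METAFILEPICT block) starts in the dump.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class OleHeaderSketch {

    /** Fields read from the start of the \objdata bytes (names are this sketch's own). */
    static final class Header {
        final int version;      // first 4 bytes, little-endian
        final int formatId;     // next 4 bytes
        final String className; // length-prefixed string, trailing NUL dropped
        final int dataSize;     // number of payload bytes that follow the header

        Header(int version, int formatId, String className, int dataSize) {
            this.version = version;
            this.formatId = formatId;
            this.className = className;
            this.dataSize = dataSize;
        }
    }

    /** Turns "01 05 00 ..." into bytes; whitespace between the hex pairs is ignored. */
    static byte[] fromHex(String hex) {
        String digits = hex.replaceAll("\\s+", "");
        byte[] out = new byte[digits.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(digits.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads a 4-byte length followed by that many bytes, dropping a trailing NUL if present. */
    static String readString(ByteBuffer buf) {
        int length = buf.getInt();
        byte[] raw = new byte[length];
        buf.get(raw);
        int end = (length > 0 && raw[length - 1] == 0) ? length - 1 : length;
        return new String(raw, 0, end, StandardCharsets.US_ASCII);
    }

    /** Parses the assumed header layout from the start of the given bytes. */
    static Header parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        int version = buf.getInt();
        int formatId = buf.getInt();
        String className = readString(buf);
        readString(buf); // topic name (empty in this dump)
        readString(buf); // item name (empty in this dump)
        int dataSize = buf.getInt();
        return new Header(version, formatId, className, dataSize);
    }

    public static void main(String[] args) {
        // The first two rows (offsets 0000 and 0010) of the hex dump.
        Header h = parse(fromHex(
                "01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b"
              + "61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00"));
        System.out.printf("version=0x%x formatId=%d class=%s dataSize=%d%n",
                h.version, h.formatId, h.className, h.dataSize);
    }
}
```

If the assumption holds, the 102 payload bytes starting at offset 0x20 would be the Package data itself (the file label, the original path and the "Hello World" content are all visible there).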
[jira] [Commented] (TIKA-93) OCR support
[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950487#comment-13950487 ]

Timo Boehme commented on TIKA-93:

Hi Anurag,

which PDF are you referring to? Without knowing the size, page count and structure of the pages, it is hard to say what is going wrong. For instance, as I already wrote in my last comment, the pages could contain a large number of images (e.g. one per word or chunk) instead of a single one per page. Try printing the PDF to images (one per page) and running those through Tesseract.

OCR support
---
Key: TIKA-93
URL: https://issues.apache.org/jira/browse/TIKA-93
Project: Tika
Issue Type: New Feature
Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.6
Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, testOCR.docx, testOCR.pdf, testOCR.pptx

I don't know of any decent open source pure Java OCR libraries, but there are command line OCR tools like Tesseract (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to extract text content (where available) from image files.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
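For readers unfamiliar with the idea in the issue description, here is a rough sketch of invoking the Tesseract command line tool from Java on a single page image. It is not the code from the attached patches. It assumes a `tesseract` binary on the PATH that accepts `tesseract <image> <outputbase>` and writes its result to `<outputbase>.txt`, and it assumes Java 11 or later; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

final class TesseractCliSketch {

    /** Builds the command line: tesseract <image> <outputbase>. */
    static List<String> command(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString());
    }

    /** Runs Tesseract on one image and returns the recognised text. */
    static String ocr(Path image) throws IOException, InterruptedException {
        // Tesseract appends ".txt" to the output base name itself.
        Path outputBase = Files.createTempFile("ocr", "");
        Path textFile = Path.of(outputBase + ".txt");
        try {
            Process process = new ProcessBuilder(command(image, outputBase))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return Files.readString(textFile);
        } finally {
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(outputBase);
        }
    }
}
```

Following Timo's suggestion, a PDF would first be rendered to one image per page, and `ocr` would then be called once per page image.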
Re: PDF parser (two more questions)
Hi Jukka,

thanks a lot for your reply.

On #1, I am still wondering why we need structure information for indexing. Is there any particular reason? Wouldn't it make more sense to get just the text by default and the structure only optionally?

On #2, I expected the code you presented would not work. And in fact the pattern is quite odd, isn't it? What is the reason for throwing the exception if limiting the text read is a legal use case? (I am asking just to understand the background.)

Ste

On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:

> Hi,
>
> On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari stefano.forn...@gmail.com wrote:
>> 1. is the use of PDF2XHTML necessary? why is the pdf turned into an XHTML? for the purpose of indexing, wouldn't just the text be enough?
>
> The XHTML output allows us to annotate the extracted text with structural information (like "this is a heading", "here's a hyperlink", etc.) that would be difficult to express with text-only output. A client that needs just the text content can get it easily with the BodyContentHandler class.
>
>> 2. I need to limit the index of the content to files whose size is below to a certain threshold; I was wondering if this could be a parser configuration option and thus if you would accept this change.
>
> Do you want to entirely exclude too large files, or just index the first few pages of such files (which is more common in many indexing use cases)?
>
> The latter use case can be implemented with the writeLimit parameter of the WriteOutContentHandler class, like this:
>
>     // Extract up to 100k characters from a given document
>     WriteOutContentHandler out = new WriteOutContentHandler(100_000);
>     try {
>         parser.parse(..., new BodyContentHandler(out), ...);
>     } catch (SAXException e) {
>         // Rethrow everything except the "write limit reached" signal
>         if (!out.isWriteLimitReached(e)) {
>             throw e;
>         }
>     }
>     String content = out.toString();
>
> BR,
>
> Jukka Zitting
[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950461#comment-13950461 ]

Chris Bamford edited comment on TIKA-1010 at 3/28/14 9:49 AM:

Hi Tim

I think at least one of my test files uses a Package to wrap the object, so should be useful. I also am continuing to search for one containing a \pict with binary encoding.

Cheers
Chris

was (Author: bammers):

Hi Tim

I think at least one of my test files uses a Package to wrap the object, so should be useful. I also am continuing to search for one containing a \pict with binary encoding.

Cheers
Chris

Chris Bamford
Senior Developer
m: +44 7860 405292
p: +44 207 847 8700
w: www.mimecast.com
Address click here: www.mimecast.com/About-us/Contact-us/
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bamford updated TIKA-1010:

Attachment: (was: 114032807362001301.gif)
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bamford updated TIKA-1010:

Attachment: (was: 114032807362001001.gif)
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bamford updated TIKA-1010:

Attachment: (was: 114032807362000801.gif)
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bamford updated TIKA-1010:

Attachment: (was: 114032807362000901.gif)
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Bamford updated TIKA-1010:

Attachment: (was: 114032807362001201.gif)
Re: How should video files with audio be handled by parsers?
I think you should have three info blocks: video streams, audio streams and subtitles (if the container supports embedding them). Sort them naturally, or by vid/aid/sid if present.

You shouldn't multiplex video and audio streams, since any video stream can be combined with any audio stream.

In terms of XML, you can have the container as the root element, which embeds the streams grouped by type.

--
Best regards,
Konstantin Gribov.

On 28.03.2014 at 1:29, Nick Burch apa...@gagravarr.org wrote:

> On Thu, 27 Mar 2014, Konstantin Gribov wrote:
>> Some containers (like matroska/mkv) tag audio and subtitle streams with a language tag and a comment. From mplayer console output:
>>
>> [lavf] stream 0: video (h264), -vid 0
>> [lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv
>> [lavf] stream 2: audio (ac3), -aid 1, -alang eng, Eng
>
> Ogg + CMML would give something similar
>
>> I don't know any established semantics for video streams but the first usually is default for playback.
>
> How should a Tika parser handle such a file though? Include the primary audio metadata with the video stream as the primary object, and report embedded items for the other audio streams? Report all as embedded items? Report the primary video stream as the main thing, and give all other video + audio as embedded items? Something else?
>
> Nick
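To make the "group the streams by type" suggestion concrete, here is a small sketch that takes the three mplayer lines quoted in this thread and groups them into per-type lists. It only illustrates the grouping idea; it is not a Tika API. The class name, the `id:codec:language` entry format and the "unknown" placeholder for a missing language tag are all made up for this example, and the regular expression is fitted to these three lines only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamGroupingSketch {

    // Matches lines such as:
    //   [lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv
    // Groups: 1 = stream index, 2 = type, 3 = codec, 4 = per-type id, 5 = language (optional).
    private static final Pattern LINE = Pattern.compile(
            "\\[lavf\\] stream (\\d+): (\\w+) \\((\\w+)\\), -[a-z]id (\\d+)(?:, -[a-z]lang (\\w+))?.*");

    /** Groups stream descriptions by type, keeping the input order within each type. */
    static Map<String, List<String>> groupByType(List<String> lines) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (!m.matches()) {
                continue; // not a stream line, skip it
            }
            String language = m.group(5) == null ? "unknown" : m.group(5);
            String entry = m.group(4) + ":" + m.group(3) + ":" + language;
            groups.computeIfAbsent(m.group(2), type -> new ArrayList<>()).add(entry);
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(groupByType(List.of(
                "[lavf] stream 0: video (h264), -vid 0",
                "[lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv",
                "[lavf] stream 2: audio (ac3), -aid 1, -alang eng, Eng")));
    }
}
```

Video and audio end up in separate lists, so no particular pairing of a video stream with an audio stream is implied.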
Re: PDF parser (two more questions)
The exception is rethrown only if the write limit has not been reached. So if an exception occurs within the first 100k characters, it affects the result. If the exception is thrown after that point, it is suppressed.

--
Best regards,
Konstantin Gribov.
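The catch-and-filter pattern discussed in this thread can be reproduced with nothing but the SAX classes shipped in the JDK. The sketch below is not Tika's WriteOutContentHandler; `LimitedTextHandler` and `LimitReachedException` are hypothetical stand-ins written for this example. It shows why an exception is involved at all: throwing from the handler callback is how a handler can stop a SAX parse early, and the caller then has to tell that "stop" signal apart from a genuine parse error.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WriteLimitSketch {

    /** Marker type (this sketch's own) so the caller can tell "limit hit" from real errors. */
    static final class LimitReachedException extends SAXException {
        LimitReachedException(String message) {
            super(message);
        }
    }

    /** Collects character data and aborts the parse once more than 'limit' characters arrive. */
    static final class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int limit;

        LimitedTextHandler(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room); // keep what still fits
                throw new LimitReachedException("limit of " + limit + " characters reached");
            }
            text.append(ch, start, length);
        }

        /** True if the given exception, or one of its causes, is the limit signal. */
        boolean isLimitReached(Throwable t) {
            for (; t != null; t = t.getCause()) {
                if (t instanceof LimitReachedException) {
                    return true;
                }
            }
            return false;
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    /** Extracts at most 'limit' characters of text; genuine parse errors still propagate. */
    static String extract(String xml, int limit) throws Exception {
        LimitedTextHandler out = new LimitedTextHandler(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), out);
        } catch (SAXException e) {
            if (!out.isLimitReached(e)) {
                throw e; // a real error, not the limit signal
            }
        }
        return out.toString();
    }
}
```

With this sketch, `extract("<doc>Hello World</doc>", 5)` yields the truncated text, while malformed input such as `"<doc>Hi"` still fails with the parser's own exception.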
Re: PDF parser (two more questions)
Yes, got it. Which is a strange use case: if I set the limit, first I would not expect an exception (which represents an unexpected error condition); secondly, I would not expect to rethrow it only under certain conditions. I understood the trick, but I am trying to understand this is done in this way (that at a first glance does not seem clean).
Re: PDF parser (two more questions)
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com wrote: I understood the trick, but I am trying to understand this is done in this way (that at a first glance does not seem clean). ... trying to understand why this is done in this way...
Re: PDF parser (two more questions)
SAXException is a checked exception, so you have to catch it or add it to the method's throws list (otherwise javac won't compile the code). Tika usually rethrows exceptions, wrapping them in TikaException. In the code above, the method throws SAXException. The exception is suppressed so that the parse does not fail after a valuable amount of data has already been extracted.

-- Best regards, Konstantin Gribov.

On 28.03.2014 14:27, Stefano Fornari stefano.forn...@gmail.com wrote: On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com wrote: I understood the trick, but I am trying to understand this is done in this way (that at a first glance does not seem clean). ... trying to understand why this is done in this way...
Re: How should video files with audio be handled by parsers?
On Fri, 28 Mar 2014, Konstantin Gribov wrote:

I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present.

That's not something Tika supports though. We have a metadata object we can populate with some things, or we can trigger for embedded objects. The Metadata object doesn't support nesting.

Nick
Re: PDF parser (two more questions)
Well, I should look at the code, which I can't do right now, but I guess my point is that BodyContentHandler should not throw the exception (and most probably not a SAXException in any case) when the limit is reached. This means that the limit should not be put on the WriteOutContentHandler, but on BodyContentHandler. Ste

On Fri, Mar 28, 2014 at 11:52 AM, Konstantin Gribov gros...@gmail.com wrote: SAXException is checked, so you have to catch it or add to method throws list (or javac wouldn't compile it). Tika usually rethrows exceptions enveloping them into TikaException. In case of code above method throws SAXException. Suppressing the exception is done to avoid parser fail after parsing valuable amount of data. -- Best regards, Konstantin Gribov.

On 28.03.2014 14:27, Stefano Fornari stefano.forn...@gmail.com wrote: On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com wrote: I understood the trick, but I am trying to understand this is done in this way (that at a first glance does not seem clean). ... trying to understand why this is done in this way...
Re: PDF parser (two more questions)
All such handlers are implementations of the org.xml.sax.ContentHandler interface, so their methods throw SAXException. But in the code above, none of the ContentHandler methods are invoked directly (only inside parser.parse, where the content handler is passed). You can take a look at org.apache.tika.Tika.parseToString(InputStream, Metadata, int) as a reference. It has code similar to Jukka's code above.

-- Best regards, Konstantin Gribov.

2014-03-28 15:47 GMT+04:00 Stefano Fornari stefano.forn...@gmail.com: well, I should look at the code, I can't do it now, but I guess my point is that BodyContentHandler should not throw the exception (and most probably not a SAXException in any case) in the case the limit is reached. This means that the limit should not put on the WriteOutContentHandler, but on BodyContentHandler. Ste

On Fri, Mar 28, 2014 at 11:52 AM, Konstantin Gribov gros...@gmail.com wrote: SAXException is checked, so you have to catch it or add to method throws list (or javac wouldn't compile it). Tika usually rethrows exceptions enveloping them into TikaException. In case of code above method throws SAXException. Suppressing the exception is done to avoid parser fail after parsing valuable amount of data. -- Best regards, Konstantin Gribov.

On 28.03.2014 14:27, Stefano Fornari stefano.forn...@gmail.com wrote: On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com wrote: I understood the trick, but I am trying to understand this is done in this way (that at a first glance does not seem clean). ... trying to understand why this is done in this way...
Re: How should video files with audio be handled by parsers?
I was talking about output to the content handler, not to metadata. How to handle metadata for containers with several video streams is another problem. The Tika metadata model seems rather strange to me, so I try not to look at it too often =)

-- Best regards, Konstantin Gribov.

2014-03-28 14:59 GMT+04:00 Nick Burch apa...@gagravarr.org: On Fri, 28 Mar 2014, Konstantin Gribov wrote: I think you should have three info blocks: video streams, audio streams and subtitles (if container supports their embedding). Sort naturally or by vid/aid/sid if present. That's not something Tika supports though. We have a metadata object we can populate with some things, or we can trigger for embedded objects. The Metadata object doesn't support nesting Nick
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950628#comment-13950628 ] Tim Allison commented on TIKA-1010: --- Chris, Thank you for digging into the spec and sharing test files. For some reason, I can't find the gifs that JIRA reported you attaching earlier today. Y, I'm in search of a binary test file. Please share one if you can find it. I think I'm good on package files. I'll attach my two main test files shortly. I've tested on the files within the zip that you submitted earlier, and all is good. Embedded documents in RTF are not extracted --- Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Tim Allison Attachments: ExampleRTFs.zip, outer.rtf When an RTF doc embeds a doc it looks like this: {noformat} {\object\objemb \objw628\objh765{\*\objclass Package}{\*\objdata 0105020008005061636b616765006600 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400 5404bbfaee00080054044505 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300} {noformat} But, unfortunately, the format of those hex bytes is not spelled out in the RTF spec ... the spec merely says the bytes are saved by the OLESaveToStream function ... and I haven't been able to find a description of what the bytes mean. In this case they are a Package object (\objclass Package), which I think is an [old?] way to wrap any non-OLE file (this is just a .txt file). 
Here's the hex dump:
{noformat}
0000 01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b |............Pack|
0010 61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 |age.........f...|
0020 02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 |..Hw.txt.C:\DOCU|
0030 4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b |ME~1\igalsh\Desk|
0040 74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 |top\HW.txt....."|
0050 00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 |...C:\DOCUME~1\i|
0060 67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 |galsh\Desktop\HW|
0070 2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 |.txt.....Hello W|
0080 6f 72 6c 64 00 00 01 05 00 00 05 00 00 00 0d 00 |orld............|
0090 00 00 4d 45 54 41 46 49 4c 45 50 49 43 54 00 54 |..METAFILEPICT.T|
00a0 04 00 00 bb fa ff ff ee 00 00 00 08 00 54 04 45 |.............T.E|
00b0 05 00 00 01 00 09 00 00 03 73 00 00 00 02 00 1c |.........s......|
00c0 00 00 00 00 00 05 00 00 00 0b 02 00 00 00 00 05 |................|
00d0 00 00 00 0c 02 32 00 29 00 1c 00 00 00 fb 02 f5 |.....2.)........|
00e0 ff 00 00 00 00 00 00 90 01 00 00 00 01 00 00 00 |................|
00f0 00 54 61 68 6f 6d 61 00 00 55 17 0a 70 00 fc 07 |.Tahoma..U..p...|
0100 00 58 b1 f3 77 61 b1 f3 77 20 40 f5 77 49 36 66 |.X..wa..w @.wI6f|
0110 83 04 00 00 00 2d 01 00 00 05 00 00 00 09 02 00 |.....-..........|
0120 00 00 00 05 00 00 00 01 02 ff ff ff 00 05 00 00 |................|
0130 00 02 01 01 00 00 00 05 00 00 00 2e 01 06 00 00 |................|
0140 00 09 00 00 00 21 05 06 00 48 77 2e 74 78 74 21 |.....!...Hw.txt!|
0150 00 15 00 1c 00 00 00 fb 02 10 00 07 00 00 00 00 |................|
0160 00 bc 02 00 00 00 00 01 02 02 22 53 79 73 74 65 |.........."Syste|
0170 6d 00 00 49 36 66 83 00 00 0a 00 26 00 8a 01 00 |m..I6f.....&....|
0180 00 00 00 ff ff ff ff 8c fc 07 00 04 00 00 00 2d |...............-|
0190 01 01 00 03 00 00 00 00 00                      |.........|
0199
{noformat}
Anyway I have no idea how to decode the bytes at this point ... just opening the issue in case anyone else does!

-- This message was sent by Atlassian JIRA (v6.2#6252)
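The first two rows of the dump above appear to line up with the OLE 1.0 object header described in Microsoft's MS-OLEDS document: a 4-byte version, a 4-byte format id, three length-prefixed ANSI strings (class, topic, item), then a 4-byte native data size. That mapping is an assumption made here, not something stated in the issue. The sketch below reads those fields from the first 32 bytes; Ole1HeaderSketch and sampleHeader are hypothetical names for this example only.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reads the leading fields of an OLE 1.0 style object header.
// The field layout is assumed from MS-OLEDS, not taken from the issue text.
class Ole1HeaderSketch {
    final int version;
    final int formatId;
    final String className;
    final String topicName;
    final String itemName;
    final int nativeDataSize;
    final int nativeDataOffset;

    Ole1HeaderSketch(byte[] data) {
        // All multi-byte integers in the dump are little-endian.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        version = buf.getInt();
        formatId = buf.getInt();
        className = readString(buf);
        topicName = readString(buf);
        itemName = readString(buf);
        nativeDataSize = buf.getInt();
        nativeDataOffset = buf.position();
    }

    // Length-prefixed ANSI string: 4-byte length (counting the trailing NUL), then the bytes.
    private static String readString(ByteBuffer buf) {
        int len = buf.getInt();
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalArgumentException("bad string length: " + len);
        }
        if (len == 0) {
            return "";
        }
        byte[] raw = new byte[len];
        buf.get(raw);
        int end = raw[len - 1] == 0 ? len - 1 : len; // drop the NUL terminator
        return new String(raw, 0, end, StandardCharsets.US_ASCII);
    }

    // Rows 0000 and 0010 of the hex dump in the issue description.
    static byte[] sampleHeader() {
        return new byte[] {
            0x01, 0x05, 0x00, 0x00, 0x02, 0x00, 0x00, 0x00,
            0x08, 0x00, 0x00, 0x00, 0x50, 0x61, 0x63, 0x6b,
            0x61, 0x67, 0x65, 0x00, 0x00, 0x00, 0x00, 0x00,
            0x00, 0x00, 0x00, 0x00, 0x66, 0x00, 0x00, 0x00
        };
    }
}
```

Read this way, the class name is "Package" and the native data size is 0x66 (102) bytes starting at offset 0x20, which would end exactly at offset 0x86, where the next 01 05 00 00 sequence (followed by METAFILEPICT) begins in the dump.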
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1010: -- Attachment: testRTFRegularImages.rtf This is an example of regular images -- pict -- not embedded data. Embedded documents in RTF are not extracted --- Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Tim Allison Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf
[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1010: -- Attachment: testRTF_embbededFiles.zip This is the test file I'll use to test poifs package and embedded object formatted data. Embedded documents in RTF are not extracted --- Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Tim Allison Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950714#comment-13950714 ] Chris Bamford commented on TIKA-1010: - Hi Tim Sorry about the confusion with the GIFs - they had nothing to do with the case! They were part of my email footer, which Jira automatically attached to the ticket when I replied by email! So I removed them. Sounds like you're making great progress. I will provide a binary pict file as soon as I can locate one. Cheers, - Chris Embedded documents in RTF are not extracted --- Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Tim Allison Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files
[ https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen reassigned TIKA-1244: -- Assignee: Hong-Thai Nguyen Better parsing of Mbox files Key: TIKA-1244 URL: https://issues.apache.org/jira/browse/TIKA-1244 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.5 Reporter: Luis Filipe Nassif Assignee: Hong-Thai Nguyen Attachments: MboxParser.java.patch MboxParser currently loses the metadata of all emails except the first. It does not extract/parse emails, nor decode parts. It should handle embedded emails like other container parsers do, so emails will be automatically parsed by RFC822Parser. I will try to add a patch for this. -- This message was sent by Atlassian JIRA (v6.2#6252)
Re: metadata key for original file path?
On Fri, 28 Mar 2014, Allison, Timothy B. wrote:

In working on TIKA-1010, there are some cases where the full original file path is stored with an image or embedded document. TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name (right?), but what should I use for file path?

I can only suggest looking at what the zip (+ other archive formats) code does; that should be a good guide to embedded resources where we know the name of the resource.

Nick
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951009#comment-13951009 ] Chris Bamford commented on TIKA-1010: - Hi again Tim Dunno if this helps, but there is a generic RTF parser kit on GitHub (https://github.com/joniles/rtfparserkit) which knows how to navigate RTFs. I'm playing with it now - it doesn't do much except parse but it might provide some insight? Cheers, - Chris Embedded documents in RTF are not extracted --- Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Tim Allison Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004 ] Tim Allison edited comment on TIKA-1010 at 3/28/14 4:47 PM: Chris, Thanks for pointing that out. The objdata in logo.rtf is of type pbrush (bitmap) but it is encoded in the regular hexpairs. I just added handling for that. The thumbnail/result/metafile is binary, but I've chosen not to extract thumbnails/emf or other meta-embeddings. This ok?

was (Author: talli...@mitre.org): Chris, Thanks for pointing that out. The objdata in logo.rtf is of type pbrush (bitmap) but it is encoded in the regular hexpairs. I just added handling for that. I'm not sure that that is what is meant by the binary pict type you pointed out in the spec.
{noformat}
An RTF file can include pictures created with other applications. These pictures can be in hexadecimal (the default) or binary format. Pictures are destinations, and begin with the \pict control word. The \pict keyword is preceded by \*\shppict destination control keyword as described in the following example. A picture destination has the following syntax:
pict '{' \pict (brdr? shading? picttype pictsize metafileinfo?) data '}'
picttype | \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | \wmetafile | \dibitmap bitmapinfo | \wbitmap bitmapinfo
bitmapinfo \wbmbitspixel \wbmplanes \wbmwidthbytes
pictsize (\picw \pich) \picwgoal? \pichgoal? \picscalex? \picscaley? \picscaled? \piccropt? \piccropb? \piccropr? \piccropl?
metafileinfo \picbmp \picbpp
data (\bin #BDATA) | #SDATA
{noformat}
My guess from that is that we'd see something like:
{noformat}
{pict ...\bin 0101000100010001
{noformat}
Embedded documents in RTF are not extracted --- Key: TIKA-1010 URL: https://issues.apache.org/jira/browse/TIKA-1010 Project: Tika Issue Type: Bug Components: parser Reporter: Michael McCandless Assignee: Tim Allison Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Chris,
  Thanks for pointing that out. The objdata in logo.rtf is of type pbrush (bitmap), but it is encoded in the regular hex pairs. I just added handling for that. I'm not sure that this is what is meant by the binary pict type you pointed out in the spec.

{noformat}
An RTF file can include pictures created with other applications. These pictures can be in hexadecimal (the default) or binary format. Pictures are destinations, and begin with the \pict control word. The \pict keyword is preceded by \*\shppict destination control keyword as described in the following example. A picture destination has the following syntax:

pict          '{' \pict (brdr? shading? picttype pictsize metafileinfo?) data '}'
picttype      | \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | \wmetafile | \dibitmap bitmapinfo | \wbitmap bitmapinfo
bitmapinfo    \wbmbitspixel \wbmplanes \wbmwidthbytes
pictsize      (\picw \pich) \picwgoal? \pichgoal? \picscalex? \picscaley? \picscaled? \piccropt? \piccropb? \piccropr? \piccropl?
metafileinfo  \picbmp \picbpp
data          (\bin #BDATA) | #SDATA
{noformat}

My guess from that is that we'd see something like:

{noformat}
{pict {objdata \bin 0101000100010001
{noformat}
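The "regular hex pairs" encoding mentioned in this comment stores each byte as two hexadecimal digits, and the objdata in the issue description shows whitespace between runs of digits. A minimal Java sketch of decoding such text follows. The class and method names are hypothetical and this is not Tika's actual RTF parser code; it only illustrates the encoding.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: decodes RTF-style hex pairs.
final class HexPairSketch {

    /** Decodes hex-pair text into bytes, ignoring whitespace between digits. */
    public static byte[] decodeHexPairs(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (Character.isWhitespace(ch)) {
                continue; // spaces and line breaks carry no data
            }
            int nibble = Character.digit(ch, 16);
            if (nibble < 0) {
                throw new IllegalArgumentException("Not a hex digit at offset " + i + ": " + ch);
            }
            if (high < 0) {
                high = nibble;
            } else {
                out.write((high << 4) | nibble);
                high = -1;
            }
        }
        if (high >= 0) {
            throw new IllegalArgumentException("Odd number of hex digits");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // This digit run appears in the objdata quoted in the issue description.
        byte[] bytes = decodeHexPairs("48656c6c6f 20576f\n726c64");
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // prints "Hello World"
    }
}
```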
[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951004#comment-13951004 ]

Tim Allison edited comment on TIKA-1010 at 3/28/14 4:44 PM:
------------------------------------------------------------

Chris,
  Thanks for pointing that out. The objdata in logo.rtf is of type pbrush (bitmap), but it is encoded in the regular hex pairs. I just added handling for that. I'm not sure that this is what is meant by the binary pict type you pointed out in the spec.

{noformat}
An RTF file can include pictures created with other applications. These pictures can be in hexadecimal (the default) or binary format. Pictures are destinations, and begin with the \pict control word. The \pict keyword is preceded by \*\shppict destination control keyword as described in the following example. A picture destination has the following syntax:

pict          '{' \pict (brdr? shading? picttype pictsize metafileinfo?) data '}'
picttype      | \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | \wmetafile | \dibitmap bitmapinfo | \wbitmap bitmapinfo
bitmapinfo    \wbmbitspixel \wbmplanes \wbmwidthbytes
pictsize      (\picw \pich) \picwgoal? \pichgoal? \picscalex? \picscaley? \picscaled? \piccropt? \piccropb? \piccropr? \piccropl?
metafileinfo  \picbmp \picbpp
data          (\bin #BDATA) | #SDATA
{noformat}

My guess from that is that we'd see something like:

{noformat}
{pict ...\bin 0101000100010001
{noformat}
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951097#comment-13951097 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

The binary actually looks like this:

{noformat}
{pict {objdata \bin270141 .
{noformat}

The 270,141 after \bin is the number of bytes to read (the size of the blob). Note that the blob can (and does in this case!) contain '}', which is not the group end marker but actual 'binary' data.
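The point about '}' inside the blob can be illustrated with a short Java sketch: once the count after \bin is known, a reader has to consume exactly that many raw bytes and must not scan them for braces. The class and method names below are hypothetical and the code is not taken from Tika; it assumes the caller has already consumed the characters "\bin" and that a single space after the count is the control-word delimiter.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: reads the payload of a \binN control word.
final class BinControlWordSketch {

    /** Call after "\bin" has been consumed; returns exactly N raw bytes. */
    public static byte[] readBinPayload(PushbackInputStream in) throws IOException {
        long count = 0;
        int digits = 0;
        int c = in.read();
        while (c >= '0' && c <= '9') {
            count = count * 10 + (c - '0');
            if (count > Integer.MAX_VALUE) {
                throw new IOException("\\bin byte count too large");
            }
            digits++;
            c = in.read();
        }
        if (digits == 0) {
            throw new IOException("\\bin without a byte count");
        }
        // A single space only delimits the control word; any other byte is data.
        if (c != ' ' && c != -1) {
            in.unread(c);
        }
        byte[] payload = new byte[(int) count];
        int off = 0;
        while (off < payload.length) {
            int n = in.read(payload, off, payload.length - off);
            if (n < 0) {
                throw new EOFException("Stream ended inside \\bin payload");
            }
            off += n;
        }
        return payload; // braces inside the payload are never interpreted
    }

    public static void main(String[] args) throws IOException {
        byte[] rtf = "5 ab}c}}".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(rtf));
        byte[] payload = readBinPayload(in);
        System.out.println(new String(payload, StandardCharsets.ISO_8859_1)); // prints "ab}c}"
        System.out.println((char) in.read()); // prints "}", the real group end
    }
}
```

Counting bytes rather than tracking brace depth is the design choice that matters here: the two '}' characters inside the five-byte payload are returned as data, and only the brace after the payload is left for the group parser.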
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951092#comment-13951092 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

Ideally I'd like to be able to extract any file, but let's get the main cases covered off first!
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951107#comment-13951107 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

Yes, thanks, I got that. I can add an "extract all" mode vs. an "extract logical" mode. I'll set the default to "extract logical" unless there are objections from the community.

I now have success against the two files I posted earlier today. When I add the "extract all" vs. "extract logical" parameter, I can use Mike's testBinControlWord (from TIKA-782) as a test.
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13951116#comment-13951116 ]

Tim Allison commented on TIKA-1010:
-----------------------------------

As a side note, I can grab file names for:
1) images that are regular images (not embedded objdata)
2) non-POIFS embedded files (html, embedded objdata images, zip, msg)

I can't find file names for: xls, xlsx, doc, docx, ppt, pptx, pdf
How to exclude a mimetype form being indexed in solr using tika?
Good afternoon,
I already asked this question on the solr-user forum and didn't get anywhere. They suggested I ask the Tika community...

I'm using Solr 4.0 Final.
I have movies hidden in zip files that need to be excluded from the index.
I can't filter movies on the crawler because then I would have to exclude all zip files.
I was told I can have Tika skip the movies, but the details are escaping me at this point.
How do I exclude a file in the Tika configuration?
I assume it's something I add in the update/extract handler, but I'm not sure.

Thanks,

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-form-being-indexed-in-solr-using-tika-tp4127767.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.
[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted
[ https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13950835#comment-13950835 ]

Chris Bamford commented on TIKA-1010:
-------------------------------------

Hi Tim

I have found one - please see https://issues.apache.org/jira/browse/TIKA-782 (logo.rtf inside logo.zip).

Best,

- Chris
Re: How to exclude a mimetype form being indexed in solr using tika?
On Fri, 28 Mar 2014, eShard wrote:
> I'm using solr 4.0 Final
> I need movies hidden in zip files that need to be excluded from the index.
> I can't filter movies on the crawler because then I would have to exclude
> all zip files.

If you're calling Tika directly, this is very easy. When Tika hits embedded resources, it'll call out to your code, and you can then choose whether to process or ignore each one. (This is all done via an EmbeddedDocumentExtractor, which you supply on the ParseContext.)

> How do I exclude a file in the tika configuration?
> I assume it's something I add in the update/extract handler but I'm not sure.

I've no idea how, or if, you can tell the Solr code to ask Tika to do that. That's something you'll have to go back to the Solr community about, as they maintain that code.

Nick
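As a rough illustration of the per-resource decision described above, here is a minimal Java sketch of the kind of check a custom EmbeddedDocumentExtractor could make for each embedded entry. Only the decision logic is shown, so that it runs without Tika on the classpath; the class name, method name and extension list are assumptions made for this example. In real code the two strings would come from the Metadata object passed to the extractor, and the content type may well be missing for zip entries at that point, which is why the sketch also checks the file name.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical decision helper; wiring it into Tika is left out on purpose.
final class MovieFilterSketch {

    // Assumed list of movie extensions; adjust to the files actually seen.
    private static final Set<String> MOVIE_EXTENSIONS = new HashSet<>(
            Arrays.asList("avi", "mov", "mp4", "mkv", "wmv", "mpg", "mpeg"));

    /** Returns false for embedded resources that look like movies. */
    public static boolean shouldParse(String resourceName, String contentType) {
        if (contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("video/")) {
            return false;
        }
        if (resourceName != null) {
            int dot = resourceName.lastIndexOf('.');
            if (dot >= 0 && dot < resourceName.length() - 1) {
                String ext = resourceName.substring(dot + 1).toLowerCase(Locale.ROOT);
                if (MOVIE_EXTENSIONS.contains(ext)) {
                    return false;
                }
            }
        }
        return true; // everything else is handed on for parsing
    }

    public static void main(String[] args) {
        System.out.println(shouldParse("clips/holiday.MP4", null));        // prints "false"
        System.out.println(shouldParse("report.pdf", "application/pdf"));  // prints "true"
    }
}
```

An extractor built around this check would return its result from the method that decides whether an embedded document is parsed, and would be registered on the ParseContext before calling the parser.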