[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bamford updated TIKA-1010:


Attachment: 114032807362001301.gif
114032807362001201.gif
114032807362001101.gif
114032807362001001.gif
114032807362000901.gif
114032807362000801.gif

Hi Tim

I think at least one of my test files uses a Package to wrap the object, so 
it should be useful.
I am also continuing to search for one containing a \pict with binary encoding.
Cheers

Chris

Chris Bamford
Senior Developer
m: +44 7860 405292
p: +44 207 847 8700
w: www.mimecast.com
Address click here: www.mimecast.com/About-us/Contact-us/







 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: 114032807362000801.gif, 114032807362000901.gif, 
 114032807362001001.gif, 114032807362001101.gif, 114032807362001201.gif, 
 114032807362001301.gif, ExampleRTFs.zip, outer.rtf


 When an RTF doc embeds a doc it looks like this:
 {noformat}
 {\object\objemb
 \objw628\objh765{\*\objclass Package}{\*\objdata 
 0105000002000000080000005061636b61676500000000000000000066000000
 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000000030022000000433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b00000048656c6c6f20576f726c64000001050000050000000d0000004d45544146494c455049435400
 54040000bbfaffffee0000000800540445050000
 0100090000037300000002001c0000000000050000000b0200000000050000000c02320029001c000000fb02f5ff000000000000900100000001000000005461686f6d61000055170a7000fc070058b1f37761b1f3772040f57749366683040000002d01000005000000090200000000050000000102ffffff0005000000
 020101000000050000002e0106000000090000002105060048772e747874210015001c000000fb021000070000000000bc02000000000102022253797374656d00004936668300000a0026008a0100000000ffffffff8cfc0700040000002d010100030000000000}
 {noformat}
 But, unfortunately, the format of those hex bytes is not spelled out
 in the RTF spec ... the spec merely says the bytes are saved by the
 OLESaveToStream function ... and I haven't been able to find a
 description of what the bytes mean.
 In this case they are a Package object (\objclass Package), which I
 think is an [old?] way to wrap any non-OLE file (this is just a .txt
 file).
 Here's the hex dump:
 {noformat}
 00000000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
 00000010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
 00000020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
 00000030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
 00000040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
 00000050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
 00000060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
 00000070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
 00000080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
 00000090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
 000000a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
 000000b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
 000000c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
 000000d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
 000000e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
 000000f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
 00000100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
 00000110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
 00000120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
 00000130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
 00000140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
 00000150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
 00000160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
 00000170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
 00000180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
 00000190  01 01 00 03 00 00 00 00  00                       |.........|
 00000199
 {noformat}
 Anyway I have no idea how to decode the bytes at this point ... just
 opening the issue in case anyone else does!
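
 One possible reading of the first 0x86 bytes of the dump, offered only as a
 sketch: the length fields in this one sample are self-consistent if the data
 starts with two 32-bit little-endian integers, a length-prefixed class name,
 two empty length-prefixed strings and a 32-bit data size (0x66), followed by
 a small record holding a label, two copies of the source path and the file
 contents. The field meanings in the comments, and the names PackageSketch,
 lengthPrefixed, cString and decode, are assumptions inferred from this dump,
 not taken from any specification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class PackageSketch {
    // Bytes 0x00-0x85 of the hex dump: a 32-byte header plus 0x66 bytes of data.
    static final String HEX =
          "0105000002000000080000005061636b" + "61676500000000000000000066000000"
        + "020048772e74787400433a5c444f4355" + "4d457e315c6967616c73685c4465736b"
        + "746f705c48572e747874000000030022" + "000000433a5c444f43554d457e315c69"
        + "67616c73685c4465736b746f705c4857" + "2e747874000b00000048656c6c6f2057"
        + "6f726c640000";

    static byte[] unhex(String s) {
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(s.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // 4-byte little-endian length, then that many bytes; a trailing NUL is dropped.
    static String lengthPrefixed(ByteBuffer b) {
        byte[] raw = new byte[b.getInt()];
        b.get(raw);
        int n = raw.length > 0 && raw[raw.length - 1] == 0 ? raw.length - 1 : raw.length;
        return new String(raw, 0, n, StandardCharsets.US_ASCII);
    }

    // NUL-terminated ASCII string.
    static String cString(ByteBuffer b) {
        StringBuilder sb = new StringBuilder();
        for (byte c = b.get(); c != 0; c = b.get()) {
            sb.append((char) (c & 0xff));
        }
        return sb.toString();
    }

    /** Returns {class name, label, payload}; the layout is a guess from one sample. */
    static String[] decode(byte[] bytes) {
        ByteBuffer b = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        b.getInt();                            // 0x00000501, possibly a version
        b.getInt();                            // 0x00000002, possibly a format id
        String className = lengthPrefixed(b);  // "Package"
        lengthPrefixed(b);                     // empty in this sample
        lengthPrefixed(b);                     // empty in this sample
        int dataSize = b.getInt();             // 0x66 = 102 bytes of data follow
        ByteBuffer data = b.slice().order(ByteOrder.LITTLE_ENDIAN);
        data.limit(dataSize);
        data.getShort();                       // 0x0002, meaning unknown
        String label = cString(data);          // "Hw.txt"
        cString(data);                         // source path
        data.getInt();                         // bytes 00 00 03 00, meaning unknown
        lengthPrefixed(data);                  // source path again, length-prefixed
        byte[] payload = new byte[data.getInt()];
        data.get(payload);                     // the embedded file's bytes
        return new String[] {className, label, new String(payload, StandardCharsets.US_ASCII)};
    }

    public static void main(String[] args) {
        String[] r = decode(unhex(HEX));
        System.out.println(r[0] + " / " + r[1] + " / " + r[2]);
    }
}
```

 The remaining bytes (from offset 0x86 on) look like a second header with the
 class name METAFILEPICT, presumably the icon shown for the embedded file; the
 sketch does not touch them.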



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-93) OCR support

2014-03-28 Thread Timo Boehme (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950487#comment-13950487
 ] 

Timo Boehme commented on TIKA-93:
-

Hi Anurag, which PDF are you referring to? Without knowing the size, page count 
and structure of the pages, it is hard to say what is going wrong. For instance, 
as I already wrote in my last comment, the pages could contain a large number 
of images (e.g. one per word or chunk) instead of a single one per page. Try 
printing the PDF to images (one per page) and running these through Tesseract.
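
The second half of that check could be scripted roughly as below. This is only 
a sketch: it assumes the page images already exist (one file per page, 
produced by whatever tool printed the PDF) and that the tesseract binary is on 
the PATH, and it relies on the plain "tesseract <image> <outputbase>" form, 
which writes <outputbase>.txt. The class name OcrCommandSketch, its methods and 
the ocr-out directory are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class OcrCommandSketch {
    /** Builds "tesseract <image> <outDir>/<image name without extension>". */
    static List<String> commandFor(Path pageImage, Path outDir) {
        String name = pageImage.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        List<String> cmd = new ArrayList<>();
        cmd.add("tesseract");
        cmd.add(pageImage.toString());
        cmd.add(outDir.resolve(base).toString());
        return cmd;
    }

    /** Runs the command and returns its exit code; outDir must already exist. */
    static int run(List<String> cmd) throws IOException, InterruptedException {
        return new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        Path outDir = Paths.get("ocr-out");   // hypothetical output directory
        for (String arg : args) {             // one argument per page image
            List<String> cmd = commandFor(Paths.get(arg), outDir);
            System.out.println(String.join(" ", cmd) + " -> exit " + run(cmd));
        }
    }
}
```

Timing each page separately this way should show whether a few pages dominate 
the run time.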

 OCR support
 ---

 Key: TIKA-93
 URL: https://issues.apache.org/jira/browse/TIKA-93
 Project: Tika
  Issue Type: New Feature
  Components: parser
Reporter: Jukka Zitting
Assignee: Chris A. Mattmann
Priority: Minor
 Fix For: 1.6

 Attachments: TIKA-93.patch, TIKA-93.patch, TIKA-93.patch, 
 TIKA-93.patch, TesseractOCRParser.patch, TesseractOCRParser.patch, 
 testOCR.docx, testOCR.pdf, testOCR.pptx


 I don't know of any decent open source pure Java OCR libraries, but there are 
 command line OCR tools like Tesseract 
 (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
 extract text content (where available) from image files.





Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Hi Jukka,
thanks a lot for your reply.

On #1, I am still wondering why we need structure information for indexing.
Is there any particular reason? Wouldn't it make more sense to get just the
text by default and only optionally get the structure?

On #2, I expected the code you presented would not work. And in fact the
pattern is quite odd, isn't it? What is the reason for throwing an
exception if limiting the text read is a legal use case? (I am asking just
to understand the background.)

Ste


On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting jukka.zitt...@gmail.com wrote:

 Hi,

 On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
 stefano.forn...@gmail.com wrote:
  1. Is the use of PDF2XHTML necessary? Why is the PDF turned into XHTML?
  For the purpose of indexing, wouldn't just the text be enough?

 The XHTML output allows us to annotate the extracted text with
 structural information (like "this is a heading", "here's a
 hyperlink", etc.) that would be difficult to express with text-only
 output. A client that needs just the text content can get it easily
 with the BodyContentHandler class.

  2. I need to limit the indexing of content to files whose size is below
  a certain threshold; I was wondering if this could be a parser
  configuration option and thus if you would accept this change.

 Do you want to entirely exclude too large files, or just index the
 first few pages of such files (which is more common in many indexing
 use cases)?

 The latter use case can be implemented with the writeLimit parameter of
 the WriteOutContentHandler class, like this:

     // Extract up to 100k characters from a given document
     WriteOutContentHandler out = new WriteOutContentHandler(100_000);
     try {
         parser.parse(..., new BodyContentHandler(out), ...);
     } catch (SAXException e) {
         // Rethrow unless the exception only signals that the limit was hit
         if (!out.isWriteLimitReached(e)) {
             throw e;
         }
     }
     String content = out.toString();

 BR,

 Jukka Zitting
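
 To illustrate the answer to #1 without depending on Tika itself, here is a
 minimal sketch using only the SAX classes that ship with the JDK. It assumes
 the parser output arrives as XHTML SAX events and shows that a handler can
 keep just the character data and ignore the structure, which is the role the
 reply above gives to BodyContentHandler. TextOnlySketch, TextCollector and the
 sample markup are hypothetical, and this is not how Tika implements it.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class TextOnlySketch {
    /** Keeps character data; element events (headings, links, ...) are ignored. */
    static final class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String textOf(String xhtml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1>"
            + "<p>See <a href=\"http://example.org/\">this link</a>.</p></body></html>";
        System.out.println(textOf(xhtml));
    }
}
```

 A client that wants the structure would instead override startElement and
 endElement on the same event stream, so producing XHTML events costs a
 text-only consumer very little.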



[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950461#comment-13950461
 ] 

Chris Bamford edited comment on TIKA-1010 at 3/28/14 9:49 AM:
--

Hi Tim

I think at least one of my test files uses a Package to wrap the object, so 
it should be useful.
I am also continuing to search for one containing a \pict with binary encoding.
Cheers

Chris









was (Author: bammers):
Hi Tim

I think at least one of my test files uses a Package to wrap the object, so 
should be useful.
I also am continuing to search for one containing a \pict with binary encoding.
Cheers

Chris

Chris Bamford
Senior Developer
m: +44 7860 405292
p: +44 207 847 8700
w: www.mimecast.com
Address click here: www.mimecast.com/About-us/Contact-us/







 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bamford updated TIKA-1010:


Attachment: (was: 114032807362001301.gif)

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bamford updated TIKA-1010:


Attachment: (was: 114032807362001001.gif)

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bamford updated TIKA-1010:


Attachment: (was: 114032807362000801.gif)

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bamford updated TIKA-1010:


Attachment: (was: 114032807362000901.gif)

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bamford updated TIKA-1010:


Attachment: (was: 114032807362001201.gif)

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Konstantin Gribov
I think you should have three info blocks: video streams, audio streams and
subtitles (if the container supports embedding them). Sort them naturally or by
vid/aid/sid if present.

You shouldn't multiplex video and audio streams, since any video stream can
be combined with any audio stream.

In terms of XML, you can have the container as the root element, which embeds
the streams grouped by type.
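
A rough sketch of that grouping, using the mplayer lines quoted further down
in this message as input. Everything here is illustrative: the class and
element names (StreamGroupingSketch, container, stream, index, codec) are
invented, the "subtitle" keyword in the pattern is a guess since only video
and audio lines appear in the sample, and nothing in it is Tika API.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamGroupingSketch {
    enum Kind { VIDEO, AUDIO, SUBTITLE }

    static final class Stream {
        final int index;
        final String codec;

        Stream(int index, String codec) {
            this.index = index;
            this.codec = codec;
        }
    }

    // Matches e.g. "[lavf] stream 1: audio (aac), -aid 0, -alang rus, ..."
    private static final Pattern LINE =
        Pattern.compile("stream (\\d+): (video|audio|subtitle) \\((\\w+)\\)");

    /** Groups streams by kind, keeping input order within each group. */
    static Map<Kind, List<Stream>> group(List<String> lines) {
        Map<Kind, List<Stream>> byKind = new EnumMap<>(Kind.class);
        for (Kind k : Kind.values()) {
            byKind.put(k, new ArrayList<>());
        }
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                Kind kind = Kind.valueOf(m.group(2).toUpperCase(Locale.ROOT));
                byKind.get(kind).add(new Stream(Integer.parseInt(m.group(1)), m.group(3)));
            }
        }
        return byKind;
    }

    /** Container as root element, one child element per stream kind. */
    static String toXml(Map<Kind, List<Stream>> byKind) {
        StringBuilder sb = new StringBuilder("<container>");
        for (Map.Entry<Kind, List<Stream>> e : byKind.entrySet()) {
            String tag = e.getKey().name().toLowerCase(Locale.ROOT);
            sb.append('<').append(tag).append('>');
            for (Stream s : e.getValue()) {
                sb.append("<stream index=\"").append(s.index)
                  .append("\" codec=\"").append(s.codec).append("\"/>");
            }
            sb.append("</").append(tag).append('>');
        }
        return sb.append("</container>").toString();
    }
}
```

Video and audio stay in separate groups, so no particular pairing of a video
stream with an audio stream is implied.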

-- 
Best regards,
Konstantin Gribov.
On 28.03.2014 at 1:29, Nick Burch apa...@gagravarr.org wrote:

 On Thu, 27 Mar 2014, Konstantin Gribov wrote:

 Some containers (like matroska/mkv) tag audio and subtitle streams with a
 language tag and some comment. From mplayer console output:

  [lavf] stream 0: video (h264), -vid 0
 [lavf] stream 1: audio (aac), -aid 0, -alang rus, Rus BaibaKo.tv
 [lavf] stream 2: audio (ac3), -aid 1, -alang eng, Eng


 Ogg + CMML would give something similar

  I don't know any established semantics for video streams but the first
 usually is default for playback.


 How should a Tika parser handle such a file though? Include the primary
 audio metadata with the video stream as the primary object, and report
 embedded items for the other audio streams? Report all as embedded items?
 Report the primary video stream as the main thing, and give all other video
 + audio as embedded items? Something else?

 Nick



Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
The exception is rethrown only if the write limit was not reached. So if an
exception occurs within the first 100k chars, it affects the result. If it is
thrown after that, it is suppressed.
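
As for why an exception is used at all: SAX handler callbacks return void, so
throwing is the only way a handler can make the parser stop early. Below is a
minimal sketch of the same pattern with JDK classes only. It is not Tika's
WriteOutContentHandler; WriteLimitSketch, LimitReached, LimitedText and
firstChars are hypothetical names, and the real class may differ in detail.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WriteLimitSketch {
    /** Marker type: lets the caller tell "limit reached" from a real failure. */
    static final class LimitReached extends SAXException {
        LimitReached() {
            super("write limit reached");
        }
    }

    /** Collects text and aborts the parse once the limit is hit. */
    static final class LimitedText extends DefaultHandler {
        private final int limit;
        final StringBuilder text = new StringBuilder();

        LimitedText(int limit) {
            this.limit = limit;
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            int room = limit - text.length();
            if (length > room) {
                text.append(ch, start, room);
                throw new LimitReached();   // only way to stop the parser from here
            }
            text.append(ch, start, length);
        }
    }

    /** True if the exception, or anything in its cause chain, is the marker. */
    static boolean isLimitReached(Throwable t) {
        for (; t != null; t = t.getCause()) {
            if (t instanceof LimitReached) {
                return true;
            }
        }
        return false;
    }

    static String firstChars(String xml, int limit) throws Exception {
        LimitedText handler = new LimitedText(limit);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            if (!isLimitReached(e)) {
                throw e;                    // a genuine parse error still propagates
            }
        }
        return handler.text.toString();
    }
}
```

The catch block has the same shape as the one in the quoted example: the
marker exception is swallowed and the text collected so far is returned, while
any other SAXException reaches the caller.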

-- 
Best regards,
Konstantin Gribov.
28.03.2014 13:32 пользователь Stefano Fornari stefano.forn...@gmail.com
написал:

 Hi Jukka,
 thanks a lot for your reply.

 On #1, I am still wondering why we need structure information for indexing.
 Is there any particular reason? Wouldn't it make more sense to get just the
 text by default and the structure only optionally?

 On #2, I expected the code you presented would not work. And in fact the
 pattern is quite odd, isn't it? What is the reason for throwing an
 exception if limiting the text read is a legal use case? (I am asking just
 to understand the background.)

 Ste


 On Thu, Mar 27, 2014 at 11:55 PM, Jukka Zitting jukka.zitt...@gmail.com
 wrote:

  Hi,
 
  On Thu, Mar 27, 2014 at 6:21 PM, Stefano Fornari
  stefano.forn...@gmail.com wrote:
   1. Is the use of PDF2XHTML necessary? Why is the PDF turned into
   XHTML? For the purpose of indexing, wouldn't just the text be enough?
 
  The XHTML output allows us to annotate the extracted text with
  structural information (like this is a heading, here's a
  hyperlink, etc.) that would be difficult to express with text-only
  output. A client that needs just the text content can get it easily
  with the BodyContentHandler class.
 
   2. I need to limit indexing of the content to files whose size is below
   a certain threshold; I was wondering if this could be a parser
   configuration option and thus if you would accept this change.
 
  Do you want to entirely exclude too large files, or just index the
  first few pages of such files (which is more common in many indexing
  use cases)?
 
  The latter use case can be implemented with the writeLimit parameter of
  the WriteOutContentHandler class, like this:
 
      // Extract up to 100k characters from a given document
      WriteOutContentHandler out = new WriteOutContentHandler(100_000);
      try {
          parser.parse(..., new BodyContentHandler(out), ...);
      } catch (SAXException e) {
          // Rethrow real parse failures; swallow only the write-limit signal
          if (!out.isWriteLimitReached(e)) {
              throw e;
          }
      }
      String content = out.toString();
 
  BR,
 
  Jukka Zitting
 



Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Yes, got it. Which is a strange use case: if I set the limit, first I would
not expect an exception (which represents an unexpected error condition);
secondly, I would not expect to rethrow it only under certain conditions. I
understood the trick, but I am trying to understand this is done in this
way (that at a first glance does not seem clean).


Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
On Fri, Mar 28, 2014 at 11:26 AM, Stefano Fornari stefano.forn...@gmail.com
 wrote:

 I understood the trick, but I am trying to understand this is done in this
 way (that at a first glance does not seem clean).

 ... trying to understand why this is done in this way...


Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
SAXException is a checked exception, so you have to either catch it or add it
to the method's throws list (otherwise javac won't compile the code). Tika
usually rethrows exceptions wrapped in a TikaException. In the code above,
the method throws SAXException.

The exception is suppressed so that the parse does not fail after a
valuable amount of data has already been extracted.

-- 
Best regards,
Konstantin Gribov.


Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Nick Burch

On Fri, 28 Mar 2014, Konstantin Gribov wrote:
I think you should have three info blocks: video streams, audio streams 
and subtitles (if container supports their embedding). Sort naturally or 
by vid/aid/sid if present.


That's not something Tika supports though. We have a metadata object we 
can populate with some things, or we can trigger for embedded objects. 
The Metadata object doesn't support nesting


Nick


Re: PDF parser (two more questions)

2014-03-28 Thread Stefano Fornari
Well, I should look at the code, which I can't do right now, but I guess my
point is that BodyContentHandler should not throw an exception (and most
probably not a SAXException in any case) when the limit is reached. This
means that the limit should not be put on the WriteOutContentHandler, but on
BodyContentHandler.

Ste




Re: PDF parser (two more questions)

2014-03-28 Thread Konstantin Gribov
All such handlers are implementations of the org.xml.sax.ContentHandler
interface, so their methods throw SAXException. But in the code above none of
the ContentHandler methods are invoked directly (only inside parser.parse, to
which the content handler is passed).

You can take a look at org.apache.tika.Tika.parseToString(InputStream,
Metadata, int) as a reference. It has code similar to Jukka's code above.


-- 
Best regards,
Konstantin Gribov.




Re: How should video files with audio be handled by parsers?

2014-03-28 Thread Konstantin Gribov
I was talking about the output to the content handler, not the metadata. How
to handle metadata for containers with several video streams is another
problem. The Tika metadata model seems somewhat weird to me, so I try not to
look at it too often =)

-- 
Best regards,
Konstantin Gribov.





[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950628#comment-13950628
 ] 

Tim Allison commented on TIKA-1010:
---

Chris,
  Thank you for digging into the spec and sharing test files.  For some reason, 
I can't find the gifs that JIRA reported you attaching earlier today.  Yes, I'm 
in search of a binary test file.  Please share one if you can find it.  I think 
I'm good on package files.  I'll attach my two main test files shortly.  I've 
tested on the files within the zip that you submitted earlier, and all is good.

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1010:
--

Attachment: testRTFRegularImages.rtf

This is an example of regular images -- pict -- not embedded data.

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf




[jira] [Updated] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-1010:
--

Attachment: testRTF_embbededFiles.zip

This is the test file I'll use to test poifs package and embedded object 
formatted data.

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, 
 testRTF_embbededFiles.zip




[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950714#comment-13950714
 ] 

Chris Bamford commented on TIKA-1010:
-

Hi Tim

Sorry about the confusion with the GIFs - they had nothing to do with the 
case!  They were part of my email footer, which Jira automatically attached to 
the ticket when I replied by email.
So I removed them.

Sounds like you're making great progress.  I will provide a binary pict file as 
soon as I can locate one.

Cheers,

- Chris

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, 
 testRTF_embbededFiles.zip




[jira] [Assigned] (TIKA-1244) Better parsing of Mbox files

2014-03-28 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen reassigned TIKA-1244:
--

Assignee: Hong-Thai Nguyen

 Better parsing of Mbox files
 

 Key: TIKA-1244
 URL: https://issues.apache.org/jira/browse/TIKA-1244
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.5
Reporter: Luis Filipe Nassif
Assignee: Hong-Thai Nguyen
 Attachments: MboxParser.java.patch


 MboxParser currently loses the metadata of all emails except the first. It does not 
 extract/parse emails, nor decode parts. It should handle embedded emails like 
 other container parsers do, so emails will be automatically parsed by 
 RFC822Parser. I will try to add a patch for this.
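
As a rough illustration of treating an mbox as a container, the sketch below splits the file text into individual messages at lines starting with "From ", so that each one could then be handed to an RFC 822 parser as an embedded document. It is not the attached patch: the class MboxSplitSketch is hypothetical, and quoting variants such as ">From " escaping are deliberately ignored.

```java
import java.util.ArrayList;
import java.util.List;

final class MboxSplitSketch {

    /** Splits mbox text into messages; the "From " separator lines are dropped. */
    static List<String> splitMessages(String mbox) {
        List<String> messages = new ArrayList<>();
        StringBuilder current = null;
        for (String line : mbox.split("\\r?\\n")) {
            if (line.startsWith("From ")) {
                if (current != null) {
                    messages.add(current.toString());
                }
                current = new StringBuilder(); // start collecting the next message
                continue;
            }
            if (current != null) {
                current.append(line).append('\n');
            }
        }
        if (current != null) {
            messages.add(current.toString());
        }
        return messages;
    }

    public static void main(String[] args) {
        String mbox = "From a@example.org Fri Mar 28 10:00:00 2014\n"
                + "Subject: one\n\nbody1\n"
                + "From b@example.org Fri Mar 28 11:00:00 2014\n"
                + "Subject: two\n\nbody2\n";
        List<String> messages = splitMessages(mbox);
        System.out.println(messages.size() + " messages");
    }
}
```

With each message isolated, its headers can become that message's own metadata instead of being lost after the first email.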





Re: metadata key for original file path?

2014-03-28 Thread Nick Burch

On Fri, 28 Mar 2014, Allison, Timothy B. wrote:
In working on TIKA-1010, there are some cases where the full original 
file path is stored with an image or embedded document. 
TikaMetadataKeys.RESOURCE_NAME_KEY should be used for file name 
(right?), but what should I use for file path?


I can only suggest looking at what the zip (+ other archive formats) code 
does, that should be a good guide to embedded resources where we know the 
name of the resource


Nick


[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951009#comment-13951009
 ] 

Chris Bamford commented on TIKA-1010:
-

Hi again Tim

Dunno if this helps, but there is a generic RTF parser kit on GitHub 
(https://github.com/joniles/rtfparserkit) which knows how to navigate RTFs.
I'm playing with it now - it doesn't do much except parse but it might provide 
some insight?

Cheers,

- Chris

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, 
 testRTF_embbededFiles.zip




[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004
 ] 

Tim Allison edited comment on TIKA-1010 at 3/28/14 4:47 PM:


Chris,
  Thanks for pointing that out.  The objdata in logo.rtf is of type pbrush 
(bitmap) but it is encoded in the regular hexpairs.  I just added handling for 
that. 
  The thumbnail/result/metafile is binary, but I've chosen not to extract 
thumbnails/emf or other meta-embeddings.
  This ok?
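
For reference, decoding data that is "encoded in the regular hexpairs" amounts to turning each pair of hex digits into one byte while skipping the line breaks and spaces an RTF writer may insert. The sketch below shows only that step; the class HexPairSketch is hypothetical and is not the handling added to the parser.

```java
import java.io.ByteArrayOutputStream;

final class HexPairSketch {

    /** Decodes hex pairs into bytes, ignoring any whitespace between digits. */
    static byte[] decodeHexPairs(String hex) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 when none is pending
        for (int i = 0; i < hex.length(); i++) {
            char c = hex.charAt(i);
            if (Character.isWhitespace(c)) {
                continue;
            }
            int nibble = Character.digit(c, 16);
            if (nibble < 0) {
                throw new IllegalArgumentException("not a hex digit: " + c);
            }
            if (high < 0) {
                high = nibble;
            } else {
                out.write((high << 4) | nibble);
                high = -1;
            }
        }
        if (high >= 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = decodeHexPairs("5061\n636b 61676500");
        System.out.println(bytes.length + " bytes decoded");
    }
}
```

Data introduced by \bin would not go through this step, since it is raw bytes rather than hex text.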




was (Author: talli...@mitre.org):
Chris,
  Thanks for pointing that out.  The objdata in logo.rtf is of type pbrush 
(bitmap) but it is encoded in the regular hexpairs.  I just added handling for 
that.

I'm not sure that that is what is meant by the binary pict type you pointed 
out in the spec.

{noformat}An RTF file can include pictures created with other applications. 
These pictures can be in hexadecimal (the default) or binary format. Pictures 
are destinations, and begin with the \pict control word. The \pict keyword is 
preceded by \*\shppict destination control keyword as described in the 
following example. A picture destination has the following syntax:

<pict>          '{' \pict (<brdr>? & <shading>? & <picttype> & <pictsize> & <metafileinfo>?) <data> '}'
<picttype>      \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile |
                \wmetafile | \dibitmap <bitmapinfo> | \wbitmap <bitmapinfo>
<bitmapinfo>    \wbmbitspixel & \wbmplanes & \wbmwidthbytes
<pictsize>      (\picw & \pich) \picwgoal? & \pichgoal? \picscalex? & \picscaley? &
                \picscaled? & \piccropt? & \piccropb? & \piccropr? & \piccropl?
<metafileinfo>  \picbmp & \picbpp
<data>          (\bin #BDATA) | #SDATA
{noformat}

My guess from that is that we'd see something like:
{noformat}
{pict ...\bin 0101000100010001
{noformat}


 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, 
 testRTF_embbededFiles.zip



[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004
 ] 

Tim Allison commented on TIKA-1010:
---

Chris,
  Thanks for pointing that out.  The objdata in logo.rtf is of type pbrush 
(bitmap), but it is encoded as regular hex pairs.  I just added handling for 
that.
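
Decoding the regular hex-pair form is the easy case. As a minimal Python sketch (the function name is only illustrative, and this is not the Tika code): drop the spaces and line breaks that RTF writers put between pairs, then convert two digits at a time.

```python
def decode_hex_pairs(text):
    """Decode RTF hex-pair data, ignoring whitespace and line breaks between pairs."""
    digits = "".join(text.split())
    if len(digits) % 2:
        raise ValueError("odd number of hex digits")
    return bytes.fromhex(digits)
```

Fed the pairs `48656c6c6f20576f726c64` from the dump in the issue description, this should give back the embedded file's content bytes.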

I'm not sure that is what the spec means by the binary pict format you 
pointed out.

{noformat}An RTF file can include pictures created with other applications. 
These pictures can be in hexadecimal (the default) or binary format. Pictures 
are destinations, and begin with the \pict control word. The \pict keyword is 
preceded by \*\shppict destination control keyword as described in the 
following example. A picture destination has the following syntax:

<pict>         '{' \pict (<brdr>? & <shading>? & <picttype> & <pictsize> & 
               <metafileinfo>?) <data> '}'
<picttype>     \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | 
               \wmetafile | \dibitmap <bitmapinfo> | \wbitmap <bitmapinfo>
<bitmapinfo>   \wbmbitspixel & \wbmplanes & \wbmwidthbytes
<pictsize>     (\picw & \pich) \picwgoal? & \pichgoal? \picscalex? & 
               \picscaley? & \picscaled? & \piccropt? & \piccropb? & 
               \piccropr? & \piccropl?
<metafileinfo> \picbmp & \picbpp
<data>         (\bin #BDATA) | #SDATA
{noformat}

My guess from that is that we'd see something like:
{noformat}
{pict {objdata \bin 0101000100010001
{noformat}



[jira] [Comment Edited] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951004#comment-13951004
 ] 

Tim Allison edited comment on TIKA-1010 at 3/28/14 4:44 PM:


Chris,
  Thanks for pointing that out.  The objdata in logo.rtf is of type pbrush 
(bitmap), but it is encoded as regular hex pairs.  I just added handling for 
that.

I'm not sure that is what the spec means by the binary pict format you 
pointed out.

{noformat}An RTF file can include pictures created with other applications. 
These pictures can be in hexadecimal (the default) or binary format. Pictures 
are destinations, and begin with the \pict control word. The \pict keyword is 
preceded by \*\shppict destination control keyword as described in the 
following example. A picture destination has the following syntax:

<pict>         '{' \pict (<brdr>? & <shading>? & <picttype> & <pictsize> & 
               <metafileinfo>?) <data> '}'
<picttype>     \emfblip | \pngblip | \jpegblip | \macpict | \pmmetafile | 
               \wmetafile | \dibitmap <bitmapinfo> | \wbitmap <bitmapinfo>
<bitmapinfo>   \wbmbitspixel & \wbmplanes & \wbmwidthbytes
<pictsize>     (\picw & \pich) \picwgoal? & \pichgoal? \picscalex? & 
               \picscaley? & \picscaled? & \piccropt? & \piccropb? & 
               \piccropr? & \piccropl?
<metafileinfo> \picbmp & \picbpp
<data>         (\bin #BDATA) | #SDATA
{noformat}

My guess from that is that we'd see something like:
{noformat}
{pict ...\bin 0101000100010001
{noformat}





[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951097#comment-13951097
 ] 

Chris Bamford commented on TIKA-1010:
-

The binary actually looks like this:
{noformat}
{pict {objdata \bin270141 .
{noformat}
The 270141 after \bin is the number of raw bytes that follow (the size of the 
blob).  Note that the blob can (and in this case does!) contain '}' bytes; 
these are binary data, not group-end markers.
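
The point about '}' can be made concrete with a small Python sketch. It assumes the \bin control word is found by a plain search, which is only good enough for illustration: a real parser would meet \bin while tokenizing, so that a stray "\bin" inside earlier binary data is never misread. The function name is hypothetical and this is not the Tika implementation.

```python
import re

# Matches the \bin control word, its decimal byte count, and the single
# delimiter space that ends the control word, e.g. b"\\bin270141 ".
BIN_WORD = re.compile(rb"\\bin(\d+) ?")


def read_bin_payload(rtf, start=0):
    """Return (payload, position after payload) for the first \\bin at or after start.

    Returns None if no \\bin control word is found.
    """
    match = BIN_WORD.search(rtf, start)
    if match is None:
        return None
    count = int(match.group(1))
    begin = match.end()  # first raw byte of the blob
    end = begin + count
    if end > len(rtf):
        raise ValueError("\\bin count runs past the end of the input")
    # Take exactly `count` bytes; braces and backslashes inside are plain data.
    return rtf[begin:end], end
```

The caller resumes normal RTF parsing at the returned position, so any '{' or '}' inside the blob never reaches the group-nesting logic.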

 Embedded documents in RTF are not extracted
 ---

 Key: TIKA-1010
 URL: https://issues.apache.org/jira/browse/TIKA-1010
 Project: Tika
  Issue Type: Bug
  Components: parser
Reporter: Michael McCandless
Assignee: Tim Allison
 Attachments: ExampleRTFs.zip, outer.rtf, testRTFRegularImages.rtf, 
 testRTF_embbededFiles.zip


 When an RTF doc embeds a doc it looks like this:
 {noformat}
 {\object\objemb
 \objw628\objh765{\*\objclass Package}{\*\objdata 
 0105020008005061636b616765006600
 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
 5404bbfaee00080054044505
 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
 {noformat}
 But, unfortunately, the format of those hex bytes is not spelled out
 in the RTF spec ... the spec merely says the bytes are saved by the
 OLESaveToStream function ... and I haven't been able to find a
 description of what the bytes mean.
 In this case they are a Package object (\objclass Package), which I
 think is an [old?] way to wrap any non-OLE file (this is just a .txt
 file).
 Here's the hex dump:
 {noformat}
 0000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
 0010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
 0020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
 0030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
 0040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
 0050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
 0060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
 0070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
 0080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
 0090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
 00a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
 00b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
 00c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
 00d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
 00e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
 00f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
 0100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
 0110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
 0120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
 0130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
 0140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
 0150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
 0160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
 0170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
 0180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
 0190  01 01 00 03 00 00 00 00  00                       |.........|
 0199
 {noformat}
 Anyway I have no idea how to decode the bytes at this point ... just
 opening the issue in case anyone else does!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951092#comment-13951092
 ] 

Chris Bamford commented on TIKA-1010:
-

Ideally I'd like to be able to extract any file, but let's get the main cases 
covered off first!




[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951107#comment-13951107
 ] 

Tim Allison commented on TIKA-1010:
---

Yes, thanks, I got that.  I can add an "extract all" mode alongside "extract 
logical".  I'll set the default to "extract logical" unless there are 
objections from the community.

I now have success against the two files I posted earlier today.

When I add the "extract all" vs. "extract logical" parameter, I can use Mike's 
testBinControlWord (from TIKA-782) as a test.



[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951116#comment-13951116
 ] 

Tim Allison commented on TIKA-1010:
---

As a side note, I can grab file names for:
 1) images that are regular images (not embedded objdata)
 2) non-POIFS embedded files (html, embedded objdata images, zip, msg)

I can't find file names for:
xls, xlsx, doc, docx, ppt, pptx, pdf



How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread eShard
Good afternoon,
I already asked this question in the solr-user forum and I didn't get
anywhere.
They suggested I ask the Tika community...
I'm using Solr 4.0 Final.
I have movies hidden in zip files that need to be excluded from the index.
I can't filter movies on the crawler because then I would have to exclude
all zip files.

I was told I can have Tika skip the movies, but
the details are escaping me at this point.

How do I exclude a file in the Tika configuration?
I assume it's something I add in the update/extract handler but I'm not
sure.

Thanks, 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-exclude-a-mimetype-form-being-indexed-in-solr-using-tika-tp4127767.html
Sent from the Apache Tika - Development mailing list archive at Nabble.com.


[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-03-28 Thread Chris Bamford (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13950835#comment-13950835
 ] 

Chris Bamford commented on TIKA-1010:
-

Hi Tim

I have found one - please see https://issues.apache.org/jira/browse/TIKA-782 
(logo.rtf inside logo.zip).

Best,

- Chris



Re: How to exclude a mimetype from being indexed in solr using tika?

2014-03-28 Thread Nick Burch

On Fri, 28 Mar 2014, eShard wrote:

> I'm using Solr 4.0 Final
> I have movies hidden in zip files that need to be excluded from the index.
> I can't filter movies on the crawler because then I would have to exclude
> all zip files.


If you're calling Tika directly, this is very easy. When Tika hits an 
embedded resource, it calls out to your code, and you can decide at that 
point whether to process or ignore it.


(This is all done via an EmbeddedDocumentExtractor, which you supply on 
the ParseContext)
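
To illustrate the pattern only: in Tika the decision is made in Java, in the extractor's shouldParseEmbedded(Metadata) method. The Python sketch below is a stand-in that uses the standard zipfile and mimetypes modules rather than Tika, and its function names are hypothetical. The container walker asks a caller-supplied predicate about each embedded entry before handing it on.

```python
import io
import mimetypes
import zipfile


def should_process(name):
    """Decide per embedded entry; here, skip anything that looks like video."""
    guessed, _encoding = mimetypes.guess_type(name)
    return not (guessed or "").startswith("video/")


def walk_zip(data, accept=should_process):
    """Yield (name, content) for each accepted entry of an in-memory zip archive."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not accept(info.filename):
                continue  # rejected entries are never read or indexed
            yield info.filename, archive.read(info)
```

This sketch guesses the type from the file name alone; detecting it from the content would be more reliable when names are missing or misleading.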


> How do I exclude a file in the Tika configuration? I assume it's 
> something I add in the update/extract handler but I'm not sure.


I've no idea how, or if, you can tell the Solr code to ask Tika to do that. 
That's something you'll have to take back to the Solr community, as they 
maintain that code.


Nick