date:20140416

[jira] [Created] (TIKA-1275) Upgrade Commons compress (to 1.9)

2014-04-16 Thread Fabian Lange (JIRA)

Fabian Lange created TIKA-1275:
--

             Summary: Upgrade Commons compress (to 1.9)
                 Key: TIKA-1275
                 URL: https://issues.apache.org/jira/browse/TIKA-1275
             Project: Tika
          Issue Type: Bug
            Reporter: Fabian Lange


Hi,
I am using Tika to detect content, including content inside archives. Because the raw input 
stream is a CipherInputStream, I ran into 
https://issues.apache.org/jira/browse/COMPRESS-277
which the Commons Compress team kindly fixed for me.
To be able to use Tika without patching my stack, I would like to see 
Commons Compress upgraded to 1.9 as soon as it is released.
This may or may not fall within the 1.6 timeframe.

Thanks!




--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (TIKA-1274) ENVI header parser

2014-04-16 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972315#comment-13972315
 ] 

Chris A. Mattmann commented on TIKA-1274:
-

Thanks for attaching the ENVI parser, Annie! Nick, great comments, perfect. 
Annie, if you need any help here I'd be happy to help commit the work, of course 
crediting you along the way. Feel free to use Review Board too: add a 
patch there (http://reviews.apache.org/) and select the Tika group. Also, if you 
are so inclined, you can use GitHub and just submit a pull request (which in 
turn will send an email message with a link to your patch to the dev list).

Thanks!

> ENVI header parser
> --
>
> Key: TIKA-1274
> URL: https://issues.apache.org/jira/browse/TIKA-1274
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.5
>Reporter: Ann Burgess
>  Labels: mime, newbie, parser, patch
>
> I have written a parser that extracts text and metadata from ENVI header 
> files, currently called at the command line as: 
> abryant:tika abryant$ java -classpath 
> annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
> org.apache.tika.cli.TikaCLI --metadata MOD09GA_test_header.hdr
>Content-Encoding: ISO-8859-1
>Content-Length: 818
>Content-Type: application/envi.hdr
>resourceName: MOD09GA_test_header.hdr
> abryant:tika abryant$ java -classpath 
> annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
> org.apache.tika.cli.TikaCLI --text MOD09GA_test_header.hdr
> ENVI
> description = {
>   GEO-TIFF File Imported into ENVI [Fri May 25 14:06:23 2012]}
> samples = 2400
> lines   = 2400
> bands   = 7
> header offset = 0
> file type = ENVI Standard
> data type = 2
> interleave = bip
> sensor type = Unknown
> byte order = 0
> map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 
> 4.6331271653e+02, 4.6331271653e+02, , units=Meters}
> projection info = {16, 6371007.2, 0.00, 0.0, 0.0, Sinusoidal, 
> units=Meters}
> coordinate system string = 
> {PROJCS["Sinusoidal",GEOGCS["GCS_ELLIPSE_BASED_1",DATUM["D_ELLIPSE_BASED_1",SPHEROID["S_ELLIPSE_BASED_1",6371007.181,0.0]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Sinusoidal"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",0.0],UNIT["Meter",1.0]]}
> wavelength units = Unknown
> __
> As I am not currently a certified committer, could someone enlighten me as to 
> the steps needed to submit this new parser for review?
> The parser is located in my directory structure as: 
> /users/annbryant/tika/tika/anniedev/src/main/java/edu/usc/sunset/abburgess/tika/EnviFileReader.class
> My custom mimetypes.xml file is located at: 
> /Users/annbryant/TIKA/tika/anniedev/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml




[jira] [Commented] (TIKA-1274) ENVI header parser

2014-04-16 Thread Nick Burch (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971978#comment-13971978
 ] 

Nick Burch commented on TIKA-1274:
--

If these were changes to existing files, we'd need a patch file for the changes 
to review.

As it's all new files, what we'd need attaching to the ticket are:
 * The custom-mimetypes file that defines your new format
 * The parser java file(s)
 * A sample ENVI header file
 * A unit test file that tests the detection and parsing
 * Details of any new dependencies (if any)

For general advice on contributing, patches, tests etc, the Apache Nutch 
project has some good wiki pages describing all of that, most of which will 
apply equally to Apache Tika too:
 * https://wiki.apache.org/nutch/HowToContribute
 * https://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

Another good source is the ComDev (Apache Community Development) site - pick 
"For Contributors" from the menu and look through the pages in that section.

For an example of a simple Tika parser plus a simple Tika parser unit test, I can 
suggest the VorbisParser from late 2011, when it largely only supported the one 
file format (Ogg Vorbis), before additional Ogg-based formats were added in. You can 
see that at something like 
https://github.com/Gagravarr/VorbisJava/tree/f6d20407477011735c16daf947635f1b67e14660/tika


[jira] [Created] (TIKA-1274) ENVI header parser

2014-04-16 Thread Ann Burgess (JIRA)

Ann Burgess created TIKA-1274:
-

             Summary: ENVI header parser
                 Key: TIKA-1274
                 URL: https://issues.apache.org/jira/browse/TIKA-1274
             Project: Tika
          Issue Type: New Feature
          Components: parser
    Affects Versions: 1.5
            Reporter: Ann Burgess


I have written a parser that extracts text and metadata from ENVI header files, 
currently called at the command line as: 

abryant:tika abryant$ java -classpath 
annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI --metadata MOD09GA_test_header.hdr

   Content-Encoding: ISO-8859-1
   Content-Length: 818
   Content-Type: application/envi.hdr
   resourceName: MOD09GA_test_header.hdr

abryant:tika abryant$ java -classpath 
annie-envi-parser.jar:tika-app/target/tika-app-1.6-SNAPSHOT.jar 
org.apache.tika.cli.TikaCLI --text MOD09GA_test_header.hdr

ENVI
description = {
  GEO-TIFF File Imported into ENVI [Fri May 25 14:06:23 2012]}
samples = 2400
lines   = 2400
bands   = 7
header offset = 0
file type = ENVI Standard
data type = 2
interleave = bip
sensor type = Unknown
byte order = 0
map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 
4.6331271653e+02, 4.6331271653e+02, , units=Meters}
projection info = {16, 6371007.2, 0.00, 0.0, 0.0, Sinusoidal, units=Meters}
coordinate system string = 
{PROJCS["Sinusoidal",GEOGCS["GCS_ELLIPSE_BASED_1",DATUM["D_ELLIPSE_BASED_1",SPHEROID["S_ELLIPSE_BASED_1",6371007.181,0.0]],PRIMEM["Greenwich",0.0],UNIT["Degree",0.0174532925199433]],PROJECTION["Sinusoidal"],PARAMETER["False_Easting",0.0],PARAMETER["False_Northing",0.0],PARAMETER["Central_Meridian",0.0],UNIT["Meter",1.0]]}
wavelength units = Unknown

__

As I am not currently a certified committer, could someone enlighten me as to 
the steps needed to submit this new parser for review?

The parser is located in my directory structure as: 
/users/annbryant/tika/tika/anniedev/src/main/java/edu/usc/sunset/abburgess/tika/EnviFileReader.class

My custom mimetypes.xml file is located at: 
/Users/annbryant/TIKA/tika/anniedev/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml
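
For readers unfamiliar with the format, the `--text` output above suggests that an ENVI header is an `ENVI` marker line followed by `key = value` pairs, where a value in braces may span several lines. The following is only a minimal Python sketch of reading that structure. It is not the parser attached to this issue, and the names `parse_envi_header` and `SAMPLE` are hypothetical.

```python
# Sketch only: assumes "key = value" lines and brace-delimited multi-line values,
# as inferred from the sample output in this message.
SAMPLE = """ENVI
description = {
  GEO-TIFF File Imported into ENVI [Fri May 25 14:06:23 2012]}
samples = 2400
lines   = 2400
bands   = 7
interleave = bip
"""


def parse_envi_header(text):
    """Return a dict mapping lower-cased field names to raw string values."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "ENVI":
        raise ValueError("not an ENVI header: first line must be 'ENVI'")
    fields = {}
    key = None
    parts = []
    depth = 0  # greater than 0 while inside an unclosed { ... } value
    for line in lines[1:]:
        if depth == 0:
            if "=" not in line:
                continue  # skip blank or unrecognised lines
            key, _, value = line.partition("=")
            key = key.strip().lower()
            parts = [value.strip()]
        else:
            parts.append(line.strip())  # continuation of a braced value
        depth += line.count("{") - line.count("}")
        if depth <= 0:
            depth = 0
            fields[key] = " ".join(p for p in parts if p)
    return fields


if __name__ == "__main__":
    for name, value in parse_envi_header(SAMPLE).items():
        print(name, "=>", value)
```

Values are kept as raw strings here. Converting `samples`, `lines`, and `bands` to integers, or splitting braced lists, would be a separate step.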







[jira] [Resolved] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-16 Thread Tim Allison (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1010.
---

Resolution: Fixed

r1588005

Many thanks to [~cbamford] for testing and submitting test documents!

Many thanks to Simon Mourier for: 
http://stackoverflow.com/questions/14779647/extract-embedded-image-object-in-rtf

Chris, let me know if there are any surprises in the few mods I made since I 
published the first draft of the patch.

> Embedded documents in RTF are not extracted
> ---
>
> Key: TIKA-1010
> URL: https://issues.apache.org/jira/browse/TIKA-1010
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
>Assignee: Tim Allison
> Attachments: ExampleRTFs.zip, TIKA-1010.patch, TIKA-1010_patch.zip, 
> outer.rtf, testRTFRegularImages.rtf, testRTF_embbededFiles.zip, 
> xls_attachment_example.zip
>
>
> When an RTF doc embeds a doc it looks like this:
> {noformat}
> {\object\objemb
> \objw628\objh765{\*\objclass Package}{\*\objdata 
> 0105020008005061636b616765006600
> 020048772e74787400433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e7478740003002200433a5c444f43554d457e315c6967616c73685c4465736b746f705c48572e747874000b0048656c6c6f20576f726c64010505000d004d45544146494c455049435400
> 5404bbfaee00080054044505
> 01000903730002001c0005000b0205000c02320029001c00fb02f5ff900100015461686f6d6155170a7000fc070058b1f37761b1f3772040f5774936668304002d010500090205000102ff000500
> 0201010005002e01060009002105060048772e747874210015001c00fb021700bc020102022253797374656d493666830a0026008a018cfc070004002d0101000300}
> {noformat}
> But, unfortunately, the format of those hex bytes is not spelled out
> in the RTF spec ... the spec merely says the bytes are saved by the
> OLESaveToStream function ... and I haven't been able to find a
> description of what the bytes mean.
> In this case they are a "Package object" (\objclass Package), which I
> think is an [old?] way to wrap any non-OLE file (this is just a .txt
> file).
> Here's the hex dump:
> {noformat}
> 00000000  01 05 00 00 02 00 00 00  08 00 00 00 50 61 63 6b  |............Pack|
> 00000010  61 67 65 00 00 00 00 00  00 00 00 00 66 00 00 00  |age.........f...|
> 00000020  02 00 48 77 2e 74 78 74  00 43 3a 5c 44 4f 43 55  |..Hw.txt.C:\DOCU|
> 00000030  4d 45 7e 31 5c 69 67 61  6c 73 68 5c 44 65 73 6b  |ME~1\igalsh\Desk|
> 00000040  74 6f 70 5c 48 57 2e 74  78 74 00 00 00 03 00 22  |top\HW.txt....."|
> 00000050  00 00 00 43 3a 5c 44 4f  43 55 4d 45 7e 31 5c 69  |...C:\DOCUME~1\i|
> 00000060  67 61 6c 73 68 5c 44 65  73 6b 74 6f 70 5c 48 57  |galsh\Desktop\HW|
> 00000070  2e 74 78 74 00 0b 00 00  00 48 65 6c 6c 6f 20 57  |.txt.....Hello W|
> 00000080  6f 72 6c 64 00 00 01 05  00 00 05 00 00 00 0d 00  |orld............|
> 00000090  00 00 4d 45 54 41 46 49  4c 45 50 49 43 54 00 54  |..METAFILEPICT.T|
> 000000a0  04 00 00 bb fa ff ff ee  00 00 00 08 00 54 04 45  |.............T.E|
> 000000b0  05 00 00 01 00 09 00 00  03 73 00 00 00 02 00 1c  |.........s......|
> 000000c0  00 00 00 00 00 05 00 00  00 0b 02 00 00 00 00 05  |................|
> 000000d0  00 00 00 0c 02 32 00 29  00 1c 00 00 00 fb 02 f5  |.....2.)........|
> 000000e0  ff 00 00 00 00 00 00 90  01 00 00 00 01 00 00 00  |................|
> 000000f0  00 54 61 68 6f 6d 61 00  00 55 17 0a 70 00 fc 07  |.Tahoma..U..p...|
> 00000100  00 58 b1 f3 77 61 b1 f3  77 20 40 f5 77 49 36 66  |.X..wa..w @.wI6f|
> 00000110  83 04 00 00 00 2d 01 00  00 05 00 00 00 09 02 00  |.....-..........|
> 00000120  00 00 00 05 00 00 00 01  02 ff ff ff 00 05 00 00  |................|
> 00000130  00 02 01 01 00 00 00 05  00 00 00 2e 01 06 00 00  |................|
> 00000140  00 09 00 00 00 21 05 06  00 48 77 2e 74 78 74 21  |.....!...Hw.txt!|
> 00000150  00 15 00 1c 00 00 00 fb  02 10 00 07 00 00 00 00  |................|
> 00000160  00 bc 02 00 00 00 00 01  02 02 22 53 79 73 74 65  |.........."Syste|
> 00000170  6d 00 00 49 36 66 83 00  00 0a 00 26 00 8a 01 00  |m..I6f.....&....|
> 00000180  00 00 00 ff ff ff ff 8c  fc 07 00 04 00 00 00 2d  |...............-|
> 00000190  01 01 00 03 00 00 00 00  00                       |.........|
> 00000199
> {noformat}
> Anyway I have no idea how to decode the bytes at this point ... just
> opening the issue in case anyone else does!
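
As a rough illustration, the first 0x86 bytes of the hex dump above can be read as little-endian 32-bit lengths followed by strings. The Python sketch below does that. The field names and the layout are inferred from the dump itself; they appear consistent with an OLE 1.0 embedded-object header, but that is an assumption, and this is not necessarily how the fix committed in r1588005 works. All function names are hypothetical.

```python
import struct

# Bytes 0x00-0x85 copied from the hex dump in this issue.
DUMP = bytes.fromhex(
    "01 05 00 00 02 00 00 00 08 00 00 00 50 61 63 6b "
    "61 67 65 00 00 00 00 00 00 00 00 00 66 00 00 00 "
    "02 00 48 77 2e 74 78 74 00 43 3a 5c 44 4f 43 55 "
    "4d 45 7e 31 5c 69 67 61 6c 73 68 5c 44 65 73 6b "
    "74 6f 70 5c 48 57 2e 74 78 74 00 00 00 03 00 22 "
    "00 00 00 43 3a 5c 44 4f 43 55 4d 45 7e 31 5c 69 "
    "67 61 6c 73 68 5c 44 65 73 6b 74 6f 70 5c 48 57 "
    "2e 74 78 74 00 0b 00 00 00 48 65 6c 6c 6f 20 57 "
    "6f 72 6c 64 00 00"
)


def read_u32(buf, pos):
    """Read a little-endian unsigned 32-bit integer; return (value, new_pos)."""
    return struct.unpack_from("<I", buf, pos)[0], pos + 4


def read_len_prefixed(buf, pos):
    """Read a 4-byte length followed by that many bytes."""
    n, pos = read_u32(buf, pos)
    return buf[pos:pos + n], pos + n


def read_cstring(buf, pos):
    """Read a NUL-terminated string; return (text, position after the NUL)."""
    end = buf.index(b"\x00", pos)
    return buf[pos:end].decode("latin-1"), end + 1


def decode_package(buf):
    # Outer header (assumed layout): version, format id, class name,
    # two empty length-prefixed strings, then the native data size.
    pos = 0
    version, pos = read_u32(buf, pos)
    format_id, pos = read_u32(buf, pos)
    class_name, pos = read_len_prefixed(buf, pos)
    _, pos = read_len_prefixed(buf, pos)
    _, pos = read_len_prefixed(buf, pos)
    native_size, pos = read_u32(buf, pos)
    native = buf[pos:pos + native_size]

    # Package payload (assumed layout): 2-byte marker, label, path,
    # 4 bytes of unknown meaning, length-prefixed path, length-prefixed content.
    p = 2
    label, p = read_cstring(native, p)
    path, p = read_cstring(native, p)
    p += 4
    _, p = read_len_prefixed(native, p)
    content, p = read_len_prefixed(native, p)
    return {
        "version": version,
        "format_id": format_id,
        "class_name": class_name.rstrip(b"\x00").decode("latin-1"),
        "native_size": native_size,
        "label": label,
        "path": path,
        "content": content,
    }


if __name__ == "__main__":
    print(decode_package(DUMP))
```

On this dump the sketch recovers the class name `Package`, the label `Hw.txt`, and the 11-byte content `Hello World`. The bytes after offset 0x86 (the `METAFILEPICT` block) are not decoded here.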




[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-16 Thread Tim Allison (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13971622#comment-13971622
 ] 

Tim Allison commented on TIKA-1010:
---

Great to hear. Thank you for your help in submitting test documents and 
offering feedback! I'll commit a slightly updated patch tonight or tomorrow. 
I'd recommend asking on the tika-users list about plans for 1.6 or whether there is 
a nightly build option through Maven. I know that the "nightly" Jenkins build 
has not been working so well.


[jira] [Commented] (TIKA-1010) Embedded documents in RTF are not extracted

2014-04-16 Thread Chris Bamford (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-1010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970597#comment-13970597
 ] 

Chris Bamford commented on TIKA-1010:
-

Tim

I have done a lot of testing now and am very happy with the new functionality.
Assuming others have no objections, when could it be made available in a Maven 
release?

Cheers,

- Chris

