Rainer,

Yes, this sounds like something to investigate. However, in my app no assumptions can be made about the data; it can be anything. Embeddings may be of any type, so I'm fishing for a generic solution.
- Dmitry

----- Original Message -----
From: Rainer Schwarze <[EMAIL PROTECTED]>
To: POI Users List <[email protected]>
Sent: Thu Aug 28 18:49:17 2008
Subject: Re: How to extract embedded files from Office 07

Dmitry Goldenberg wrote:
> Yegor,
>
> The first 8 bytes contain the standard MS Office magic number stuff - d0 cf
> 11 e0 a1 b1 1a e1.
>
> Seems like they compress data in a proprietary way. I've read one post where
> someone recommended the .NET Packaging API to crack these ... Not a good
> option ...

Hi Dmitry,

this may be interesting (unless you already found it):

http://www.nabble.com/Can-POIFS-convert-PDF-to-OLE-td18568081.html

Looking at such things I suspect this:

The data is inside "Ole10Native". This could be extracted using POIFS.
The structures there look like this:

[4 bytes] = size of structure including data
[???] a few flags and strings (zero terminated)
[4 bytes] = size of actually embedded binary data
[???] = the actual binary data

If you know that it is a ZIP file, you could search for a byte sequence
[size]"PK", where [size] depends on the search position. Assume you
start immediately after the first 4 bytes for total length, then the
size value is length-4. Step further by one byte and check for the
sequence with size set to length-5 a.s.o. When the 6 bytes match the
expected [size]PK sequence, you can be somewhat sure, that "PK"
represents the start of the ZIP file and [size] is its size.

Of course nothing beats the analysis of the actual binary data structure
:-)
(Would this be worth the effort for your purpose?)

Best wishes,
Rainer
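
The [size]"PK" search described above could be sketched roughly as follows. This is only an illustration in Python on the raw bytes of the "Ole10Native" stream, however they were obtained (for example via POIFS). The function name find_embedded_payload and the magic parameter are made up for this sketch, and it assumes that the size fields are 4-byte little-endian integers, that the first field counts the bytes following it, and that the embedded data runs to the end of the structure. None of this has been checked against real Office 07 files.

```python
import struct


def find_embedded_payload(stream: bytes, magic: bytes = b"PK"):
    """Return (offset, payload) for the first [size][magic] match, or None.

    Hypothetical helper: `stream` is the raw content of an "Ole10Native"
    stream. `magic` is the signature expected at the start of the embedded
    data (b"PK" for a ZIP file).
    """
    if len(stream) < 8:
        return None

    # Assumption: the first 4 bytes hold the total length of what follows,
    # stored as a little-endian unsigned integer.
    (total,) = struct.unpack_from("<I", stream, 0)

    # Slide a candidate position for the inner size field over the stream.
    for pos in range(4, len(stream) - 4 - len(magic) + 1):
        # If the size field sat at `pos`, this many bytes would remain
        # after it: total - 4 at pos 4, total - 5 at pos 5, and so on.
        expected = total - pos
        if expected < len(magic):
            break
        if stream[pos:pos + 4] != struct.pack("<I", expected):
            continue
        start = pos + 4
        if stream[start:start + len(magic)] == magic:
            return start, stream[start:start + expected]

    return None
```

For a generic solution the magic argument would have to be tried with the signature of each format of interest, since matching on the 4 size bytes alone is more likely to give false hits. A result of None only means the guess about the layout did not hold for that stream.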
