Rainer,

Yes, this sounds like something to investigate. However, in my app no 
assumptions can be made about the data; it can be anything. Embeddings may be 
of any type, so I'm fishing for a generic solution.

- Dmitry

----- Original Message -----
From: Rainer Schwarze <[EMAIL PROTECTED]>
To: POI Users List <[email protected]>
Sent: Thu Aug 28 18:49:17 2008
Subject: Re: How to extract embedded files from Office 07

Dmitry Goldenberg wrote:
> Yegor,
>
> The first 8 bytes contain the standard MS Office magic number stuff - d0 cf 
> 11 e0 a1 b1 1a e1.
>
> Seems like they compress data in a proprietary way. I've read one post where 
> someone recommended the .NET Packaging API to crack these ...  Not a good 
> option ...
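
For reference, the signature check mentioned in the quoted text can be sketched
in a few lines of plain Java. This is only an illustration: the class and
method names (Ole2Magic, looksLikeOle2) are made up for this example and are
not part of POI. A match only says that the bytes are an OLE2 container, not
what is stored inside it.

```java
import java.util.Arrays;

final class Ole2Magic {
    // The 8-byte OLE2 signature quoted above: d0 cf 11 e0 a1 b1 1a e1.
    private static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Returns true if the buffer starts with the 8-byte OLE2 signature. */
    static boolean looksLikeOle2(byte[] data) {
        if (data == null || data.length < SIGNATURE.length) {
            return false;
        }
        // Compare only the leading 8 bytes against the signature.
        return Arrays.equals(Arrays.copyOf(data, SIGNATURE.length), SIGNATURE);
    }
}
```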

Hi Dmitry,

this may be interesting (unless you already found it):

http://www.nabble.com/Can-POIFS-convert-PDF-to-OLE-td18568081.html


Looking at such things I suspect this:

The data is inside "Ole10Native". This could be extracted using POIFS.
The structures there look like this:

[4 bytes] = size of structure including data
[???] = a few flags and strings (zero-terminated)
[4 bytes] = size of actually embedded binary data
[???] = the actual binary data

If you know that it is a ZIP file, you could search for the byte sequence
[size]"PK", where the expected [size] depends on the search position. Assume
you start immediately after the first 4 bytes (the total length); then the
expected size value is length-4. Step forward by one byte and check for the
sequence with size set to length-5, and so on. When the 6 bytes match the
expected [size]"PK" sequence, you can be fairly sure that "PK" marks the
start of the ZIP file and that [size] is its size.
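
A rough sketch of that search in plain Java, working on the raw bytes of the
"Ole10Native" stream once they have been read out of the container. Several
things here are assumptions and not confirmed facts about the format: the
integers are taken to be little-endian, the leading length is taken to count
the bytes that follow it, and the embedded data is taken to run to the end of
the structure. The names (Ole10NativeZipScan, findZipStart) are made up for
this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class Ole10NativeZipScan {
    /**
     * Returns the offset of the "PK" marker, or -1 if no candidate matches.
     * Assumptions: little-endian integers, the leading length counts the
     * bytes after it, and the embedded data runs to the end of the structure.
     */
    static int findZipStart(byte[] stream) {
        if (stream == null || stream.length < 10) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        long total = buf.getInt(0) & 0xFFFFFFFFL;  // first 4 bytes: total length

        // pos is where a candidate 4-byte size field would start.
        for (int pos = 4; pos + 6 <= stream.length; pos++) {
            long expected = total - pos;  // length-4 at pos 4, length-5 at pos 5, ...
            if (expected < 2) {
                break;  // no room left for even the "PK" marker
            }
            long size = buf.getInt(pos) & 0xFFFFFFFFL;
            if (size == expected && stream[pos + 4] == 'P' && stream[pos + 5] == 'K') {
                return pos + 4;  // offset of "PK", i.e. start of the ZIP data
            }
        }
        return -1;
    }
}
```

If a match is found at offset n, the ZIP data would be the bytes from n to the
end of the structure. For embeddings that are not ZIP files the same loop could
be tried with a different marker, but that only helps when the type is known.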

Of course nothing beats the analysis of the actual binary data structure
:-) (Would this be worth the effort for your purpose?)

Best wishes, Rainer
--
