Hi Antony,

At 01:14 14.11.2006, Antony Bowesman wrote:
>I'm using Lucene+POI to index documents.  Text extraction from a Word
>document fails, either using HDF WordDocument or HWPF WordExtractor.
>Essentially it is the same IOException of
>
>java.io.IOException: Unable to read entire block; 511 bytes read; expected 512 bytes

I'm not too sure about POIFS, but I would expect Word files that use
the OLE2 docfile format to have a size which is a multiple of 512.
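
If you want to check both points without a hex editor, something like
the following might do as a rough sketch. The class and method names
(Ole2Check, hasOle2Signature, isBlockAligned) are made up for
illustration and are not part of POI. It compares the first eight bytes
with the OLE2 compound-file signature (D0 CF 11 E0 A1 B1 1A E1) and
tests whether the length is a multiple of 512.

```java
import java.util.Arrays;

class Ole2Check {
    // Signature found at the start of an OLE2 compound file.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the data starts with the OLE2 signature.
    static boolean hasOle2Signature(byte[] data) {
        if (data.length < OLE2_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(data, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // True if the length is a non-zero multiple of 512 bytes.
    static boolean isBlockAligned(long length) {
        return length > 0 && length % 512 == 0;
    }
}
```

Feed it the raw bytes of the .doc file and the file length. A file that
has the signature but is not block aligned would match the "511 bytes
read" symptom.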

>The original does seem to be a Word document, i.e. it's not RTF and has a
>similar binary structure to other .doc files.

Do you know which version it is? (Can you find something like
"Word.Document.#" where # is a number in the hex dump?)
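
A small sketch of that search, in case a hex dump is not handy. It
assumes the string is stored as plain single-byte text, which is what
looking for it in a hex dump implies. WordProgIdFinder and findVersion
are hypothetical names, not POI API.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordProgIdFinder {
    // "Word.Document." followed by a short run of digits.
    private static final Pattern PROG_ID =
        Pattern.compile("Word\\.Document\\.(\\d{1,3})");

    // Returns the number after "Word.Document.", or -1 if it is not found.
    static int findVersion(byte[] data) {
        // ISO-8859-1 maps each byte to exactly one char, so no byte is lost.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        Matcher m = PROG_ID.matcher(text);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }
}
```

Call findVersion with the raw file bytes. A result of -1 only means the
string was not found in that form.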

>Is it usual to find documents that are not padded to 512 byte boundaries?
>Looking back at old Word documents from the '90s, I can see a number
>that are not.
[...]
>As an experiment I took one of the old docs and padded it suitably 
>and got the Exception
[...]
>Caused by: java.io.FileNotFoundException: no such entry: "0Table"

The Word files which HDF/HWPF can handle must have a table stream
with the name "0Table" or "1Table". So either the file is not an OLE2
docfile, or it is but does not have a table stream (not sure whether
the second case exists or not).
So my guess is that the '90s file is too old for HWPF, in that it is
not an OLE2 docfile.

>Does anyone know the rule about this?  Are non 512 byte padded documents
>invalid or just some older version of the doc format?

Maybe a quick indicator would be: Look for "0_T_a_b_l_e_" or
"1_T_a_b_l_e_" in the hex dump ('_' shall represent the 0x00 byte for
now). If it's not there, HWPF/HDF can't read it.
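
The same quick indicator as a sketch in Java: the stream names are
encoded as UTF-16LE, which produces exactly the character/0x00 pattern
described above, and then searched for in the raw file bytes.
TableStreamScan and its methods are hypothetical helper names, and this
is only a byte scan, not a real parse of the OLE2 directory.

```java
import java.nio.charset.StandardCharsets;

class TableStreamScan {
    // Naive search for needle inside haystack; returns the offset or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    // True if "0Table" or "1Table" occurs as UTF-16LE text in the data.
    static boolean mentionsTableStream(byte[] data) {
        byte[] zero = "0Table".getBytes(StandardCharsets.UTF_16LE);
        byte[] one = "1Table".getBytes(StandardCharsets.UTF_16LE);
        return indexOf(data, zero) >= 0 || indexOf(data, one) >= 0;
    }
}
```

If mentionsTableStream returns false for the raw bytes of a file, that
file is very likely one that HWPF/HDF cannot read.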

Best wishes,
Rainer

