I'm using Lucene+POI to index documents. Text extraction from a Word document fails, whether I use HDF WordDocument or HWPF WordExtractor. Essentially it is the same IOException in both cases:

java.io.IOException: Unable to read entire block; 511 bytes read; expected 512 bytes

coming from

org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:51)
org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:86)

The original Word document is 28671 bytes long, which is 1 byte short of a 512-byte boundary. If I use Word to remove just the final line of the document and resave it, the file becomes 28672 bytes, an exact multiple of 512.
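
For reference, the arithmetic above can be checked with a few lines of plain Java. This is only a sketch: BlockBoundaryCheck and missingBytes are names made up for illustration and are not part of POI, and the 512-byte block size is taken from the exception message.

```java
// Hypothetical helper, not part of POI: reports how many bytes a file
// length falls short of the next 512-byte boundary.
class BlockBoundaryCheck {
    // Block size taken from the "expected 512 bytes" exception message.
    static final int BLOCK_SIZE = 512;

    /** Bytes needed to reach the next multiple of BLOCK_SIZE (0 if already aligned). */
    static long missingBytes(long length) {
        long remainder = length % BLOCK_SIZE;
        return remainder == 0 ? 0 : BLOCK_SIZE - remainder;
    }

    public static void main(String[] args) {
        long[] lengths = {28671L, 28672L}; // original file, file resaved by Word
        for (long length : lengths) {
            System.out.println(length + " bytes: " + missingBytes(length)
                    + " short of a " + BLOCK_SIZE + "-byte boundary");
        }
    }
}
```

The same method can be pointed at File.length() to scan a directory of old documents for unaligned files.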

The original does seem to be a Word document, i.e. it's not RTF and it has a binary structure similar to that of other .doc files.
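
One way to make the "not RTF, looks like other .doc files" check less of an eyeball judgement is to compare the first bytes of the file against known signatures. The sketch below is plain Java and does not use POI; the class and method names are hypothetical. It assumes the usual OLE2 compound document signature (D0 CF 11 E0 A1 B1 1A E1) and the "{\rtf" prefix for RTF. A match only says what kind of container the file is, not which Word version wrote it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI: classifies a file by its leading bytes.
class DocSignatureCheck {
    // Signature at the start of an OLE2 compound document.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF files start with the text "{\rtf".
    static final byte[] RTF_MAGIC = {'{', '\\', 'r', 't', 'f'};

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns "OLE2", "RTF" or "unknown" for the given leading bytes. */
    static String classify(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return "OLE2";
        }
        if (startsWith(head, RTF_MAGIC)) {
            return "RTF";
        }
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java DocSignatureCheck <file>");
            return;
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // readNBytes (Java 9+) keeps reading until the buffer is full or EOF.
            read = in.readNBytes(head, 0, head.length);
        }
        System.out.println(args[0] + ": " + classify(Arrays.copyOf(head, read)));
    }
}
```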

Is it usual to find documents that are not padded to 512-byte boundaries? Looking back at old Word documents from the '90s, I can see a number that are not.

As an experiment I took one of the old docs, padded it suitably, and got this exception:

Caused by: java.io.FileNotFoundException: no such entry: "0Table"
        at org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:245)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:134)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:39)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:31)
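
The padding step in that experiment can be reproduced roughly as follows. This is a sketch under stated assumptions: it appends zero bytes (the choice of fill byte is an assumption, nothing in the format description I have says what the padding should contain), it writes to a separate output file so the original is untouched, and PadToBlock is a made-up name rather than anything from POI.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI: pads a file's bytes with zeros
// up to the next 512-byte boundary.
class PadToBlock {
    static final int BLOCK_SIZE = 512;

    /** Returns data unchanged if already aligned, else a zero-padded copy. */
    static byte[] pad(byte[] data) {
        int remainder = data.length % BLOCK_SIZE;
        if (remainder == 0) {
            return data;
        }
        // Arrays.copyOf fills the extra positions with zero bytes (assumed filler).
        return Arrays.copyOf(data, data.length + (BLOCK_SIZE - remainder));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java PadToBlock <input.doc> <output.doc>");
            return;
        }
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        byte[] padded = pad(original);
        Files.write(Paths.get(args[1]), padded);
        System.out.println(original.length + " -> " + padded.length + " bytes");
    }
}
```

Padding this way only gets the file past the block-size check; it says nothing about whether the contents are in a layout HWPF understands, which is where the "0Table" error above comes in.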

Does anyone know the rule about this? Are documents that are not padded to a 512-byte boundary invalid, or are they just some older version of the .doc format?

Can anyone shed any light on this?
Antony

