Word document extraction fails - bad file length

Antony Bowesman Mon, 13 Nov 2006 16:15:14 -0800

I'm using Lucene+POI to index documents. Text extraction from a Word document fails, using either HDF WordDocument or HWPF WordExtractor. In both cases it is essentially the same IOException:


java.io.IOException: Unable to read entire block; 511 bytes read; expected 512 bytes


coming from

org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:51)
org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:86)

The original Word document is 28671 bytes long, which is 1 byte short of a 512-byte boundary. If I use Word to just remove the final line of the document and resave it, it becomes 28672 bytes, an exact multiple of 512.
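
For illustration, the length arithmetic above can be checked with a few lines of Java. This is only a sketch: the class name BlockCheck and the method bytesShort are made up for this example and are not part of POI, and the 512-byte block size is taken from the exception message.

```java
import java.io.File;

class BlockCheck {
    // Block size reported in the IOException message ("expected 512 bytes").
    static final int BLOCK_SIZE = 512;

    /** Number of bytes missing before the next 512-byte boundary (0 if already aligned). */
    public static long bytesShort(long length) {
        long remainder = length % BLOCK_SIZE;
        return remainder == 0 ? 0 : BLOCK_SIZE - remainder;
    }

    public static void main(String[] args) {
        // Use the file given on the command line, or fall back to the length quoted in this post.
        long length = args.length > 0 ? new File(args[0]).length() : 28671L;
        System.out.println(length + " bytes, short by " + bytesShort(length));
    }
}
```

Run without arguments, it uses the 28671-byte length quoted above, which leaves a remainder of 511 and so is 1 byte short.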

The original does seem to be a Word document, i.e. it's not RTF and has a similar binary structure to other .doc files.

Is it usual to find documents that are not padded to 512-byte boundaries? Looking back at old Word documents from the '90s, I can see a number that are not.

As an experiment I took one of the old docs, padded it suitably, and got the exception


Caused by: java.io.FileNotFoundException: no such entry: "0Table"
        at org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:245)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:134)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:39)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:31)
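
For anyone wanting to repeat the padding experiment, it can be sketched roughly as follows. This is an assumption about one way to do it, not necessarily how the file was padded here: it appends zero bytes up to the next 512-byte boundary, and the class name PadToBlock and method pad are invented for the example. As the trace shows, padding only gets past the block-length check; it does not make an old document readable by HWPF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class PadToBlock {
    static final int BLOCK_SIZE = 512;

    /** Returns the data extended with zero bytes to the next multiple of 512. */
    public static byte[] pad(byte[] data) {
        int remainder = data.length % BLOCK_SIZE;
        if (remainder == 0) {
            return data; // already on a block boundary, nothing to add
        }
        // Arrays.copyOf fills the extra space with zeros.
        return Arrays.copyOf(data, data.length + (BLOCK_SIZE - remainder));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PadToBlock <input.doc> <output.doc>");
            return;
        }
        byte[] padded = pad(Files.readAllBytes(Paths.get(args[0])));
        Files.write(Paths.get(args[1]), padded);
        System.out.println("wrote " + padded.length + " bytes");
    }
}
```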

Does anyone know the rule about this? Are documents that are not padded to a 512-byte boundary invalid, or just some older version of the .doc format?


Can anyone shed any light...
Antony


