I'm using Lucene+POI to index documents. Text extraction from a Word document
fails whether I use the HDF WordDocument or the HWPF WordExtractor. Essentially
it is the same IOException in both cases:
java.io.IOException: Unable to read entire block; 511 bytes read; expected 512 bytes
coming from
org.apache.poi.poifs.storage.RawDataBlock.<init>(RawDataBlock.java:62)
org.apache.poi.poifs.storage.RawDataBlockList.<init>(RawDataBlockList.java:51)
org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:86)
The original Word document is 28671 bytes long, which is 1 byte short of a
512-byte boundary. If I use Word to remove just the final line of the document
and resave it, the file becomes 28672 bytes, an exact multiple of 512.
The original does seem to be a Word document, i.e. it's not RTF and has a
binary structure similar to other .doc files.
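One way to check this without opening a hex editor is to look at the first few
bytes of the file. The sketch below assumes the usual OLE2 compound file
signature (D0 CF 11 E0 A1 B1 1A E1) for binary .doc files and the "{\rtf"
prefix for RTF. The class name Ole2SignatureCheck is hypothetical, and
readNBytes needs Java 11 or later.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of POI: inspects the leading bytes of a file.
final class Ole2SignatureCheck {
    // Signature at the start of an OLE2 compound file (binary .doc, .xls, ...).
    static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // RTF files are plain text and begin with this prefix.
    static final byte[] RTF_PREFIX = "{\\rtf".getBytes(StandardCharsets.US_ASCII);

    // True if 'header' begins with all the bytes of 'prefix'.
    static boolean startsWith(byte[] header, byte[] prefix) {
        return header.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(header, prefix.length), prefix);
    }

    static boolean hasOle2Signature(byte[] header) {
        return startsWith(header, OLE2_SIGNATURE);
    }

    static boolean looksLikeRtf(byte[] header) {
        return startsWith(header, RTF_PREFIX);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            byte[] header;
            try (InputStream in = new FileInputStream(name)) {
                header = in.readNBytes(OLE2_SIGNATURE.length); // Java 11+
            }
            System.out.println(name + ": OLE2=" + hasOle2Signature(header)
                    + ", RTF=" + looksLikeRtf(header));
        }
    }
}
```

A matching signature only shows that the file starts like an OLE2 container; it
says nothing about whether the rest of the file is complete.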
Is it usual to find documents that are not padded to 512-byte boundaries?
Looking back at old Word documents from the '90s, I can see a number that are not.
As an experiment I took one of the old docs, padded it to a 512-byte boundary,
and got this exception instead:
Caused by: java.io.FileNotFoundException: no such entry: "0Table"
at org.apache.poi.poifs.filesystem.DirectoryNode.getEntry(DirectoryNode.java:245)
at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:134)
at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:39)
at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:31)
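For reference, the padding experiment can be sketched roughly as follows. This
is only an illustration: the class name BlockPadder is hypothetical, and
filling the gap with zero bytes is an assumption rather than anything the file
format is known to require. As the trace above shows, padding only gets past
the block read; it does not by itself make an old document parseable by HWPF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI: pads data up to a 512-byte boundary.
final class BlockPadder {
    // Block size taken from the "expected 512 bytes" exception message.
    static final int BLOCK_SIZE = 512;

    // Returns the data extended with zero bytes (an assumption) so that its
    // length is a multiple of BLOCK_SIZE; aligned input is returned as-is.
    static byte[] padToBlockBoundary(byte[] data) {
        int remainder = data.length % BLOCK_SIZE;
        if (remainder == 0) {
            return data;
        }
        // Arrays.copyOf fills the extra space with zeros.
        return Arrays.copyOf(data, data.length + (BLOCK_SIZE - remainder));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java BlockPadder <input.doc> <padded-output.doc>
        byte[] original = Files.readAllBytes(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), padToBlockBoundary(original));
    }
}
```

Writing to a separate output file keeps the original document untouched for
comparison.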
Does anyone know the rule about this? Are documents that are not padded to a
512-byte boundary invalid, or are they just an older version of the .doc format?
Can anyone shed any light...
Antony