Hi Antony,

At 01:14 14.11.2006, Antony Bowesman wrote:
>I'm using Lucene+POI to index documents. Text extraction from a Word document
>fails, either using HDF WordDocument or HWPF WordExtractor. Essentially it is
>the same IOException of
>
>java.io.IOException: Unable to read entire block; 511 bytes read;
>expected 512 bytes

I'm not too sure about POIFS, but I would expect Word files that use the OLE2 docfile format to have a size that is a multiple of 512 bytes.

>The original does seem to be a Word document, i.e. it's not RTF and has similar
>binary structure as other .doc files.

Do you know which version it is? (Can you find something like "Word.Document.#", where # is a number, in the hex dump?)

>Is it usual to find documents that are not padded to 512 byte boundaries.
>Looking back at old Word documents from the '90s, I can see a number that are not.
[...]
>As an experiment I took one of the old docs and padded it suitably and got the Exception
[...]
>Caused by: java.io.FileNotFoundException: no such entry: "0Table"

The Word files that HDF/HWPF can handle must have a table stream named "0Table" or "1Table". So either the file is not an OLE2 docfile, or it is one but has no table stream (I am not sure whether the second case exists). My guess is that the '90s file is too old for HWPF, in that it is not an OLE2 docfile.

>Does anyone know the rule about this? Are non 512 byte padded documents invalid
>or just some older version of the doc format.

A quick indicator might be: look for "0_T_a_b_l_e_" or "1_T_a_b_l_e_" in the hex dump ('_' stands for the 0x00 byte here). If it's not there, HWPF/HDF can't read it.

Best wishes,
Rainer
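
P.S. The hex-dump check described above could be sketched in plain Java roughly as follows. This is only a heuristic under a few assumptions: the class name DocProbe and its methods are made up for illustration and are not part of POI, the stream names are assumed to be stored as UTF-16LE (which is what "0_T_a_b_l_e_" suggests), and the eight signature bytes are the usual OLE2 docfile header. A raw byte scan can give false positives, so treat the result as a hint rather than proof.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of POI: quick pre-checks before handing a file to HWPF.
final class DocProbe {
    // Signature expected in the first 8 bytes of an OLE2 docfile.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the data starts with the OLE2 signature.
    static boolean isOle2(byte[] data) {
        if (data.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (data[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // True if the size is a non-zero multiple of 512 bytes.
    static boolean isBlockAligned(long size) {
        return size > 0 && size % 512 == 0;
    }

    // True if "0Table" or "1Table" occurs anywhere as UTF-16LE bytes.
    static boolean hasTableStream(byte[] data) {
        return indexOf(data, "0Table".getBytes(StandardCharsets.UTF_16LE)) >= 0
            || indexOf(data, "1Table".getBytes(StandardCharsets.UTF_16LE)) >= 0;
    }

    // Naive byte-pattern search; returns the first match offset or -1.
    static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocProbe <file.doc>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("OLE2 signature:   " + isOle2(data));
        System.out.println("512-byte aligned: " + isBlockAligned(data.length));
        System.out.println("table stream:     " + hasTableStream(data));
    }
}
```

Run it as "java DocProbe old.doc" (old.doc being whatever file fails). If the signature or the table stream line prints false, I would not expect HWPF/HDF to read the file.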