Hi Paolo,

When I try to use the http://textmining.org library, I get the text, but only after a long beep sound, and the system seems to hang, whenever my Word document contains a table. How can I control that?
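A possible explanation, not confirmed anywhere in this thread: the Word binary format ends each table cell with the control character 0x07 (BEL), and a console that receives many of those will beep for each one, which can look like a hang. The sketch below assumes that is the cause and that the extracted text is being printed to a console. The class name WordTextCleaner and its clean method are hypothetical helpers, not part of textmining.org or POI.

```java
// Hypothetical helper (not part of textmining.org or POI): makes text
// extracted from a Word file safe to print on a console.
final class WordTextCleaner {

    private WordTextCleaner() {
    }

    static String clean(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (c == '\u0007') {
                // Word ends each table cell with BEL (0x07); a tab keeps cells apart.
                sb.append('\t');
            } else if (c == '\t' || c == '\n' || c == '\r' || c >= ' ') {
                // Keep ordinary whitespace and every printable character.
                sb.append(c);
            }
            // Any other control character is dropped silently.
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two table rows, each with two cells, as they might appear in raw output.
        String raw = "Name\u0007Age\u0007\u0007Bob\u000742\u0007\u0007";
        System.out.println(clean(raw));
    }
}
```

Passing the extractor's result through clean(...) before printing it should silence the beeps if the assumption holds. If the hang persists even when nothing is printed, the cause lies elsewhere.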
-----Original Message-----
From: Paolo Tortora [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 28, 2004 5:49 PM
To: POI Users List; Ryan Ackley
Subject: R: Word to plain text converter

Here is a piece of Java code to extract plain text from Word:

/**************** START SOURCE CODE *******************/
import org.apache.poi.poifs.filesystem.*;
import org.apache.poi.util.LittleEndian;

import java.util.ArrayList;
import java.io.InputStream;
import java.io.IOException;

class WordExtractor {

    public WordExtractor() {
    }

    public String extractText(InputStream in) throws IOException {
        ArrayList text = new ArrayList();
        POIFSFileSystem fsys = new POIFSFileSystem(in);

        DocumentEntry headerProps = (DocumentEntry)fsys.getRoot().getEntry("WordDocument");
        DocumentInputStream din = fsys.createDocumentInputStream("WordDocument");
        byte[] header = new byte[headerProps.getSize()];
        din.read(header);
        din.close();

        // Get the information from the document header
        int info = LittleEndian.getShort(header, 0xa);
        boolean useTable1 = (info & 0x200) != 0;

        // Get information from the piece table
        int complexOffset = LittleEndian.getInt(header, 0x1a2);

        String tableName = null;
        if (useTable1) {
            tableName = "1Table";
        } else {
            tableName = "0Table";
        }

        DocumentEntry table = (DocumentEntry)fsys.getRoot().getEntry(tableName);
        byte[] tableStream = new byte[table.getSize()];
        din = fsys.createDocumentInputStream(tableName);
        din.read(tableStream);
        din.close();

        din = null;
        fsys = null;
        table = null;
        headerProps = null;

        int multiple = findText(tableStream, complexOffset, text);

        StringBuffer sb = new StringBuffer();
        int size = text.size();
        tableStream = null;

        for (int x = 0; x < size; x++) {
            WordTextPiece nextPiece = (WordTextPiece)text.get(x);
            int start = nextPiece.getStart();
            int length = nextPiece.getLength();
            boolean unicode = nextPiece.usesUnicode();
            String toStr = null;
            if (unicode) {
                toStr = new String(header, start, length * multiple, "UTF-16LE");
            } else {
                toStr = new String(header, start, length, "ISO-8859-1");
            }
            sb.append(toStr).append(" ");
        }
        return sb.toString();
    }

    private static int findText(byte[] tableStream, int complexOffset, ArrayList text)
            throws IOException {
        //actual text
        int pos = complexOffset;
        int multiple = 2;

        //skips through the prms before we reach the piece table. These contain data
        //for actual fast saved files
        while (tableStream[pos] == 1) {
            pos++;
            int skip = LittleEndian.getShort(tableStream, pos);
            pos += 2 + skip;
        }

        if (tableStream[pos] != 2) {
            throw new IOException("corrupted Word file");
        } else {
            //parse out the text pieces
            int pieceTableSize = LittleEndian.getInt(tableStream, ++pos);
            pos += 4;
            int pieces = (pieceTableSize - 4) / 12;
            for (int x = 0; x < pieces; x++) {
                int filePos = LittleEndian.getInt(tableStream,
                        pos + ((pieces + 1) * 4) + (x * 8) + 2);
                boolean unicode = false;
                if ((filePos & 0x40000000) == 0) {
                    unicode = true;
                } else {
                    unicode = false;
                    multiple = 1;
                    filePos &= ~(0x40000000); //gives me FC in doc stream
                    filePos /= 2;
                }
                int totLength = LittleEndian.getInt(tableStream, pos + (x + 1) * 4)
                        - LittleEndian.getInt(tableStream, pos + (x * 4));

                WordTextPiece piece = new WordTextPiece(filePos, totLength, unicode);
                text.add(piece);
            }
        }
        return multiple;
    }
}

class WordTextPiece {
    private int _fcStart;
    private boolean _usesUnicode;
    private int _length;

    public WordTextPiece(int start, int length, boolean unicode) {
        _usesUnicode = unicode;
        _length = length;
        _fcStart = start;
    }

    public boolean usesUnicode() {
        return _usesUnicode;
    }

    public int getStart() {
        return _fcStart;
    }

    public int getLength() {
        return _length;
    }
}
/************** END SOURCE CODE **************/

-----Original Message-----
From: Ryan Ackley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 28, 2004 12:16 AM
To: POI Users List
Subject: Re: Word to plain text converter

http://textmining.org

----- Original Message -----
From: "Dimitri Pissarenko" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, January 27, 2004 4:54 PM
Subject: Word to plain text converter

Hello!

I want to convert several Microsoft Word files to plain text files, so that I can search through them with grep (or with analogous search functions under Windows).

Has someone already written such a converter? Is this tool perhaps open-source?

TIA

dap

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
