Re: Indexing MSword Documents

2007-06-09 Thread jim shirreffs
UN_TOKENIZED));

    /* Add title */
    doc.add(new Field("kcmititle", title, Field.Store.YES, Field.Index.UN_TOKENIZED));

    /* return the document */
    return doc;
  }
}

- Original Message -
From: "Wayne Graham" <[EMAIL PROTECTED]>
To:
Sent: Fr

Re: Indexing MSword Documents

2007-06-08 Thread Wayne Graham
me"
> <[EMAIL PROTECTED]>
> To:
> Sent: Friday, June 08, 2007 12:48 PM
> Subject: Re: Indexing MSword Documents
>
> Why not use Document?
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.htm

Re: Indexing MSword Documents

2007-06-08 Thread jim shirreffs
taking the time to reply

jim s

- Original Message -
From: "Mathieu Lecarme" <[EMAIL PROTECTED]>
To:
Sent: Friday, June 08, 2007 12:48 PM
Subject: Re: Indexing MSword Documents

Why not use Document?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightl

Re: Indexing MSword Documents

2007-06-08 Thread jim shirreffs
Many thanks, I will try that. Thanks again!

jim s

- Original Message -
From: "Donna L Gresh" <[EMAIL PROTECTED]>
To:
Sent: Friday, June 08, 2007 12:52 PM
Subject: Re: Indexing MSword Documents

I do this exact thing. "text" (the second input to the Field co

Re: Indexing MSword Documents

2007-06-08 Thread Donna L Gresh
I do this exact thing. "text" (the second argument to the Field constructor) is the MS Word text that I've extracted from the Word document:

    textField = new org.apache.lucene.document.Field(
        textFieldName,
        text,
        org.apache.lucene.document.Field.Store.NO,
        org.apache.lucene.document.Field.Index.TOKENIZED);
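
For readers unsure what Field.Index.TOKENIZED changes compared with Field.Index.UN_TOKENIZED (both appear in this thread), here is a toy sketch in plain Java. It is not Lucene code and makes no claim about how Lucene's analyzers actually split text: the class ToyFieldIndexing and its two methods are hypothetical names, and the lowercase-and-split rule is only an assumption standing in for a real analyzer. The point is just that a tokenized field is searchable by individual words, while an un-tokenized field is indexed as one exact term.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Toy model only: NOT Lucene code. Names and splitting rule are assumptions.
final class ToyFieldIndexing {

    // Roughly the idea behind Field.Index.TOKENIZED: the value is broken
    // into separate terms (here: lowercased runs of letters and digits).
    static List<String> tokenizedTerms(String value) {
        List<String> terms = new ArrayList<>();
        String lowered = value.toLowerCase(Locale.ROOT);
        for (String piece : lowered.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {   // split() can yield a leading empty string
                terms.add(piece);
            }
        }
        return terms;
    }

    // Roughly the idea behind Field.Index.UN_TOKENIZED: the whole value
    // is kept as a single term, unchanged.
    static List<String> untokenizedTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        String title = "Quarterly Sales Report";
        System.out.println(tokenizedTerms(title));    // [quarterly, sales, report]
        System.out.println(untokenizedTerms(title));  // [Quarterly Sales Report]
    }
}
```

Under this model, body text extracted from a Word file would normally be tokenized so that single-word queries can match it, whereas an identifier-like field only matches when the query is the exact stored value.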

Re: Indexing MSword Documents

2007-06-08 Thread Mathieu Lecarme
Why not use Document?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html

HTMLDocument handles HTML-specific details such as encoding and headers. Nutch uses dedicated Word tools (http://lucene.apache.org/nutch/apidocs/org/ap

Indexing MSword Documents

2007-06-08 Thread jim shirreffs
Hi, I am trying to index msword documents. I've got things working but I do not think I am doing things properly. To index msword docs I use an extractor to extract the text. Then I write the text to a .txt file and index that using an HTMLDocument object. Seems to me that since I have the te