First, when asking a new question, it's best to start a new thread with
its own subject line. Your question has nothing to do with the rest of
this thread.

That said, you want to create a Reader to pass along. I'd think about
doing this by making your MSWord class a subclass of java.io.Reader
and implementing its abstract read and close methods.
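
A minimal sketch of that idea, assuming the extracted text is already
held in a String. The class name ExtractedTextReader is made up for
illustration and is not part of Lucene or POI. Note that
java.io.StringReader already behaves this way, so it may be all you need.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical Reader over text that is already in memory. */
class ExtractedTextReader extends Reader {
    private final String text;      // text pulled out of the Word file
    private int pos = 0;            // index of the next character to return
    private boolean closed = false;

    ExtractedTextReader(String text) {
        if (text == null) throw new NullPointerException("text");
        this.text = text;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (closed) throw new IOException("reader closed");
        if (off < 0 || len < 0 || len > buf.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) return 0;
        if (pos >= text.length()) return -1;          // end of text
        int n = Math.min(len, text.length() - pos);   // chars still available
        text.getChars(pos, pos + n, buf, off);        // copy into caller's buffer
        pos += n;
        return n;
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

Assuming your Lucene version has a Field constructor that accepts a
Reader, an instance of this class (or a plain StringReader) built from
extractor.getText() could be passed to it directly, with no temporary
file involved.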

Best
Erick

On 6/8/07, jim shirreffs <[EMAIL PROTECTED]> wrote:


I am trying to index msword documents. I've got things working but I do
not think I am doing things properly.

To index msword docs I use an extractor to extract the text. Then I write
the text to a .txt file and index that using an HTMLDocument object. Seems
to me that since I have the text I should be able to just do a

        Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at Document.java it seems the field "content" requires a
reader. So I write a temporary file to satisfy that requirement.

What I would like to have is an MSWORDDocument class that would take the
extracted text as an argument to the constructor and create a Document
object that I could get.

If anyone has any ideas, please let me know.

Here is a code segment. Notice the msword hack:


/*
 * make a document
 */
try
{
   if (ftype.startsWith("text"))
   {
      doc = HTMLDocument.Document(f);
   }
   else if (ftype.equals("application/pdf"))
   {
      doc = LucenePDFDocument.getDocument(f);
   }
   else if (ftype.equals("application/msword"))
   {
      FileInputStream fin = new FileInputStream(f.getAbsolutePath());
      WordExtractor extractor = new WordExtractor(fin);
      String content = extractor.getText();
      if(debug) System.out.println(content);
      String tempFileName=f.getAbsolutePath() + ".txt";
      BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, false));
      bw.write((String) content.toString());
      bw.close();
      File df = new File(tempFileName);
      doc = HTMLDocument.Document(df);
      df.delete();
   }
   else if (ftype.equals("binary"))
   {
      return null;
   }
   else
   {
      if(debug) System.out.println("Unknown file type not ascii or pdf.");
      doc = HTMLDocument.Document(f);
   }
}
catch(java.lang.InterruptedException ie)
{
   throw ie;
}
catch(java.io.IOException ioe)
{
   throw ioe;
}





Thanks in advance

