Hi,

I was hoping you could check whether I'm indexing inefficiently.  The relevant
parts of my code are below.  I've looked at the "benchmarks" page on the Lucene
site, and my indexing takes substantially longer than the times posted there.
I'm also not sure when I should call flush() (I saw on the
ImproveIndexingSpeed page that I should be doing that).  I'd really appreciate
any advice.

Here's my code:

      File directory = new File("/mounts/falcon5/disks/0/tcheng3/Dataset");
      File[] theFiles = directory.listFiles();

      // go through each file inside the directory and index it
      for (int curFile = 0; curFile < theFiles.length; curFile++)
      {
          File fin = theFiles[curFile];

          // open up the file
          FileInputStream inf = new FileInputStream(fin);
          InputStreamReader isr = new InputStreamReader(inf, "US-ASCII");
          BufferedReader in = new BufferedReader(isr);
          String text = "";
          String docid = "";

          while (true) {

              // read in the file one line at a time, and act accordingly
              String line = in.readLine();
              if (line == null) { break; }

              if (line.startsWith("<DOC>")) {
                  // get docID
                  line = in.readLine();
                  String tempStr = line.substring(8, line.length());
                  int pos = tempStr.indexOf(' ');
                  docid = tempStr.substring(0, pos);
              } else if (line.startsWith("</DOC>")) {

                  Document doc = new Document();

                  doc.add(new Field("contents", text, Field.Store.NO,
                          Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS));
                  doc.add(new Field("DocID", docid, Field.Store.YES,
                          Field.Index.NO));
                  writer.addDocument(doc);
                  text = "";
              } else {
                  text = text + "\n" + line;
              }
          }

      }


      int numIndexed = writer.docCount();

      writer.optimize();
      writer.close();
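
In case it helps to time the parsing separately from Lucene, here is a minimal
sketch of the parsing loop above with no Lucene calls.  The class name
DocSplitter and the method split() are hypothetical names used only for
illustration.  The "<DOCNO> id </DOCNO>" layout of the line after <DOC> is an
assumption based on the substring(8, ...) offset in my code.  It also
accumulates text in a StringBuilder rather than with the string concatenation
used above, so it is not identical to what I am actually running.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): splits <DOC>...</DOC> input into
// {docid, text} pairs using the same rules as the indexing loop.
public class DocSplitter {

    public static List<String[]> split(BufferedReader in) throws IOException {
        List<String[]> docs = new ArrayList<String[]>();
        StringBuilder text = new StringBuilder();
        String docid = "";
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("<DOC>")) {
                // Assumption: the next line looks like "<DOCNO> id </DOCNO>",
                // so the id starts at offset 8 and ends at the next space.
                line = in.readLine();
                if (line == null) { break; }
                String tempStr = line.substring(8);
                int pos = tempStr.indexOf(' ');
                docid = tempStr.substring(0, pos);
            } else if (line.startsWith("</DOC>")) {
                // Where the real code would build a Document and call
                // writer.addDocument(doc).
                docs.add(new String[] { docid, text.toString() });
                text.setLength(0);
            } else {
                // Appends in place instead of copying the whole string
                // on every line.
                text.append('\n').append(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        String sample = "<DOC>\n<DOCNO> A-1 </DOCNO>\nhello\nworld\n</DOC>\n"
                      + "<DOC>\n<DOCNO> B-2 </DOCNO>\nsecond\n</DOC>\n";
        BufferedReader in = new BufferedReader(new StringReader(sample));
        for (String[] d : split(in)) {
            System.out.println(d[0] + " (" + d[1].length() + " chars)");
        }
    }
}
```

Running split() over one of the dataset files and timing it on its own would
show how much of the total time is parsing rather than Lucene indexing.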


Thanks,

--JP
