Hi,
I was hoping you could check whether I'm somehow indexing inefficiently;
the relevant parts of my code are below. I've looked at the benchmarks
page on the Lucene site, and my indexing takes substantially longer than
the times posted there. I'm also not sure when I should call flush()
(the ImproveIndexingSpeed page says I should be calling it). I'd really
appreciate any advice.
Here's my code:
File directory = new File("/mounts/falcon5/disks/0/tcheng3/Dataset");
File[] theFiles = directory.listFiles();
// go through each file inside the directory and index it
for (int curFile = 0; curFile < theFiles.length; curFile++) {
    File fin = theFiles[curFile];
    // open up the file
    FileInputStream inf = new FileInputStream(fin);
    InputStreamReader isr = new InputStreamReader(inf, "US-ASCII");
    BufferedReader in = new BufferedReader(isr);
    String text = "";
    String docid = "";
    while (true) {
        // read in the file one line at a time, and act accordingly
        String line = in.readLine();
        if (line == null) { break; }
        if (line.startsWith("<DOC>")) {
            // get docID from the line following <DOC>
            line = in.readLine();
            String tempStr = line.substring(8);
            int pos = tempStr.indexOf(' ');
            docid = tempStr.substring(0, pos);
        } else if (line.startsWith("</DOC>")) {
            Document doc = new Document();
            doc.add(new Field("contents", text, Field.Store.NO,
                    Field.Index.TOKENIZED, Field.TermVector.WITH_POSITIONS));
            doc.add(new Field("DocID", docid, Field.Store.YES,
                    Field.Index.NO));
            writer.addDocument(doc);
            text = "";
        } else {
            text = text + "\n" + line;
        }
    }
    in.close(); // close the reader so file handles aren't leaked
}
int numIndexed = writer.docCount();
writer.optimize();
writer.close();
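In case the snippet above is hard to follow out of context, here is a
self-contained reduction of just the parsing part, with the Lucene calls
stubbed out. The sample input and the DocText holder class are hypothetical,
and it assumes my files are TREC-style, i.e. the line after <DOC> looks like
"<DOCNO> id </DOCNO>" (8 characters before the id). Note that my real loop
builds text with plain String concatenation, not the StringBuilder used here.

import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ParseSketch {
    // Hypothetical holder for what my real code passes to Lucene Fields.
    public static class DocText {
        public final String docid;
        public final String text;
        DocText(String docid, String text) { this.docid = docid; this.text = text; }
    }

    // Same control flow as my indexing loop, minus the IndexWriter calls.
    public static List<DocText> parse(BufferedReader in) throws Exception {
        List<DocText> docs = new ArrayList<>();
        String docid = "";
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("<DOC>")) {
                // the docid sits on the line following <DOC>
                line = in.readLine();
                String tempStr = line.substring(8);
                docid = tempStr.substring(0, tempStr.indexOf(' '));
            } else if (line.startsWith("</DOC>")) {
                docs.add(new DocText(docid, text.toString()));
                text.setLength(0);
            } else {
                text.append('\n').append(line);
            }
        }
        return docs;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document in the assumed TREC-style layout.
        String sample = "<DOC>\n<DOCNO> FT911-1 </DOCNO>\nsome body text\n</DOC>\n";
        List<DocText> docs = parse(new BufferedReader(new StringReader(sample)));
        System.out.println(docs.get(0).docid + " | " + docs.get(0).text.trim());
    }
}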
Thanks,
--JP