It feels to me like your major problem might be file I/O with all those files. There's no need to split the files up first and then index them. Just read through the log and index each row as its own document. The code fragment you posted should let you get the line back from the "line" field of each document, so when you execute searches you also avoid the file I/O. Once indexed, you shouldn't have to go back to disk to get the data at all....
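Something like this is what I have in mind. Just a sketch against the old IndexWriter API, not your actual code; the index path, buffer size, and analyzer are placeholders for whatever you're already using:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class LogIndexer {
    public static void main(String[] args) throws Exception {
        // One writer for the whole run -- no intermediate split files.
        IndexWriter writer = new IndexWriter("/path/to/index",
                                             new StandardAnalyzer(), true);
        writer.setMaxBufferedDocs(1000); // the MaxBufferedDocs knob -- tune it

        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            Document doc = new Document();
            // Tokenized for searching AND stored, so a search result hands
            // you the line back without touching the log file again.
            doc.add(new Field("line", line,
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        reader.close();

        writer.optimize();
        writer.close();
    }
}
```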
This scares me...

*********************
....Then when I search for a term, I get the log-file names that contain the data. Then I buffer-read those files and find out rows containing the data....
*********************

According to the fragment you posted, you already have the data in the index and don't need to go to disk to get it; that extra disk trip could be a major issue..... Are we still talking about indexing speed, or have we segued to search speed? Jeremy's suggestion about MaxBufferedDocs is worth heeding.....

One final point of confusion.... Watch the Hits object when searching. Hits is built for getting the first few hits (100 or so). If you iterate over more documents than that, the Hits object will re-execute your query every 100 or so documents. You'll want to think about a HitCollector or TopDocs object for large result sets.

Again, the quick timing test is to cut off your responses at, say, 50 and see how long the search takes as opposed to collecting all responses. Ditto for the file I/O: if we're talking about search speed, try it without reading the files....

Keep us posted
Erick
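P.S. Here's roughly what I mean about skipping Hits for big result sets. Again just a sketch against the old Searcher API; the index path, field name, and the 50-hit cutoff are only for illustration:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class LogSearcher {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = new QueryParser("line", new StandardAnalyzer())
                              .parse(args[0]);

        // Option 1: TopDocs with a cutoff -- no query re-execution, and the
        // stored "line" field comes straight out of the index.
        TopDocs top = searcher.search(query, null, 50);
        for (ScoreDoc sd : top.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("line"));
        }

        // Option 2: a HitCollector if you really need to visit every match.
        final int[] count = {0};
        searcher.search(query, new HitCollector() {
            public void collect(int doc, float score) {
                count[0]++; // just counting here; fetch docs lazily if needed
            }
        });
        System.out.println(count[0] + " total matches");

        searcher.close();
    }
}
```

Timing the 50-hit version against the collect-everything version should tell you pretty quickly whether it's the search itself or the result collection (and any file reading) that's eating your time.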