Thanks, all, for the responses. I am very pleasantly surprised at how helpful they have been.
Okay, I think I still haven't understood Lucene well, and I am sure I am not solving the problem the right way. So let me explain the problem at a high level - please tell me what my design should be.

I have GBs of logs where each row is of the form "Col1#Col2#Col3#Col4#Col5...". I want to be able to search on Col1 or Col2 and get back every row that matches.

What I do now is run a shell script to split the logs into smaller files of 1 MB each, then index all the files just as the Lucene example does. When I search for a term, I get the names of the log files that contain it, and I then buffer-read those files to find the matching rows. I am quite sure this is a bad way to solve the problem.

There should be some way of telling Lucene that only the two columns Col1 and Col2 need to be searchable and that the rest can be skipped, and some way of storing the index so that a search on Col1 or Col2 returns the complete row rather than the name of the file containing it.

I tried making each row a document, but as my first mail says, I did not get the performance I wanted. I am going to run some checks (as Erick suggested), but Doron's mail has made me wonder whether I am doing this right at all. Can you please help me understand how this problem can best be solved?

Thanks a lot for the help so far.
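In case it helps, this is the design I am imagining: each row becomes one document, only the two searchable columns are indexed (as single untokenized terms, since they are opaque values rather than words), and the whole row is stored so that a hit returns it directly. A rough sketch of what I mean - the field names, paths, and class name are just placeholders, and I am assuming the same Lucene 2.0-era API as in the snippets quoted below:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class RowIndexer {

        public static void index(String logFile, String indexDir) throws Exception {
            // The analyzer is not applied to untokenized fields,
            // but IndexWriter requires one anyway.
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
            BufferedReader in = new BufferedReader(new FileReader(logFile));
            String line;
            while ((line = in.readLine()) != null) {
                String[] cols = line.split("#");
                Document doc = new Document();
                // Only Col1 and Col2 need to be searchable: index each as a
                // single untokenized term, and don't store them separately.
                doc.add(new Field("col1", cols[0], Field.Store.NO, Field.Index.UN_TOKENIZED));
                doc.add(new Field("col2", cols[1], Field.Store.NO, Field.Index.UN_TOKENIZED));
                // Store (but don't index) the whole row, so a hit returns it as-is.
                doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));
                writer.addDocument(doc);
            }
            in.close();
            writer.optimize();
            writer.close();
        }

        public static void search(String indexDir, String value) throws Exception {
            IndexSearcher searcher = new IndexSearcher(indexDir);
            // Search on col1; col2 works the same way.
            Hits hits = searcher.search(new TermQuery(new Term("col1", value)));
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("line")); // the complete row
            }
            searcher.close();
        }
    }

Is that roughly the right shape?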
On 7/26/06, Mike Streeton <[EMAIL PROTECTED]> wrote:

The only way you might get the performance you want is to have multiple IndexWriters writing to different indexes and then add them all together at the end. You would obviously have to handle the multi-threading and the distribution of the parts of the log to each writer.

Mike

www.ardentia.com the home of NetSearch

-----Original Message-----
From: Doron Cohen [mailto:[EMAIL PROTECTED]
Sent: 25 July 2006 22:23
To: java-user@lucene.apache.org
Subject: Re: Index Rows as Documents? Help me design a solution

Few comments -

> (from first posting in this thread)
> The indexing was taking much more than minutes for a 1 MB log file. ...
> I would expect to be able to index at least a GB of logs within 1 or 2 minutes.

1-2 minutes per GB would be 30-60 GB/hour, which is a lot for a single machine/JVM - at least, I have not seen Lucene index that fast.

> doc.add(new Field("msisdn", columns[0], Field.Store.YES, Field.Index.TOKENIZED));
> doc.add(new Field("messageid", columns[2], Field.Store.YES, Field.Index.TOKENIZED));

Is it really required to analyze the text for these fields - "msisdn", "messageid"?

> doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));

This is storing the original text of every input line that is indexed - quite an overhead.

- Doron
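If I follow Doron's point: msisdn and messageid are opaque tokens (a subscriber number and an ID, nothing to split into words), so I should be able to index them untokenized and skip analysis for those fields entirely. Something like this, I assume (same Lucene API as in my sketch above):

    // If the column values are single opaque tokens, analysis buys nothing:
    // index them as untokenized terms instead.
    doc.add(new Field("msisdn", columns[0], Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("messageid", columns[2], Field.Store.YES, Field.Index.UN_TOKENIZED));

And whether the stored "line" field can be dropped depends on whether I really need the whole row back from a hit - in my case I do.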
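And if I understand Mike's suggestion, the parallel version would look roughly like the sketch below: each thread indexes its own chunk of the log into a private directory, and the per-thread indexes are merged at the end with IndexWriter.addIndexes(). The chunking of the log and the directory paths are placeholders, and the per-row indexing loop would be the same as in my sketch above.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ParallelIndexer {

        public static void main(String[] args) throws Exception {
            final String[] chunks = args; // one pre-split log chunk per thread
            final Directory[] parts = new Directory[chunks.length];
            Thread[] workers = new Thread[chunks.length];

            for (int i = 0; i < chunks.length; i++) {
                final int n = i;
                parts[n] = FSDirectory.getDirectory("/tmp/part" + n, true);
                workers[n] = new Thread() {
                    public void run() {
                        try {
                            IndexWriter w = new IndexWriter(parts[n], new StandardAnalyzer(), true);
                            // ... per-row indexing loop over chunks[n], as in
                            // the RowIndexer sketch above ...
                            w.close();
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                };
                workers[n].start();
            }
            for (int i = 0; i < workers.length; i++) {
                workers[i].join();
            }

            // Merge all the per-thread indexes into one final index.
            IndexWriter merged = new IndexWriter("/tmp/finalindex", new StandardAnalyzer(), true);
            merged.addIndexes(parts);
            merged.optimize();
            merged.close();
        }
    }

Is addIndexes() cheap, or does the final merge eat back the time saved by indexing in parallel?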