Thanks all for the responses. I am very pleasantly surprised at the helpful
responses that I am getting.

Okay, I think I still haven't understood Lucene well, and I am sure I am not
solving the problem the right way. So let me explain the problem at a very
high level - please tell me what my design should be:

I have GBs of logs where each row has the form
"Col1#Col2#Col3#Col4#Col5...". I want to be able to search the logs by Col1
or Col2 and get back all the rows that match on either of those two columns.

What I do now is run a shell script to split the logs into smaller files of
1 MB each, then index all the files just as the Lucene demo does. When I
search for a term, I get back the names of the log files that contain it,
and I then buffer-read those files to find the rows containing the data.

I am quite sure this is a bad way of solving the problem. There should be a
way to tell Lucene that only the two columns Col1 and Col2 need to be
searchable, and that it can skip the rest. There should also be a way to
tell Lucene to store the index so that a search on Col1 or Col2 returns the
complete row, instead of the name of the file containing it.
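
Something like this is what I imagine (a rough sketch against the Lucene 2.x
API; the field and variable names are just placeholders):

    // Index only the two searchable columns; skip everything else.
    doc.add(new Field("col1", columns[0], Field.Store.NO, Field.Index.UN_TOKENIZED));
    doc.add(new Field("col2", columns[1], Field.Store.NO, Field.Index.UN_TOKENIZED));
    // Store (but do not index) the complete row, so a hit returns the row itself.
    doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);

and then at search time:

    Hits hits = searcher.search(new TermQuery(new Term("col1", value)));
    for (int i = 0; i < hits.length(); i++) {
        String row = hits.doc(i).get("line"); // the complete original row
    }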

I tried making each row a document, but as my first mail says, I didn't get
the kind of performance I wanted. I am going to run some checks (as Erick
suggested), but Doron's email has made me wonder whether I am doing this
right at all.

Can you guys please help me understand how this problem can be best solved?

Thanks a lot for the help so far

On 7/26/06, Mike Streeton <[EMAIL PROTECTED]> wrote:

The only way you might get the performance you want is to have multiple
IndexWriters writing to different indexes and then merge them with
addIndexes at the end. You would obviously have to handle the multithreading
and the distribution of the parts of the log to each writer.
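
Roughly like this (a sketch against the Lucene 2.x API; the paths and the
number of parts are made up):

    // Each thread writes its own partial index (e.g. /tmp/part0, /tmp/part1).
    // Once all the writers are done, merge the parts into one index:
    IndexWriter merged = new IndexWriter("/tmp/merged", new StandardAnalyzer(), true);
    Directory[] parts = new Directory[] {
        FSDirectory.getDirectory("/tmp/part0", false),
        FSDirectory.getDirectory("/tmp/part1", false)
    };
    merged.addIndexes(parts); // merges (and optimizes) the partial indexes
    merged.close();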

Mike

www.ardentia.com the home of NetSearch

-----Original Message-----
From: Doron Cohen [mailto:[EMAIL PROTECTED]
Sent: 25 July 2006 22:23
To: java-user@lucene.apache.org
Subject: Re: Index Rows as Documents? Help me design a solution

Few comments -

> (from first posting in this thread)
> The indexing was taking much more than minutes for a 1 MB log file.
...
> I would expect to be able to index at least a GB of logs within 1 or 2
> minutes.

1-2 minutes per GB would be 30-60 GB/hour, which for a single machine/JVM is
a lot - at least I have not seen Lucene index this fast.

> doc.add(new Field("msisdn", columns[0], Field.Store.YES,
Field.Index.TOKENIZED));
> doc.add(new Field("messageid", columns[2], Field.Store.YES,
Field.Index.TOKENIZED));

Is it really required to analyze the text for these fields - "msisdn",
"messageid"?

> doc.add(new Field("line", line, Field.Store.YES, Field.Index.NO));

This is storing the original text of all input lines that are indexed -
quite an overhead.
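
Also, since the stored line already contains every column, the column values
would not need to be stored a second time - assuming search results only
ever need the full row, e.g.:

    doc.add(new Field("msisdn", columns[0], Field.Store.NO, Field.Index.TOKENIZED));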

- Doron

