updating the index created for database search

2004-07-26 Thread lingaraju
Dear All,

I need help updating the index created for the database search.

I created the index with three fields mapping to the three columns of the
database (oid (primary key), title, contents).
Then I created a document for each row and added it to the writer:

doc.add(Field.Keyword("oid", oid + ""));
doc.add(Field.Text("title", title));
doc.add(Field.Text("contents", contents));
writer.addDocument(doc);

Here the search is only on title and contents; oid is the key used to retrieve the
details from the database.

Later, if the contents column in the database is updated, we have to update the
content in the index as well.

If I open the writer with create=false

IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), false);

then all the records are inserted into the index without deleting the old entries,
causing duplication.


If I open the writer with create=true

IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), true);

then the records are inserted only after the entire old index has been deleted.

My questions are:
1) How do I update the existing index?
2) When I fetch the rows from the database in order to update or insert into the
index, how do I know which record has been modified in the database and which
record is not present in the index?

Thanks in advance
Raju


Re: updating the index created for database search

2004-07-26 Thread Daniel Naber
On Monday 26 July 2004 11:37, lingaraju wrote:

 2) When I fetch the rows from the database in order to update or insert into
 the index, how do I know which record has been modified in the database and
 which record is not present in the index?

Your database will need a last modified column. Then you can select those 
rows that have been modified since the last update and for each row check if 
it's in the Lucene index. If it is, delete it there and re-add the new 
version. If it's not, add it. To delete documents you will probably need to 
iterate over all your IDs in the Lucene index and check if they are still in 
the database. If that's too inefficient you could check if you can do it the 
way the file system indexer (IndexHTML in Lucene's demo) does it.
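The bookkeeping Daniel describes can be sketched independently of the Lucene calls. This is a hypothetical helper (the class name, method, and the "oid" field name are made up for illustration); the actual Lucene operations (IndexReader.delete(Term), IndexWriter.addDocument) are only indicated in comments:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not from the thread: given the rows changed since the
// last sync (oid -> last-modified timestamp) and the set of oids currently in
// the Lucene index, decide which documents to re-add and which to add fresh.
public class IndexSync {
    public final List<String> toUpdate = new ArrayList<String>(); // delete + re-add
    public final List<String> toAdd = new ArrayList<String>();    // plain add

    public IndexSync(Map<String, Long> changedRows, Set<String> indexedIds) {
        for (String oid : changedRows.keySet()) {
            if (indexedIds.contains(oid)) {
                // already indexed: reader.delete(new Term("oid", oid)); then re-add
                toUpdate.add(oid);
            } else {
                // not yet indexed: writer.addDocument(makeDoc(oid));
                toAdd.add(oid);
            }
        }
        // Removing index entries whose rows were deleted from the database
        // would be a separate pass over the indexed ids (not shown).
    }
}
```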

BTW, please don't cross-post to both lists.

Regards
 Daniel
 
-- 
Daniel Naber, IntraFind Software AG, Tel. 089-8906 9700


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Logic of score method in hits class

2004-07-26 Thread lingaraju
Dear  All

How does the score method (logic) in the Hits class work?
Even for a 100% match, the score returned is only 69%.

Thanks and regards
Raju


Re: updating the index created for database search

2004-07-26 Thread lingaraju
Dear Daniel

Thanks a lot.
I do have the last-modified column in my database,
but how do I know how many records have been modified?
And if it is a new record, through which class do I check whether that record is
present in the index?
In the meantime I will look into IndexHTML in the Lucene demo.

Regards
Raju




Re: updating the index created for database search

2004-07-26 Thread Daniel Naber
On Monday 26 July 2004 13:31, lingaraju wrote:

 If it is a new record, through which class do I check whether that record is
 present in the index?

Just search for the id with a TermQuery. If you get a hit, the record is in 
the index already.
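Daniel's suggestion can be sketched with the Lucene 1.x API (the class name, the helper method, and the "oid" field name are assumptions for illustration; the round trip in main just demonstrates the check against a small in-memory index):

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class OidLookup {
    // True if a document with this oid is already in the index.
    public static boolean isIndexed(IndexSearcher searcher, String oid) throws IOException {
        Hits hits = searcher.search(new TermQuery(new Term("oid", oid)));
        return hits.length() > 0;
    }

    public static void main(String[] args) throws IOException {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("oid", "42")); // stored, untokenized key field
        writer.addDocument(doc);
        writer.close();
        IndexSearcher searcher = new IndexSearcher(dir);
        System.out.println(isIndexed(searcher, "42")); // present
        System.out.println(isIndexed(searcher, "99")); // absent
    }
}
```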


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: updating the index created for database search

2004-07-26 Thread lingaraju
Dear Daniel

Thanks.
The second part is OK. What about the first part? I mean, how do I know how many
records have been modified?

Regards
Raju




RE: Anyone use MultiSearcher class

2004-07-26 Thread Tea Yu
Mark,

I'm also planning a distributed index system.  After reading some code, I
think it's more efficient to get rid of Hits and work directly with the TopDocs
returned by ParallelMultiSearcher.search().  I don't need the cache anyway, as I
don't need stateful navigation.

Another question is: does each Hits.doc(i) lead to an object
serialization/traffic/deserialization?  Do we need a ValueListHolder to
optimize that?

I also wonder why many search() methods don't throw RemoteException. Any
idea?

Thanks
Tea
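Tea's idea of working directly with TopDocs can be sketched with the Lucene 1.x API (the class and method names, and the "oid" stored field, are assumptions for illustration):

```java
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TopDocs;

public class TopDocsFetch {
    // Fetch the stored "oid" of each of the top n hits in a single search
    // call, bypassing the Hits cache and its repeated re-searches.
    public static String[] topOids(Searcher searcher, Query query, int n) throws IOException {
        TopDocs top = searcher.search(query, null, n);  // one round trip
        int count = Math.min(n, top.scoreDocs.length);
        String[] oids = new String[count];
        for (int i = 0; i < count; i++) {
            Document doc = searcher.doc(top.scoreDocs[i].doc); // stored fields only
            oids[i] = doc.get("oid");
        }
        return oids;
    }
}
```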

 Don, I think I finally understand your problem -- and mine -- with
 MultiSearcher. I had tested an implementation of my system using
 ParallelMultiSearcher to split a huge index over many computers.
 I was very impressed by the results on my test data, but alarmed
 after a trial with live data :)

 Consider MultiSearcher.search(Query Q). Suppose that Q aggregated
 over ALL the Searchables in the MultiSearcher would return 1000
 documents. But, the Hits object created by search() will only cache
 the first 100 documents. When Hits.doc(101) is called, Hits will
 cache 200 documents -- then 400, 800, 1600 and so on. How does Hits
 get these extra documents? By calling the MultiSearcher again.

 Now consider a MultiSearcher as described above with 2 Searchables.
 With respect to Q, Searchable S has 1000 documents, Searchable T
 has zero. So to fetch the 101st document, not only is S searched,
 but T is too, even though the result of Q applied to T is still zero
 and will always be zero. The same thing will happen when fetching
 the 201st, 401st and 801st document.

 This accounts for my slow performance, and I think yours too. That
 your observed degradation is a power of 2 is a clue.

 My performance is especially vulnerable because slave Searchables
 in the MultiSearcher are Remote -- accessed via RMI.

 I guess I have to code smarter around MultiSearcher. One problem
 you highlight is that Hits is final -- so it is not possible even to
 modify the 100/200/400 cache size logic.

 Any ideas from anyone would be much appreciated.

 Mark Florence
 CTO, AIRS
 800-897-7714 x 1703
 [EMAIL PROTECTED]






RE: Anyone use MultiSearcher class

2004-07-26 Thread Don Vaillancourt
Thanks for the info.
Maybe the best solution is to perform multiple individual
searches, create a container class, store all the hits sorted by
relevance within that class, and then cache/serialize this result for the
current search for page-by-page manipulation.


-Original Message-
From: Don Vaillancourt [mailto:[EMAIL PROTECTED]
Sent: Monday, July 12, 2004 12:36 pm
To: Lucene Users List
Subject: Anyone use MultiSearcher class
Hello,
Has anyone used the Multisearcher class?
I have noticed that searching two indexes using this MultiSearcher class
takes 8 times longer than searching only one index.  I could understand if
it took 3 to 4 times longer to search due to sorting the two search results
and stuff, but why 8 times longer.
Is there some optimization that can be done to hasten the search?  Or
should I just write my own MultiSearcher.  The problem though is that there
is no way for me to create my own Hits object (no methods are available and
the class is final).
Anyone have any clue?
Thanks
Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
416-815-2000 ext. 245
email: [EMAIL PROTECTED]
web: http://www.web-impact.com

This email message is intended only for the addressee(s)
and contains information that may be confidential and/or
copyright.  If you are not the intended recipient please
notify the sender by reply email and immediately delete
this email. Use, disclosure or reproduction of this email
by anyone other than the intended recipient(s) is strictly
prohibited. No representation is made that this email or
any attachments are free of viruses. Virus scanning is
recommended and is the responsibility of the recipient.













Matching

2004-07-26 Thread Akmal Sarhan
Hello,

I have documents that only have numeric values (and dates), and I want to
be able to do the following:

Given, e.g., that a document represents a Person,
the fields are age, nr_of_children, last_login_date.

I want to boost those with the oldest age to have a better score, for
example, but in conjunction with other criteria (therefore the new Sort
will not help, I guess).

I cannot set the boost at indexing time because I might want the ones
with fewer children, for example, to have a better score at search time.

What should be done to achieve this kind of search?

thanks





Boosting documents

2004-07-26 Thread Rob Clews
I want to do the same: set a boost for a field containing a date that
lowers as the date gets further from now. Is there any way I could do
this?

Also, when I set a document boost at index time with doc.setBoost(2) and
then retrieve it via doc.getBoost(), I always seem to get 1.0, even
though I can tell from a search that the boost works correctly. I
realise the docs say that the returned value may not be the same as the
indexed value, but should I always get 1? Essentially I'm trying to
allow an administrator to set the boost on the document through my
webapp.

Thanks

On Mon, 2004-07-26 at 17:17 +0200, Akmal Sarhan wrote:
 I want to boost those with the oldest age to have a better score for
 example but in conjunction with other criteria (therefore the new Sort
 will not help I guess)

-- 
Rob Clews
Klear Systems Ltd
t: +44 (0)121 707 8558 e: [EMAIL PROTECTED]





over 300 GB to index: feasability and performance issue

2004-07-26 Thread Vincent Le Maout
Hi everyone,
I have to index a huge, huge amount of data: about 10 million documents
making up about 300 GB. Is there any technical limitation in Lucene that
could prevent me from processing such an amount (I mean, of course, apart
from the external limits induced by the hardware: RAM, disks, the system,
whatever)? If possible, does anyone have an idea of the amount of resources
needed: RAM, CPU time, size of indexes, access time on such a collection?
If not, is it possible to extrapolate an estimate from previous
benchmarks?

Thanks in advance.
Regards.
Vincent Le Maout


RE: Anyone use MultiSearcher class

2004-07-26 Thread Mark Florence
Don, at the low level, the issue isn't necessarily caching results
from page-to-page (as viewed by some UI.) Such a cache would need to
be co-ordinated with index writes.

Rather, I plan to focus on the way Hits first reads 100 hits, then 200,
then 400 and so on -- but all Hits knows about is the MultiSearcher.
This means that in order to find the 101st hit, Hits effectively asks
ALL the searchers in the MultiSearcher to search again -- even though it
could be known that SOME of those searchers are incapable of returning
results.

-- Mark Florence
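The fetch pattern Mark describes can be modeled in a few lines of plain Java (a back-of-the-envelope sketch, not Lucene code; the class and method names are made up). It counts how many times Hits would go back to the MultiSearcher, and therefore to every remote Searchable, before hit number n is available:

```java
// Model of the Hits cache growth: the first search fetches 100 results,
// and each subsequent doc() call past the cached range doubles the fetch
// size (200, 400, 800, ...), re-querying ALL searchers each time.
public class HitsFetchModel {
    public static int roundTrips(int n) {
        int cached = 100;  // initial fetch
        int trips = 1;
        while (cached < n) {
            cached *= 2;   // 200, 400, 800, ...
            trips++;       // another search() against every Searchable
        }
        return trips;
    }
}
```

This makes the power-of-2 degradation visible: reaching the 101st, 201st, 401st... hit each costs one more full round trip to all searchers, including the ones that can never contribute results.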


RE: Anyone use MultiSearcher class

2004-07-26 Thread Don Vaillancourt
Eh Mark,
Are you involved with Lucene development?

RE: Anyone use MultiSearcher class

2004-07-26 Thread Mark Florence
'Fraid not! Just a humble user :)

-- Mark


Re: Logic of score method in hits class

2004-07-26 Thread Doug Cutting
Lucene scores are not percentages.  They really only make sense compared 
to other scores for the same query.  If you like percentages, you can 
divide all scores by the first score and multiply by 100.

Doug
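Doug's rule of thumb can be written as a tiny plain-Java helper (the class and method names are made up; `scores` stands in for the values hits.score(i) returns, which Hits delivers in descending order):

```java
// Normalize relevance scores so the best hit reads as 100%.
public class ScorePercent {
    public static float[] toPercent(float[] scores) {
        float[] out = new float[scores.length];
        if (scores.length == 0) return out;
        float top = scores[0];               // Hits is sorted, first score is highest
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / top * 100f; // relative to the best match
        }
        return out;
    }
}
```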


Re: Boosting documents

2004-07-26 Thread Doug Cutting
Rob Clews wrote:
I want to do the same, set a boost for a field containing a date that
lowers as the date is further from now, is there any way I could do
this?
You could implement Similarity.idf(Term, Searcher) to return, when
Term.field().equals("date"), a value that is greater for more
recent dates.

Doug
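One possible shape for such a weight is a half-life decay. This is a hypothetical sketch (the class, method, and half-life parameter are made up, and it shows only the pure function one might return from an overridden Similarity.idf for terms in the date field, not the Similarity subclass itself):

```java
// Recency weight: 1.0 for today, halving every 'halfLifeDays' days back.
public class RecencyBoost {
    public static float weight(long daysOld, double halfLifeDays) {
        return (float) Math.pow(0.5, daysOld / halfLifeDays);
    }
}
```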


Re: over 300 GB to index: feasability and performance issue

2004-07-26 Thread Doug Cutting
Vincent Le Maout wrote:
I have to index a huge, huge amount of data: about 10 million documents
making up about 300 GB. Is there any technical limitation in Lucene that
could prevent me from processing such an amount (I mean, of course, apart
from the external limits induced by the hardware: RAM, disks, the system,
whatever)?

Lucene is in theory able to support up to 2B documents in a single 
index.  Folks have successfully built indexes with several hundred 
million documents.  10 million should not be a problem.

If possible, does anyone have an idea of the amount of resources
needed: RAM, CPU time, size of indexes, access time on such a collection?
If not, is it possible to extrapolate an estimate from previous benchmarks?

For simple 2-3 term queries, with average-sized documents (~10k of text) 
you should get decent performance (1 second/query) on a 10M-document 
index.  An index typically requires around 35% of the plain-text size.

Doug


Caching of TermDocs

2004-07-26 Thread John Patterson
Is there any way to cache TermDocs?  Is this a good idea?




Re: Caching of TermDocs

2004-07-26 Thread Paul Elschot
On Monday 26 July 2004 21:41, John Patterson wrote:

 Is there any way to cache TermDocs?  Is this a good idea?

Lucene does this internally by buffering
up to 32 document numbers in advance for a query Term.
You can view the details here in case you're interested:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
It uses the TermDocs.read() method to fill a buffer of document numbers.

Is this what you had in mind?

Regards,
Paul
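The batching Paul describes can be modelled without any Lucene types. The real TermScorer fills parallel docs[]/freqs[] arrays via TermDocs.read(), but the essential pattern is just pulling postings in blocks (32 in Lucene's case) rather than one TermDocs.next() call per document. A stand-alone sketch of that pattern:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedPostings {
    // Model of TermScorer's strategy: read document numbers in blocks of
    // up to `batchSize` instead of one call per posting.
    public static List<int[]> readInBatches(int[] postings, int batchSize) {
        List<int[]> batches = new ArrayList<int[]>();
        for (int start = 0; start < postings.length; start += batchSize) {
            int len = Math.min(batchSize, postings.length - start);
            int[] batch = new int[len];
            System.arraycopy(postings, start, batch, 0, len);
            batches.add(batch);
        }
        return batches;
    }
}
```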





Highlighter package updated with overlapping token support

2004-07-26 Thread markharw00d
I have updated the Highlighter code in CVS to support tokenizers that generate 
overlapping tokens.

The JUnit test rig has a new example test that uses a SynonymTokenizer which 
generates multiple tokens in the same position for the same input token, 
e.g. the token "football" is expanded into the tokens "soccer", "footie" 
and "football". 
The Formatter interface had to be changed to take a new TokenGroup object instead of 
a single token, but I doubt any code changes in clients are required, because most 
people use the default Formatter implementation and haven't 
created their own implementations.

Cheers
Mark
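A stand-alone model of the grouping described above (no Lucene types): in Lucene itself, overlap is signalled by a token's position increment of 0, so a token whose increment is 0 shares a position with the previous token — and all tokens sharing a position are what the new TokenGroup hands to the Formatter.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenGroups {
    // tokens[i] is paired with increments[i]: an increment of 0 means the
    // token overlaps the previous one (e.g. a synonym), so it joins the
    // same group; a positive increment starts a new group.
    public static List<List<String>> group(String[] tokens, int[] increments) {
        List<List<String>> groups = new ArrayList<List<String>>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == 0 || increments[i] > 0) {
                groups.add(new ArrayList<String>());
            }
            groups.get(groups.size() - 1).add(tokens[i]);
        }
        return groups;
    }
}
```

With the football example, ("football", "soccer", "footie") end up in one group and the next ordinary token starts a new one.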




Zilverline release candidate 1.0-rc4 available

2004-07-26 Thread Zilverline info
All,
I've just released a new release candidate (*1.0-rc4*). New features
include a Spanish GUI, RTF support, searching on date ranges,
customizable boosting factors, and configurable analyzers per 
collection. Zilverline now generates an MD5 hash per file,
which prevents duplicate files from being added more than once.
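The duplicate-detection step boils down to hashing each file's bytes and skipping any file whose digest has been seen before. A minimal sketch using the standard java.security.MessageDigest API (illustrative, not Zilverline's actual code):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Dedup {
    // Hex-encoded MD5 of a byte array; two files with equal digests can
    // be treated as duplicates and indexed only once.
    public static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }
}
```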

Zilverline supports plugins. You can create your own extractors
for various file formats. I've provided Extractors for RTF, Text, PDF, 
Word, and HTML.

Zilverline supports collections. A collection is a set of files and 
directories in a directory. A collection can be indexed, and searched. 
The results of the search can be retrieved from local disk or remotely, 
if you run a webserver on your machine. Files inside zip, rar and chm 
files are extracted, indexed and can be cached. The cache can be mapped 
to sit behind your webserver as well.

It's also possible to specify your own handlers for archives. If you
have a RAR archive and a program on your system that can
extract its content, you can specify that Zilverline should
use that program.
Zilverline is a free search engine based on Lucene that's ready to
roll and can simply be dropped into a servlet engine. It runs out of the 
box, supports PDF, Word, HTML, text and
CHM, and can index zip, rar, and many other formats,
both on Windows and Linux.

Please take a look at http://www.zilverline.org, and have a swing at it.
cheers,
  Michael Franken



Phrase Query

2004-07-26 Thread Hetan Shah
Hello,
Can someone on the mailing list send me a copy of sample code of how to 
implement the phrase query for my search. Regular Query is working fine, 
but the Phrase Query does not seem to work.

TIA,
-H


Highlighter package updated with overlapping token support

2004-07-26 Thread Karthik N S
Hi
   Mark

 Apologies.


  Could you provide a URL on the main website where users can download the
new version of the Highlighter package (jar/zip format)?

 [ Because some developers may not have access to CVS
 (organization restrictions) to download it from the Lucene sandbox ]



Thx in advance

with regards
Karthik

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 27, 2004 2:28 AM
To: [EMAIL PROTECTED]
Subject: Highlighter package updated with overlapping token support







Re: Phrase Query

2004-07-26 Thread Erik Hatcher
Let's turn it around: could you send us your code that is not 
working?

Lucene's test cases show PhraseQuery in action, and working.
Erik
On Jul 26, 2004, at 4:11 PM, Hetan Shah wrote:
Hello,
Can someone on the mailing list send me a copy of sample code of how 
to implement the phrase query for my search. Regular Query is working 
fine, but the Phrase Query does not seem to work.

TIA,
-H
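For the record, the two usual routes in Lucene are a quoted string through QueryParser (e.g. "some phrase") or an explicit PhraseQuery built with one Term per word. An exact (slop 0) PhraseQuery simply requires the terms to occur at consecutive positions in a document; the stand-alone sketch below (illustrative, no Lucene dependency) demonstrates that check:

```java
import java.util.List;

public class PhraseMatch {
    // True if the words of `phrase` occur consecutively in `tokens`,
    // which is the condition an exact (slop 0) phrase query tests
    // per document.
    public static boolean containsPhrase(List<String> tokens, List<String> phrase) {
        if (phrase.isEmpty()) return true;
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

Note the order sensitivity: "brown fox" matches a document tokenized as quick/brown/fox, but "fox brown" does not — a common source of "Phrase Query does not seem to work" surprises, alongside analyzers that alter or drop terms.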