updating the index created for database search
Dear All,

I need help updating the index created for database search. I created the index with three fields mapping to three columns of the database (oid (primary key), title, contents). Then I created a document for each row and added it to the writer:

    doc.add(Field.Keyword("oid", oid + ""));
    doc.add(Field.Text("title", title));
    doc.add(Field.Text("contents", contents));
    writer.addDocument(doc);

The search is only on title and contents; oid is the key used to retrieve the details from the database. Later, if the contents column in the database is updated, we have to update the content in the index as well.

If I open the writer with create=false:

    IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), false);

then all records are inserted into the index without deleting the old entries, causing duplication. If I open the writer with create=true:

    IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), true);

then records are inserted after the whole old index is deleted.

My questions:
1) How do I update the existing index?
2) When I fetch rows from the database in order to update or insert into the index, how do I know which records have been modified in the database and which records are not yet present in the index?

Thanks in advance
Raju
Re: updating the index created for database search
On Monday 26 July 2004 11:37, lingaraju wrote:

> 2) When I fetch the rows from the database in order to update or insert
> in index how to know which record is modified in database and which
> record is not present is index

Your database will need a last-modified column. Then you can select the rows that have been modified since the last update and, for each row, check whether it is in the Lucene index. If it is, delete it there and re-add the new version. If it is not, add it.

To delete documents that no longer exist in the database, you will probably need to iterate over all your IDs in the Lucene index and check whether they are still in the database. If that is too inefficient, you could check whether you can do it the way the file system indexer (IndexHTML in Lucene's demo) does it.

BTW, please don't cross-post to both lists.

Regards
 Daniel

--
Daniel Naber, IntraFind Software AG, Tel. 089-8906 9700

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
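Daniel's update scheme can be sketched independently of the Lucene API. In the sketch below, plain maps stand in for the database and the index, and all names are hypothetical; in real Lucene 1.4-era code the "is it in the index?" check would be a TermQuery on the oid field, the delete would go through IndexReader.delete(new Term("oid", id)), and the re-add through IndexWriter.addDocument().

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the sync algorithm only: maps stand in for the DB and the index.
public class IndexSync {
    // oid -> content, as currently stored in the search index
    static Map<String, String> index = new HashMap<>();
    // oid -> {content, lastModified}, as stored in the database
    static Map<String, String[]> db = new HashMap<>();

    static void sync(long lastSyncTime) {
        // 1) rows modified since the last sync: delete the old entry (if any)
        //    and re-add the new version
        for (Map.Entry<String, String[]> row : db.entrySet()) {
            long modified = Long.parseLong(row.getValue()[1]);
            if (modified > lastSyncTime) {
                index.remove(row.getKey());                 // IndexReader.delete(...)
                index.put(row.getKey(), row.getValue()[0]); // writer.addDocument(doc)
            }
        }
        // 2) ids still in the index but gone from the database: delete them
        index.keySet().removeIf(oid -> !db.containsKey(oid));
    }
}
```

The second pass handles rows deleted from the database, which is the part Daniel notes may be too inefficient to run on every sync.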
Logic of score method in Hits class
Dear All,

How does the score method (its logic) in the Hits class work? Even for a 100% match, the score returned is only 69%.

Thanks and regards
Raju
Re: updating the index created for database search
Dear Daniel,

Thanks a lot. I do have the last-modified column in my database, but how do I know how many records have been modified? And if it is a new record, through which class do I check whether that record is already present in the index? In the meantime I will look into IndexHTML in the Lucene demo.

Regards
Raju

----- Original Message -----
From: Daniel Naber [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, July 26, 2004 3:35 PM
Subject: Re: updating the index created for database search

> Your database will need a last modified column. Then you can select those
> rows that have been modified since the last update and for each row check
> if it's in the Lucene index. If it is, delete it there and re-add the new
> version. If it's not, add it.
Re: updating the index created for database search
On Monday 26 July 2004 13:31, lingaraju wrote:

> If it is new record through which class we have to check that record is
> present in the index

Just search for the ID with a TermQuery. If you get a hit, the record is in the index already.
Re: updating the index created for database search
Dear Daniel,

Thanks. The second part is OK. What about the first part, I mean: how do I know how many records have been modified?

Regards
Raju

----- Original Message -----
From: Daniel Naber [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Monday, July 26, 2004 5:21 PM
Subject: Re: updating the index created for database search

> Just search for the id with a TermQuery. If you get a hit, the record is
> in the index already.
RE: Anyone use MultiSearcher class
Mark,

I'm also planning a distributed index system. After reading some code, I think it's more efficient to get rid of Hits and work directly with the TopDocs returned by ParallelMultiSearcher.search(); I don't need the cache anyway, as I don't need stateful navigation. Another question: does each Hits.doc(i) call lead to object serialization/traffic/deserialization? Do we need a ValueListHolder to optimize that? I also wonder why many search() methods don't throw RemoteException. Any idea?

Thanks
Tea

> Don, I think I finally understand your problem -- and mine -- with
> MultiSearcher. I had tested an implementation of my system using
> ParallelMultiSearcher to split a huge index over many computers. I was
> very impressed by the results on my test data, but alarmed after a trial
> with live data :)
>
> Consider MultiSearcher.search(Query Q). Suppose that Q, aggregated over
> ALL the Searchables in the MultiSearcher, would return 1000 documents.
> But the Hits object created by search() will only cache the first 100
> documents. When Hits.doc(101) is called, Hits will cache 200 documents --
> then 400, 800, 1600 and so on. How does Hits get these extra documents?
> By calling the MultiSearcher again.
>
> Now consider a MultiSearcher as described above with 2 Searchables. With
> respect to Q, Searchable S has 1000 documents, Searchable T has zero. So
> to fetch the 101st document, not only is S searched, but T is too, even
> though the result of Q applied to T is still zero and will always be
> zero. The same thing will happen when fetching the 201st, 401st and 801st
> document.
>
> This accounts for my slow performance, and I think yours too. That your
> observed degradation is a power of 2 is a clue. My performance is
> especially vulnerable because the slave Searchables in the MultiSearcher
> are Remote -- accessed via RMI.
>
> I guess I have to code smarter around MultiSearcher. One problem you
> highlight is that Hits is final -- so it is not possible even to modify
> the 100/200/400 cache size logic. Any ideas from anyone would be much
> appreciated.
>
> Mark Florence
> CTO, AIRS
> 800-897-7714 x 1703
> [EMAIL PROTECTED]
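The doubling behaviour Mark describes can be counted with a tiny model (a sketch of the 100/200/400 logic as described in this thread, not Lucene code): every time an uncached hit is requested, Hits doubles its fetch target, and every fetch round queries all searchers in the MultiSearcher again.

```java
// Model of the Hits fetch-doubling behaviour described above (not Lucene code).
public class HitsFetchModel {
    // Number of search rounds needed before hit #wanted is cached,
    // counting the initial round that caches 100 hits.
    static int searchRounds(int wanted) {
        int rounds = 1;      // the initial search caches 100 hits
        int cached = 100;
        while (cached < wanted) {
            cached *= 2;     // Hits doubles: 200, 400, 800, 1600 ...
            rounds++;        // ... and each round queries ALL searchers again
        }
        return rounds;
    }

    // Total per-searcher queries issued across a MultiSearcher of n searchers.
    static int totalSearcherCalls(int wanted, int searchers) {
        return searchRounds(wanted) * searchers;
    }
}
```

Paging to the 1000th hit on a two-searcher MultiSearcher thus costs 10 per-searcher queries, even if one of the searchers never matches anything.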
RE: Anyone use MultiSearcher class
Thanks for the info. Maybe the best solution is to perform multiple individual searches, create a container class, store all the hits sorted by relevance within that class, and then cache/serialize this result for the current search, for page-by-page manipulation.

At 09:46 AM 15/07/2004, Mark Florence wrote:
> Don, I think I finally understand your problem -- and mine -- with
> MultiSearcher. [...] One problem you highlight is that Hits is final --
> so it is not possible even to modify the 100/200/400 cache size logic.
> Any ideas from anyone would be much appreciated.

-----Original Message-----
From: Don Vaillancourt [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 12, 2004 12:36 pm
To: Lucene Users List
Subject: Anyone use MultiSearcher class

Hello,

Has anyone used the MultiSearcher class? I have noticed that searching two indexes using this MultiSearcher class takes 8 times longer than searching only one index. I could understand it taking 3 to 4 times longer due to sorting the two search results and such, but why 8 times longer? Is there some optimization that can be done to hasten the search, or should I just write my own MultiSearcher? The problem, though, is that there is no way for me to create my own Hits object (no methods are available and the class is final).

Anyone have any clue?

Thanks

Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
416-815-2000 ext. 245
email: [EMAIL PROTECTED]
web: http://www.web-impact.com

This email message is intended only for the addressee(s) and contains information that may be confidential and/or copyright. If you are not the intended recipient please notify the sender by reply email and immediately delete this email. Use, disclosure or reproduction of this email by anyone other than the intended recipient(s) is strictly prohibited. No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.
Matching
Hallo,

I have documents that contain only numeric values (and dates), and I want to be able to do the following: given, e.g., that a document represents a Person with the fields age, nr_of_children, last_login_date, I want to boost those with the oldest age to get a better score, but in conjunction with other criteria (therefore the new Sort will not help, I guess). I cannot set the boost at indexing time, because at search time I might instead want the ones with fewer children to score better. What should be done to achieve this kind of search?

Thanks
Boosting documents
I want to do the same: set a boost for a field containing a date that lowers as the date gets further from now. Is there any way I could do this?

Also, when I set a document boost at index time with doc.setBoost(2) and then retrieve it via doc.getBoost(), I always seem to get 1.0, even though I can tell from a search that the boost works correctly. I realise the docs say that the returned value may not be the same as the indexed value, but should I always get 1? Essentially I'm trying to allow an administrator to set the boost on a document through my webapp.

Thanks

On Mon, 2004-07-26 at 17:17 +0200, Akmal Sarhan wrote:
> I want to boost those with the oldest age to have a better score for
> example but in conjunction with other criteria (therefore the new Sort
> will not help I guess)

--
Rob Clews
Klear Systems Ltd
t: +44 (0)121 707 8558
e: [EMAIL PROTECTED]
over 300 GB to index: feasibility and performance issue
Hi everyone,

I have to index a huge, huge amount of data: about 10 million documents making up about 300 GB. Is there any technical limitation in Lucene that could prevent me from processing such an amount (apart, of course, from the external limits imposed by the hardware: RAM, disks, the system, whatever)? If possible, does anyone have an idea of the amount of resources needed: RAM, CPU time, size of the indexes, access time on such a collection? If not, is it possible to extrapolate an estimate from previous benchmarks?

Thanks in advance.
Regards.

Vincent Le Maout
RE: Anyone use MultiSearcher class
Don, at the low level the issue isn't necessarily caching results from page to page (as viewed by some UI); such a cache would need to be co-ordinated with index writes. Rather, I plan to focus on the way Hits first reads 100 hits, then 200, then 400 and so on -- but all Hits knows about is the MultiSearcher. This means that in order to find the 101st hit, Hits effectively asks ALL the searchers in the MultiSearcher to search again -- even though it could be known that SOME of those searchers are incapable of returning results.

-- Mark Florence

-----Original Message-----
From: Don Vaillancourt [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 26, 2004 11:06 am
To: Lucene Users List
Subject: RE: Anyone use MultiSearcher class

> Thanks for the info. Maybe the best solution to this may be to perform
> multiple individual searches, create a container class and store all the
> hits sorted by relevance within that class and then cache/serialize this
> result for the current search for page by page manipulation.
RE: Anyone use MultiSearcher class
Eh Mark, are you involved with Lucene development?

At 11:39 AM 26/07/2004, you wrote:
> Rather, I plan to focus on the way Hits first reads 100 hits, then 200,
> then 400 and so on -- but all Hits knows about is the MultiSearcher.
> This means that in order to find the 101st hit, Hits effectively asks
> ALL the searchers in the MultiSearcher to search again -- even though it
> could be known that SOME of those searchers are incapable of returning
> results.
RE: Anyone use MultiSearcher class
'Fraid not! Just a humble user :)

-- Mark

-----Original Message-----
From: Don Vaillancourt [mailto:[EMAIL PROTECTED]]
Sent: Monday, July 26, 2004 12:14 pm
To: Lucene Users List
Subject: RE: Anyone use MultiSearcher class

> Eh Mark, are you involved with Lucene development?
Re: Logic of score method in Hits class
Lucene scores are not percentages. They really only make sense compared to other scores for the same query. If you like percentages, you can divide all scores by the first score and multiply by 100.

Doug

lingaraju wrote:
> How the score method works(logic) in Hits class
> For 100% match also score is returning only 69%
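Doug's suggestion is plain arithmetic. A sketch (assuming, as with Hits, that scores come back sorted in descending order, so the first score is the maximum):

```java
// Turn raw Lucene-style scores into percentages relative to the top hit.
// Assumes the scores arrive sorted in descending order, as Hits returns them.
public class ScorePercent {
    static double[] toPercent(double[] scores) {
        double[] pct = new double[scores.length];
        double top = scores[0];                   // best score for this query
        for (int i = 0; i < scores.length; i++) {
            pct[i] = scores[i] / top * 100.0;     // first hit becomes 100%
        }
        return pct;
    }
}
```

So a top score of 0.69 simply becomes "100%", and every other hit is scaled against it.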
Re: Boosting documents
Rob Clews wrote:
> I want to do the same, set a boost for a field containing a date that
> lowers as the date is further from now, is there any way I could do this?

You could implement Similarity.idf(Term, Searcher) to return, when term.field().equals("date"), a value that is greater for more recent dates.

Doug
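The shape of such a recency factor might look like the following. This is a hypothetical decay formula for illustration only; the half-life constant is an assumption, and the wiring into an actual Similarity subclass is left out:

```java
// Hypothetical recency factor: 1.0 for "now", halving every halfLifeDays.
// A Similarity.idf(Term, Searcher) override could return a value shaped
// like this for terms in the "date" field.
public class RecencyBoost {
    static double factor(double ageDays, double halfLifeDays) {
        return Math.pow(0.5, ageDays / halfLifeDays);
    }
}
```

With a 30-day half-life, a month-old document contributes half the factor of a fresh one, and scores fall off smoothly rather than by hard date buckets.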
Re: over 300 GB to index: feasibility and performance issue
Vincent Le Maout wrote:
> I have to index a huge, huge amount of data: about 10 million documents
> making up about 300 GB. Is there any technical limitation in Lucene that
> could prevent me from processing such amount?

Lucene is in theory able to support up to 2 billion documents in a single index. Folks have successfully built indexes with several hundred million documents. 10 million should not be a problem.

> If possible, does anyone have an idea of the amount of resource needed:
> RAM, CPU time, size of indexes, access time on such a collection?

For simple 2-3 term queries over average-sized documents (~10k of text), you should get decent performance (under a second per query) on a 10M document index. An index typically requires around 35% of the plain-text size.

Doug
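Doug's 35% figure gives a quick back-of-envelope estimate for Vincent's collection (a rule of thumb only; the real ratio depends on stored fields, term vectors, and so on):

```java
// Back-of-envelope index sizing from the ~35% rule of thumb above.
public class IndexSizing {
    static double indexSizeGb(double plainTextGb, double ratio) {
        return plainTextGb * ratio;
    }
}
```

For 300 GB of plain text, this suggests an index on the order of 105 GB.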
Caching of TermDocs
Is there any way to cache TermDocs? Is this a good idea?
Re: Caching of TermDocs
On Monday 26 July 2004 21:41, John Patterson wrote:
> Is there any way to cache TermDocs? Is this a good idea?

Lucene does this internally by buffering up to 32 document numbers in advance for a query term. You can view the details here in case you're interested:
http://cvs.apache.org/viewcvs.cgi/jakarta-lucene/src/java/org/apache/lucene/search/TermScorer.java
It uses the TermDocs.read() method to fill a buffer of document numbers. Is this what you had in mind?

Regards, Paul
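The buffering Paul describes amounts to batched reads: pulling postings 32 at a time instead of one by one. A sketch of the pattern (a model of the buffered-read idea, not the actual TermScorer code):

```java
// Sketch of batched postings reads, mirroring the 32-entry buffer that
// TermScorer fills via TermDocs.read().
public class BatchedReads {
    // How many read() calls are needed to consume `totalDocs` postings
    // with a buffer of `bufferSize` document numbers.
    static int readCalls(int totalDocs, int bufferSize) {
        return (totalDocs + bufferSize - 1) / bufferSize; // ceiling division
    }
}
```

A term matching 100 documents therefore costs 4 buffer refills rather than 100 individual next() calls, which is the point of the internal cache.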
Highlighter package updated with overlapping token support
I have updated the Highlighter code in CVS to support tokenizers that generate overlapping tokens. The JUnit test rig has a new example test that uses a SynonymTokenizer, which generates multiple tokens in the same position for the same input token (e.g. the token "football" is expanded into the tokens "soccer", "footie" and "football"). The Formatter interface had to be changed to take a new TokenGroup object instead of a single token, but I doubt any code changes are required in clients, because most people use the default Formatter implementation and haven't created their own implementations.

Cheers
Mark
Zilverline release candidate 1.0-rc4 available
All,

I've just released a new candidate (*1.0-rc4*). New features include a Spanish GUI, RTF support, searching on date ranges, customizable boosting factors, and configurable analyzers per collection. Zilverline now generates an MD5 hash per file and prevents duplicate files from being added more than once.

Zilverline supports plugins: you can create your own extractors for various file formats. I've provided extractors for RTF, text, PDF, Word, and HTML.

Zilverline supports collections. A collection is a set of files and directories in a directory. A collection can be indexed and searched, and the results of a search can be retrieved from local disk or remotely, if you run a webserver on your machine. Files inside zip, rar and chm files are extracted, indexed, and can be cached; the cache can be mapped to sit behind your webserver as well. It's also possible to specify your own handlers for archives: say you have a RAR archive and a program on your system that can extract its content, then you can specify that Zilverline should use this program.

Zilverline is a free search engine based on Lucene that's ready to roll and can simply be dropped into a servlet engine. It runs out of the box and supports PDF, Word, HTML, TXT, and CHM, and can index zip, rar, and many other formats, both on Windows and Linux.

Please take a look at http://www.zilverline.org and have a swing at it.

cheers,

Michael Franken
Phrase Query
Hello,

Can someone on the mailing list send me sample code showing how to implement a phrase query for my search? A regular Query is working fine, but the PhraseQuery does not seem to work.

TIA,
-H
Highlighter package updated with overlapping token support
Hi Mark,

Apologies. Can you provide a URL on your main website page for users to download the new version of the Highlighter package (jar/zip format)? Some developers may not have access to CVS downloads from the Lucene sandbox because of organizational restrictions.

Thanks in advance,
with regards
Karthik

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, July 27, 2004 2:28 AM
To: [EMAIL PROTECTED]
Subject: Highlighter package updated with overlapping token support

> I have updated the Highlighter code in CVS to support tokenizers that
> generate overlapping tokens.
Re: Phrase Query
Let's turn it around: could you send us your code that is not working? Lucene's test cases show PhraseQuery in action, and working.

Erik

On Jul 26, 2004, at 4:11 PM, Hetan Shah wrote:
> Can someone on the mailing list send me a copy of sample code of how to
> implement the phrase query for my search. Regular Query is working fine,
> but the Phrase Query does not seem to work.
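For what it's worth, the contract an exact PhraseQuery enforces can be modelled without Lucene: with zero slop, the phrase terms must occur in consecutive positions, in order, in the analyzed token stream of the field. The sketch below is illustrative only; in practice, a mismatch between how the field was analyzed at index time and how the query terms are produced is a common reason a phrase query "does not work".

```java
import java.util.Arrays;
import java.util.List;

// Model of an exact (slop 0) phrase match: the phrase terms must appear
// consecutively and in order in the field's token stream.
public class PhraseMatch {
    static boolean matches(List<String> tokens, List<String> phrase) {
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            if (tokens.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }
}
```

So "brown fox" matches a field tokenized as [quick, brown, fox, jumps], while "fox brown" and "quick fox" do not, which is exactly the distinction a PhraseQuery draws.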