Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
1) The index should be optimized (reduced to a single segment) to make searching faster.

2) A large result set is a common cause of slow searching. Do you retrieve any stored fields of the returned hits? Note: the search itself only collects document IDs, but retrieving any stored field triggers retrieval of the whole document, which greatly increases the time needed to process a large result set. So splitting the returned results into pages and retrieving any stored info _only_for_the_current_page_ makes searching much faster. It's also a good idea to store the returned results (IDs and scores, or only IDs) in an array and cache it between requests. Documents can then be retrieved with an $index->getDocument($id) call.

With best regards, Alexander Veremyev.

Simon Gelfand wrote: Hi Craig, You can see a test here with 130,000 articles indexed. I am getting slow searching - 5-6 seconds. I have added paging + a max of 250 hits displayed + memory caching to speed up browsing after an initial search. Here is an example: http://www.articlesbase.com/test-search.php?q=business+consulting+firms (cached) http://www.articlesbase.com/test-search.php?q=business+ (not cached) Any ideas for speeding up the search itself? For example, limiting the number of results while searching (at the moment it returns some 30,000 results, which I slice), setting a minimum score for a result, and so on? Simon
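Alexander's paging-plus-caching advice above could be sketched roughly as follows. This is a minimal illustration only, assuming a Zend Framework 1.x style index; the index path, the file-based cache, and the 'title' stored field are all hypothetical:

```php
<?php
// Sketch of the suggested pattern: run the search once, cache only the
// lightweight id/score pairs, and load stored fields only for the hits
// on the current page. Assumes ZF 1.x; field/path names are made up.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index');

$query    = $_GET['q'];
$cacheKey = '/tmp/search_' . md5($query) . '.cache';

if (is_file($cacheKey)) {
    // Cached between requests: only IDs and scores, no documents.
    $results = unserialize(file_get_contents($cacheKey));
} else {
    $results = array();
    foreach ($index->find($query) as $hit) {
        // Reading only ->id and ->score does NOT load the stored document.
        $results[] = array('id' => $hit->id, 'score' => $hit->score);
    }
    file_put_contents($cacheKey, serialize($results));
}

// Retrieve full documents only for the current page.
$page    = isset($_GET['page']) ? max(1, (int) $_GET['page']) : 1;
$perPage = 25;
foreach (array_slice($results, ($page - 1) * $perPage, $perPage) as $r) {
    $doc = $index->getDocument($r['id']);
    echo $doc->title, "\n";   // 'title' is a hypothetical stored field
}
```

A shared memory cache (e.g. memcached, as Simon mentions) would work the same way; the key point is that only the current page's documents are ever fully retrieved.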
Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hi Alexander, 1. I optimized using Luke 0.6 a couple of days ago, so there is one segment (183 MB). 2. The search takes 5 seconds before I display any results, on just this line: $hits = $index->find($query); And it returns a ton of data, not just the documents' IDs. Here: http://www.articlesbase.com/test-search2.php?q=business+consulting Is there a way to limit the number of results returned, or to set a minimum score? PS. I also need to set ini_set('memory_limit', '300M'); for the script to even run. Thanks, Simon -- View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10606551 Sent from the Zend Framework mailing list archive at Nabble.com.
Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
webshark27 wrote: Hi Alexander, 1. I optimized using Luke 0.6 - so there is 1 segment (183 MB). 2. The search takes 5 seconds before I display any results, just this line: $hits = $index->find($query); And it returns a ton of data, not just the documents' IDs.

One note: $index->find($query) actually returns only IDs and scores. It's an array of QueryHit objects. A QueryHit object initially contains only the ID and Score fields, but it automatically retrieves the document from the index when any stored field is accessed via a QueryHit property.

webshark27 wrote: Is there a way to limit the number of results returned, or to set a minimum score?

Zend_Search_Lucene needs to calculate all scores before it can limit search results by score, so that wouldn't help. Apache Lucene has a special weight implementation which returns results in document ID order; it could help limit the result set, but it's not implemented in Zend_Search_Lucene yet.

webshark27 wrote: PS. I also need to set ini_set('memory_limit', '300M');

Zend_Search_Lucene preloads the terms dictionary index (usually every 128th term) and stores it in memory. It looks like you have a very large terms dictionary, which may be produced by large or non-tokenized unique indexed fields. Could I ask you to put your index (tarball or zip) somewhere for download so I can play with it? With best regards, Alexander Veremyev.
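Given that all hits must be scored anyway, any "max results" or "minimum score" cutoff has to be applied client-side after find() returns. A rough sketch (ZF 1.x assumed; the cutoff values are arbitrary examples, not recommendations):

```php
<?php
// Post-filtering sketch: since Zend_Search_Lucene scores every matching
// document, a score/count limit can only be applied after find() - but
// as long as only ->id and ->score are read, no stored fields are
// loaded, so the filtering itself stays cheap.
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index');
$hits  = $index->find('business consulting');

$minScore   = 0.25;   // arbitrary cutoff, tune for your corpus
$maxResults = 250;

$kept = array();
foreach ($hits as $hit) {
    if ($hit->score < $minScore) {
        continue;              // cheap: no document retrieval happens here
    }
    $kept[] = $hit->id;        // keep only the lightweight ID
    if (count($kept) >= $maxResults) {
        break;
    }
}
```

This trims the working set before any stored fields are touched, which is where the real cost lies.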
Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hi Craig, You can see a test here with 130,000 articles indexed. I am getting slow searching - 5-6 seconds. I have added paging + a max of 250 hits displayed + memory caching to speed up browsing after an initial search. Here is an example: http://www.articlesbase.com/test-search.php?q=business+consulting+firms (cached) http://www.articlesbase.com/test-search.php?q=business+ (not cached) Any ideas for speeding up the search itself? For example, limiting the number of results while searching (at the moment it returns some 30,000 results, which I slice), setting a minimum score for a result, and so on? Simon

On 5/9/07, Craig Slusher [EMAIL PROTECTED] wrote: webshark27, When you get your articles indexed, it would be really great if you could share your experience with searching against it. I would love to know how well the Zend implementation of Lucene handles the load.

-- Simon Gelfand - http://www.articlesbase.com http://www.reader.co.il http://www.articuloz.com http://www.rusarticles.com http://www.tripslog.com http://www.simongelfand.com
Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
webshark27, When you get your articles indexed, it would be really great if you could share your experience with searching against it. I would love to know how well the Zend implementation of Lucene handles the load.

On 5/8/07, webshark27 [EMAIL PROTECTED] wrote: Hi Chris, Thanks for the quick response. Doesn't the $doc = new Zend_Search_Lucene_Document(); just overwrite the old one?

-- Craig Slusher [EMAIL PROTECTED]
[fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hi Chris, Thanks for the quick response. Doesn't the $doc = new Zend_Search_Lucene_Document(); call just overwrite the old one? Also, I think the $index->addDocument($doc) call is filling up memory fast; I don't know exactly how playing with MergeFactor, MaxMergeDocs and MaxBufferedDocs affects this issue. I am indexing 10,000 articles each run, then committing the changes, loading the script again, and running it again.

Chris Blaise wrote: It's been a few months since I worked with this, but I had some weird errors that I never fully diagnosed - they may have been due to running out of memory, or to some strange corruption that caused the script to exit. The fix to my problem was to free memory. In your case, try setting $doc to null when you're finished with it in the loop, right after the $index->addDocument($doc) call. Chris

-- View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10385215 Sent from the Zend Framework mailing list archive at Nabble.com.
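The batch-indexing pattern discussed in this thread (index a slice, free each document, commit, rerun) could look roughly like this. A sketch only, assuming ZF 1.x: getArticles() is a hypothetical loader, and the paths, field names, and tuning values are illustrative:

```php
<?php
// Batch-indexing sketch: each run indexes one slice of articles,
// releases each document right after addDocument() (Chris's tip),
// and commits once at the end. All names here are assumptions.
require_once 'Zend/Search/Lucene.php';

$indexPath = '/path/to/index';
$index = is_dir($indexPath)
       ? Zend_Search_Lucene::open($indexPath)
       : Zend_Search_Lucene::create($indexPath);

// Fewer in-memory buffered docs trades indexing speed for a smaller
// memory footprint; these are the knobs webshark27 asks about.
$index->setMaxBufferedDocs(100);
$index->setMergeFactor(10);

$offset = isset($argv[1]) ? (int) $argv[1] : 0;
foreach (getArticles($offset, 10000) as $article) {   // hypothetical loader
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $article['title']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $article['body']));
    $index->addDocument($doc);
    unset($doc);   // release the document once it has been indexed
}

$index->commit();
```

Running it as `php indexer.php 0`, `php indexer.php 10000`, and so on reproduces the restart-per-slice approach from the thread without holding all documents in one process.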