Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hi Alexander,

1. I optimized the index with Luke 0.6 a couple of days ago, so there is now 1 segment (183 MB).

2. The search takes 5 seconds before I display any results, on this line alone:

    $hits = $index->find($query);

It returns a ton of data, not just the documents' IDs. See:
http://www.articlesbase.com/test-search2.php?q=business+consulting

Is there a way to limit the number of results returned, or to set a minimum score?

PS. I also need to set ini_set("memory_limit","300M"); for the script to run at all.

Thanks,
Simon

Alexander Veremyev wrote:
> 1) The index should be optimized (have only one segment) to make
> searching faster.
>
> 2) A large search result is a cause of slow searching.
> Do you retrieve any stored field of the returned hits?
>
> Note:
> The search itself only collects documents' IDs, but retrieving any stored
> field causes the full document to be retrieved. This greatly increases the
> time needed to retrieve a large result set.
> So splitting the returned result into pages and retrieving stored info
> _only_for_current_page_ makes searching much faster.
>
> It is also a good idea to store the returned result (IDs and scores, or
> only IDs) in an array and cache it between requests.
> Documents can be retrieved with a $index->getDocument($id) call.
>
> With best regards,
> Alexander Veremyev.

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10606551
Sent from the Zend Framework mailing list archive at Nabble.com.
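Alexander's advice in the message above (keep only IDs and scores, and load stored fields for the current page only) could be sketched roughly as follows. This is an illustration, not code from the thread: summarizeHits() and pageOf() are hypothetical helper names, the hits are faked with plain objects so the sketch runs without an index, and it assumes that each real hit exposes public id and score properties, as Zend_Search_Lucene_Search_QueryHit does.

```php
<?php
/**
 * Reduce hit objects to plain (id, score) pairs that are cheap to cache.
 * Assumption: every hit has public `id` and `score` properties.
 */
function summarizeHits(array $hits)
{
    $summary = array();
    foreach ($hits as $hit) {
        // Only id and score are read, so no stored field is touched.
        $summary[] = array('id' => $hit->id, 'score' => $hit->score);
    }
    return $summary;
}

/** Return the entries that belong to a 1-based page number. */
function pageOf(array $summary, $page, $perPage = 10)
{
    $offset = (max(1, (int) $page) - 1) * $perPage;
    return array_slice($summary, $offset, $perPage);
}

// Stand-in data: 25 fake hits. With a real index this would be
// $summary = summarizeHits($index->find($query));
$fakeHits = array();
for ($i = 0; $i < 25; $i++) {
    $fakeHits[] = (object) array('id' => $i, 'score' => 1.0 - $i / 100);
}

$summary = summarizeHits($fakeHits);
$page3   = pageOf($summary, 3); // the last, partly filled page
```

With a real index, the loop over $page3 is where $index->getDocument($entry['id']) would be called, so that full documents are loaded for at most $perPage hits per request. Where $summary is cached between requests (session, file, memory cache) is left open here.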
Re: [fw-general] Zend_Search_Lucene
Hi Andries,

That is normal behavior on a Windows machine. You don't need the dot; just remember that the path is relative to the drive you are running the script on.

Simon

Andries Seutens wrote:
> Hello,
>
> There is no broken line or space.
>
> Best,
> Andriesss
>
> Patrycjusz Szydło wrote:
>> Look at the last line of your code; there could be a broken line or a space.
>>
>> Best,
>> patS
>>
>> Andries Seutens wrote:
>>> Hello,
>>>
>>> I am not sure if this is a bug or a feature ;) :
>>>
>>> My PHP version: 5.2.0
>>> Operating system: Windows XP - Home edition SP2
>>>
>>> My code:
>>>
>>> ---
>>> require_once 'Zend/Search/Lucene.php';
>>>
>>> $index = Zend_Search_Lucene::create('./data/my-index'); // mind the '.' in front of the path
>>>
>>> $doc = new Zend_Search_Lucene_Document();
>>> $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', time()));
>>> $doc->addField(Zend_Search_Lucene_Field::Text('annotation', 'Document annotation text'));
>>>
>>> $index->addDocument($doc);
>>> ---
>>>
>>> This throws: Fatal error: Exception thrown without a stack frame in Unknown on line 0
>>>
>>> Removing the dot in front of the path resolves the issue. I'm not
>>> sure if this is normal behaviour?
>>>
>>> Best regards,
>
> --
> Andries Seutens
> http://andries.systray.be

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene-tf3721654s16154.html#a10436026
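One way to sidestep the question of what a relative index path is resolved against is to build an absolute path from the script's own location. This is only a sketch: the data/my-index layout is taken from Andries's example, $indexPath is a name chosen here, and nobody in the thread has confirmed that this removes the error.

```php
<?php
// Build an absolute path next to this script, so the result does not
// depend on the current working directory or the current drive.
$indexPath = dirname(__FILE__)
    . DIRECTORY_SEPARATOR . 'data'
    . DIRECTORY_SEPARATOR . 'my-index';

// With the library loaded, the path would be passed on as in the
// original code: $index = Zend_Search_Lucene::create($indexPath);
echo $indexPath, PHP_EOL;
```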
Re: [fw-general] Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
I ran Lucene Index Toolbox as suggested on
http://framework.zend.com/manual/en/zend.search.index-creation.html#zend.search.index-creation.document-updating

After running an optimization there is no longer a .cfs file or a segments file. When I try to search I get:

    object(Zend_Search_Lucene_Exception)#5 (6) {
      ["message:protected"]=> string(127) "fopen(c:/tmp/ab_index/segments) [function.fopen]: failed to open stream: No such file or directory"
      ["string:private"]=> string(0) ""
      ["code:protected"]=> int(0)
      ["file:protected"]=> string(73) "C:\Inetpub\ArticlesBaseNew\Zend\Search\Lucene\Storage\File\Filesystem.php"
      ["line:protected"]=> int(63)
      ["trace:private"]=> array(4) {

because there is no file of that name. There is, however, a file called segments.gen or segments_f.

Any ideas?

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10434181
Re: [fw-general] Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hi,

I am running with MaxMergeDocs set to 1500 and the rest at their defaults (10 and 10).

The script runs out of memory online when I try to force an optimize, but it still indexes the articles. I currently have around 90 .cfs files.

But now, whenever I try to search, I always get this (a var dump of the exception):

    Object(Zend_Search_Lucene_Exception)#4 (6) {
      ["message:protected"]=> string(156) "fopen(/path to the file/ab_index/index.lock) [function.fopen]: failed to open stream: Permission denied"
      ["string:private"]=> string(0) ""
      ["code:protected"]=> int(0)
      ["file:protected"]=> string(100) "/path to the file/public_html/Zend/Search/Lucene/Storage/File/Filesystem.php"
      ["line:protected"]=>

Thanks,
Simon

Alexander Veremyev wrote:
> Hi,
>
> Zend_Search_Lucene uses memory for:
> 1. the preloaded term dictionary index for each index segment;
> So a large number of segments increases memory usage.
> Segments may be merged into one with the Zend_Search_Lucene::optimize()
> method.
> Segments are also partially auto-merged by the auto-optimization process.
> Auto-optimization behavior depends on the MergeFactor and MaxMergeDocs
> parameters.
>
> 2. buffered docs (documents which are indexed, but not yet dumped into a
> new segment);
> When the number of buffered docs reaches the MaxBufferedDocs parameter, a
> new segment is dumped to disk. This frees the memory used for buffered docs.
>
> Did you change the MergeFactor, MaxMergeDocs or MaxBufferedDocs parameters?
> Or did you use the default settings?
>
> How many segments (number of .cfs files in the index directory) do
> you have when the script crashes?
>
> With best regards,
> Alexander Veremyev.

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10403823
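The three parameters Alexander lists in the message above are set on the index object before indexing starts. The sketch below assumes the setMaxBufferedDocs(), setMaxMergeDocs() and setMergeFactor() setters of Zend_Search_Lucene. tuneIndex() and RecordingIndex are hypothetical names used here so the example runs without the library, and the values are simply the ones Simon reports, not recommendations.

```php
<?php
/**
 * Apply the three tuning parameters to an index object.
 * Assumption: $index offers setMaxBufferedDocs(), setMaxMergeDocs() and
 * setMergeFactor(), as Zend_Search_Lucene does.
 */
function tuneIndex($index, array $settings)
{
    // Documents held in memory before a new segment is written to disk.
    $index->setMaxBufferedDocs($settings['maxBufferedDocs']);
    // Upper bound on the size of segments produced by auto-merging.
    $index->setMaxMergeDocs($settings['maxMergeDocs']);
    // How often segments are merged by auto-optimization.
    $index->setMergeFactor($settings['mergeFactor']);
    return $index;
}

/** Stand-in that records the settings, so the sketch runs on its own. */
class RecordingIndex
{
    public $applied = array();
    public function setMaxBufferedDocs($n) { $this->applied['maxBufferedDocs'] = $n; }
    public function setMaxMergeDocs($n)    { $this->applied['maxMergeDocs'] = $n; }
    public function setMergeFactor($n)     { $this->applied['mergeFactor'] = $n; }
}

$index = tuneIndex(new RecordingIndex(), array(
    'maxBufferedDocs' => 10,   // default, per Simon's message
    'maxMergeDocs'    => 1500, // Simon's setting
    'mergeFactor'     => 10,   // default, per Simon's message
));
```

Going by Alexander's description, a lower MaxBufferedDocs should reduce the memory held by buffered documents at the cost of more segments, while the segment count itself drives the memory used for term dictionary indexes. Which side dominates for a given data set would have to be measured.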
[fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hi Chris,

Thanks for the quick response.

Doesn't "$doc = new Zend_Search_Lucene_Document();" just overwrite the old one?

Also, I think $index->addDocument($doc) is filling up the memory fast. I don't know exactly how MergeFactor, MaxMergeDocs and MaxBufferedDocs affect this issue.

I am running 10,000 articles each time, then committing the changes and loading the script again to keep running.

Chris Blaise wrote:
> It's been a few months since I worked with this, but I had some weird
> errors, and I never figured out whether they were due to running out of
> memory or to some weird corruption I was seeing that caused the script
> to exit.
>
> The fix to my problem was to free memory. In your case, try setting
> $doc to null when you're finished with it in the loop, right after the
> $index->addDocument($doc).
>
> Chris

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10385215
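The two ideas in this exchange (Chris's "set $doc to null after addDocument()" and Simon's "commit after every batch") can be combined in one loop, roughly as below. This is a sketch under assumptions: indexInBatches() and CountingIndex are hypothetical names, the stand-in index only counts calls so the example runs without the library, and whether nulling $doc actually lowers peak memory in Zend_Search_Lucene is not confirmed anywhere in the thread.

```php
<?php
/**
 * Add one document per row and commit after every $batchSize documents.
 * Assumption: $index offers addDocument() and commit(), as
 * Zend_Search_Lucene does; $makeDoc builds a document from a row.
 */
function indexInBatches($rows, $index, $makeDoc, $batchSize = 1000)
{
    $count = 0;
    foreach ($rows as $row) {
        $doc = call_user_func($makeDoc, $row);
        $index->addDocument($doc);
        $doc = null; // Chris's suggestion: drop the reference right away
        $count++;
        if ($count % $batchSize === 0) {
            $index->commit(); // flush what has been added so far
        }
    }
    $index->commit(); // flush the last, possibly partial batch
    return $count;
}

/** Stand-in index that only counts calls, so the sketch runs on its own. */
class CountingIndex
{
    public $added = 0;
    public $commits = 0;
    public function addDocument($doc) { $this->added++; }
    public function commit() { $this->commits++; }
}

$index = new CountingIndex();
$total = indexInBatches(range(1, 2500), $index, function ($row) {
    return array('id' => $row); // placeholder for a real document
}, 1000);
```

In a real script, $makeDoc would build a Zend_Search_Lucene_Document with the addField() calls shown elsewhere in the thread, and $rows would come from the database in chunks rather than from range().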
[fw-general] Zend_Search_Lucene - Best Practices for Indexing 100k+ articles
Hello,

I am trying to create a new index. I have over 130,000 full-text articles in a MySQL database, ranging from 300 to 1000 words.

I am trying to figure out the best practice for creating the index, as I am running into "max memory exhausted" errors when I get to around 15,000 articles.

I read http://framework.zend.com/manual/en/zend.search.index-creation.html

So I am basically doing this:

    addField(Zend_Search_Lucene_Field::Text('title', sanitize($title)));
    $doc->addField(Zend_Search_Lucene_Field::Text('author', sanitize($pname)));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', sanitize($content)));
    $index->addDocument($doc);
    }
    $index->commit();
    ?>

This crashes with a memory exhausted error for the PHP script (I have the limit set at 40 MB).

I am trying to figure out what is being loaded into memory, and what would be the best way to run a script for a few hours that will index my whole DB from start to end.

Any help would be appreciated.

--
View this message in context: http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10383911