I have a MySQL table of about 500K rows, roughly 300MB in size, that I've been trying to index into Zend Lucene. It worked perfectly fine when adding small batches at a time, but if I tried to index more than 2000 documents at a time, the script would do strange things after indexing around 2500 of them. It would end prematurely, and I would end up with some documents indexed twice.
I stripped the script down to the most basic form that still reproduced the problem, and realized what was happening. First of all, here is the code:

    require_once('Zend/Search/Lucene.php');

    $index = Zend_Search_Lucene::open('lucene/videos');

    // Remove PHP's execution time limit for this run
    set_time_limit(0);

    // Add identical test documents; nothing is read from the DB
    for ($x = 0; $x < 200000; $x++) {
        $doc = new Zend_Search_Lucene_Document();
        $doc->addField(Zend_Search_Lucene_Field::Keyword('field', 'value'));
        $index->addDocument($doc);
    }

As you can see, I'm not retrieving anything from the DB anymore; I'm just trying to see how many test documents I can insert before the script fails. The script ran fine with $x<100000. When I upped it to 200000, the problem occurred.

After some testing I realized that the issue wasn't the number of documents, it was time. I added a line at the beginning of the script to insert a timestamped row into a test table every time the script started running, and discovered that if the script executed for more than a certain amount of time, it would start over again from the beginning. Actually, it wouldn't just start over: another instance of the script would start up and run alongside the original instance. Sometime after that, another would start, and another, and then all the scripts would fizzle out for unknown reasons, sometimes giving an error.

So I would run the script, check the test table, and see anywhere from 2 to 5 rows, indicating that the script had started that many times. While the number of documents it would index before starting over varied, the TIME stayed the same. The first restart would always come EXACTLY 5 minutes and 4 seconds after the initial start of the script. I know this from comparing the timestamps of the two table rows. Each successive row would then be almost exactly 3 minutes after the previous one, but sometimes 2 minutes.
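
A minimal sketch of that kind of start-time check, assuming a log file instead of the test table. The function name log_script_start and the log path in the note below are hypothetical and not taken from the actual script; recording the process ID is an extra, added so that overlapping instances can be told apart.

```php
<?php
// Hypothetical helper (not from the original script): append one line per
// script start, with a timestamp and the PHP process ID, to a log file.
function log_script_start($logFile)
{
    // getmypid() returns the current PHP process ID (or false on failure)
    $line = sprintf("%s pid=%d\n", date('Y-m-d H:i:s'), getmypid());

    // FILE_APPEND keeps earlier entries; LOCK_EX avoids interleaved writes
    // when several instances of the script start close together.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);

    return $line;
}
```

Calling log_script_start('/tmp/lucene_starts.log') as the first statement of the indexing script would produce one line per start, so the gaps between restarts can be read straight from the file.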
While watching the actual PHP processes in top on Linux, I'd see the lsphp5 process running at high CPU as the script executed, and when the "restarts" occurred, I would see multiple instances of lsphp5 appear, all running at high CPU%. (I am using the LiteSpeed web server; I have also tried Apache, and the same thing happened with the httpd processes.)

It is absolutely blowing my mind as to why this is happening. Does anyone have any clue at all what could be going on, or anything else I could try?

Here are the 2 errors I have managed to catch. These are the errors I would sometimes see when the script (or one of them at least) died after I executed it from my web browser. Sometimes I'd see this error while other instances of the script continued to run (as I watched in top).

    Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Index is under processing now' in /home/public_html/boombada.com/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php:240
    Stack trace:
    #0 /home/public_html/boombada.com/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php(416): Zend_Search_Lucene::getActualGeneration(Object(Zend_Search_Lucene_Storage_Directory_Filesystem))
    #1 /home/public_html/boombada.com/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php(175): Zend_Search_Lucene->__construct('lucene/videos', true)
    #2 /home/public_html/boombada.com/private/system/application/controllers/test.php(184): Zend_Search_Lucene::open('lucene/videos')
    #3 [internal function]: Test->luceneC()
    #4 /home/public_html/private/system/codeigniter/CodeIgniter.php(216): call_user_func_array(Array, Array)
    #5 /home/public_html/public/index.php(132): require_once('/home/public_ht...')
    #6 {main}
      thrown in /home/public_html/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php on line 240

    Fatal error: Ignoring exception from Zend_Search_Lucene_Proxy::__destruct() while an exception is already active (Uncaught Zend_Search_Lucene_Exception in /home/public_html/private/ZendFramework-1.0.3/library/Zend/Search/Lucene/Storage/File/Filesystem.php on line 59) in /home/public_html/private/system/application/controllers/test.php on line 228

Here are a few things I think I've ruled out:

1) Operating system: I've tried running the script on both Ubuntu Gutsy and Windows XP, and the same thing happens on both. The times between the restarts are longer on XP, but they still occur.

2) Web server: I've tried both LiteSpeed and Apache; the problem occurs on both.

3) Framework: I'm running this script from the CodeIgniter framework, but running it without the framework did not fix the problem.

4) PHP execution time limit or memory limit: As you can see, I have set_time_limit(0) in the script, and I have tried many different memory settings without any difference in the outcome. Currently I have PHP's max memory set to 128M.

5) Index optimization: I've tried changing the 3 auto-optimization settings, with values from the defaults all the way up to 10,000; this did not affect the restarting or the time between restarts. I have also tried performing full optimizations at various intervals during indexing, which had no effect either.

6) Version of Zend Framework: I've tried both the latest stable release, 1.0.3, and the 1.5.0 release, with the same problem on both.

7) Browser: I thought that maybe the browser I was running the script from was requesting the page multiple times when the connection to the web server timed out or something. But not only have I tried this with 3 different browsers (FF, Safari, IE), I have also successfully run other very long-running scripts (48+ hours) not involving Zend Lucene from my browser in the past.

I've poured all my energy into trying to get this to work for the past few weeks. This is pretty much a last-ditch effort for me to get Lucene working, because I really want to use it for my site instead of searching MySQL directly.
I can't imagine I'm the only one who has ever faced this problem, since my setup and the script I'm trying to run are both so simple. Any suggestions at all would be greatly appreciated.

Thanks in advance,
-Jay