I have a MySQL table of about 500K rows (about 300MB in size) that I've been
trying to index with Zend Lucene. It worked perfectly fine when adding small
batches at a time, but if I tried to index more than 2000 documents at a
time, the script would do strange things after indexing around 2500
documents. It would end prematurely, and I would end up with some documents
being indexed twice.

I stripped the script down to the most basic form I could where it still
produced the problem, and realized what was happening. First of all, here is
the code:

        require_once('Zend/Search/Lucene.php');

        // Open the existing index.
        $index = Zend_Search_Lucene::open('lucene/videos');

        // Remove PHP's execution time limit.
        set_time_limit(0);

        // Add identical test documents; no database access is involved.
        for ($x = 0; $x < 200000; $x++)
        {
            $doc = new Zend_Search_Lucene_Document();
            $doc->addField(Zend_Search_Lucene_Field::Keyword('field', 'value'));
            $index->addDocument($doc);
        }


As you can see, I'm not retrieving anything from the DB anymore; I'm just
trying to see how many test documents I can insert before the script fails.
The script ran fine with $x<100000. When I upped it to 200000, the problem
occurred.

I realized after some testing that the issue wasn't the number of documents
but the elapsed time. I added a line at the beginning of the script to insert
a timestamped row into a test table every time the script started running,
and discovered that if the script executed for more than a certain amount of
time, it would start over again from the beginning. Actually, it wouldn't
just start over: another instance of the script would start and run
alongside the original one. Then sometime after that, another instance would
start, and another, and then all the scripts would fizzle out for unknown
reasons, sometimes giving an error.

So I would run the script, and check the test table, and see anywhere from 2
to 5 rows, indicating that the script had started up again that many times.
While the number of documents it would index before starting over would
vary, the TIME would remain the same. The first restart would always come
EXACTLY 5 minutes and 4 seconds after the initial running of the script. I
know this by comparing the timestamps of the two table rows. Then any
successive rows would be almost exactly 3 minutes after the previous one,
but sometimes 2 minutes.
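
The start-logging check described above can be sketched roughly as follows.
This is a minimal, file-based variant rather than the exact code from the
script (which wrote to a test table); the helper names log_script_start and
start_gaps, and the tab-separated log format, are assumptions made for
illustration. It records the time and process ID of every start and reports
the number of seconds between consecutive starts.

```php
<?php
// Hypothetical helper: append one line (Unix time, process ID) per script start.
function log_script_start($logFile)
{
    $line = sprintf("%d\t%d\n", time(), getmypid());
    // LOCK_EX keeps concurrently running instances from interleaving writes.
    file_put_contents($logFile, $line, FILE_APPEND | LOCK_EX);
}

// Hypothetical helper: seconds between consecutive starts recorded in the log.
function start_gaps($logFile)
{
    $times = array();
    $lines = file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    foreach ($lines as $line) {
        $parts = explode("\t", $line);
        $times[] = (int) $parts[0];
    }
    sort($times);

    $gaps = array();
    for ($i = 1; $i < count($times); $i++) {
        $gaps[] = $times[$i] - $times[$i - 1];
    }
    return $gaps;
}
```

Calling log_script_start() as the first statement of the script and reading
the gaps afterwards would show a first gap of 304 seconds for a restart that
comes 5 minutes and 4 seconds after the initial run. Logging the process ID
as well makes it possible to match each start to a process seen in top.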

While observing the actual PHP processes in top on Linux, I'd see the lsphp5
process running at high CPU usage as the script executes, and when the
"restarts" occur, I would see multiple instances of lsphp5 appear, all
running at high CPU%. (I am using Litespeed web server, and have tried with
Apache, where the same thing happened with the httpd processes.)
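
One way to confirm, and at the same time contain, the overlapping instances
described above would be an exclusive lock file taken at the top of the
script. The following is only a sketch under stated assumptions: the function
name acquire_single_instance_lock and the lock file path are hypothetical,
and the script is assumed to be allowed to write to that path. It relies on
PHP's flock() with LOCK_NB, so a second instance fails immediately instead of
waiting.

```php
<?php
// Hypothetical guard: returns an open, locked file handle, or false if
// another running instance already holds the lock.
function acquire_single_instance_lock($lockFile)
{
    // Mode 'a' creates the file if it is missing and never truncates it.
    $handle = fopen($lockFile, 'a');
    if ($handle === false) {
        return false;
    }
    // LOCK_NB makes flock() return at once instead of blocking.
    if (!flock($handle, LOCK_EX | LOCK_NB)) {
        fclose($handle);
        return false;
    }
    // Keep the handle open for as long as the lock should be held.
    return $handle;
}

// Usage at the top of the indexing script (the path is an assumption):
// $lock = acquire_single_instance_lock('/tmp/lucene-index.lock');
// if ($lock === false) {
//     exit("Another indexing run is already in progress\n");
// }
```

With a guard like this, a later instance would exit before opening the index,
so it could not add documents a second time, and the message it prints would
show that a second start really took place.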

It is absolutely blowing my mind as to why this is happening. Does anyone
have any clue at all what could be going on, or anything else I could try?
Here are the 2 errors I have managed to catch. I would sometimes see them
when the script (or at least one of its instances) died after I executed it
from my web browser. Sometimes I'd see an error while other instances of the
script continued to run (as I watched in top).

Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with
message 'Index is under processing now' in
/home/public_html/boombada.com/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php:240
Stack trace: #0
/home/public_html/boombada.com/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php(416):
Zend_Search_Lucene::getActualGeneration(Object(Zend_Search_Lucene_Storage_Directory_Filesystem))
#1 
/home/public_html/boombada.com/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php(175):
Zend_Search_Lucene->__construct('lucene/videos', true) #2
/home/public_html/boombada.com/private/system/application/controllers/test.php(184):
Zend_Search_Lucene::open('lucene/videos') #3 [internal function]:
Test->luceneC() #4
/home/public_html/private/system/codeigniter/CodeIgniter.php(216):
call_user_func_array(Array, Array) #5
/home/public_html/public/index.php(132):
require_once('/home/public_ht...') #6 {main} thrown in
/home/public_html/private/ZendFramework-1.0.3/library/Zend/Search/Lucene.php
on line 240

Fatal error: Ignoring exception from
Zend_Search_Lucene_Proxy::__destruct() while an exception is already
active (Uncaught Zend_Search_Lucene_Exception in
/home/public_html/private/ZendFramework-1.0.3/library/Zend/Search/Lucene/Storage/File/Filesystem.php
on line 59) in /home/public_html/private/system/application/controllers/test.php
on line 228


Here are a few things I think I've ruled out:

1) Operating system
I've tried running the script on both Ubuntu Gutsy and Windows XP, and the
same thing happens on both. The time between restarts is longer on XP, but
the restarts still occur.

2) Web server
I've tried both Litespeed and Apache, the problem occurs on both.

3) Framework
I'm running this script from the CodeIgniter framework, but I have also tried
it without the framework, and the problem still occurred.

4) PHP execution time limit or memory limit
As you can see, I have set_time_limit(0) in the script, and have tried many
different memory settings without any differences in the outcome of the
script. Currently I have PHP's max memory set to 128M.

5) Index optimization
I've tried messing around with the 3 auto-optimization settings
(MaxBufferedDocs, MaxMergeDocs and MergeFactor). Setting them to various
values, from the defaults all the way up to 10,000, did not affect the
restarting or the time between restarts. I have also tried performing full
optimizations at various intervals during indexing, which had no effect
either.

6) Version of Zend Framework
I've tried both the latest stable release 1.0.3 and the 1.5.0 release, same
problem with both.

7) Browser
I thought that maybe the browser I was running the script from was
requesting the page multiple times when the connection to the web server
timed out or something. However, not only have I tried this with 3 different
browsers (Firefox, Safari, IE), I have also successfully run other very long
running scripts (48+ hours) not involving Zend Lucene from my browser in the
past.
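
To check the repeated-request idea more directly, each start of the script
could log which request triggered it. The sketch below is an illustration
only: the function name describe_request and the log path in the usage
comment are hypothetical, while the $_SERVER keys are the standard ones PHP
fills in for web requests. If every "restart" shows up as a new line with
request details, the extra instances were started by new HTTP requests and
not from inside the running script.

```php
<?php
// Hypothetical helper: build one log line describing the request that
// started this run. $server is expected to be $_SERVER; keys that are not
// set are logged as "-".
function describe_request(array $server)
{
    $keys = array('REMOTE_ADDR', 'REQUEST_METHOD', 'REQUEST_URI', 'HTTP_USER_AGENT');
    $parts = array('time=' . time(), 'pid=' . getmypid());
    foreach ($keys as $key) {
        $parts[] = $key . '=' . (isset($server[$key]) ? $server[$key] : '-');
    }
    return implode(' | ', $parts);
}

// Usage at the very top of the script (the log path is an assumption):
// error_log(describe_request($_SERVER) . "\n", 3, '/tmp/lucene-starts.log');
```

Comparing the logged address and user agent across starts would show whether
the repeated requests come from the browser or from something else in front
of PHP.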

I've spent all my energy trying to get this to work for the past few weeks.
This is pretty much a last ditch effort for me to get Lucene working,
because I really want to use it for my site instead of searching MySQL
directly. I can't imagine being the only one who's ever faced this problem,
since my setup and the script I'm trying to run are both so simple. Any
suggestions at all would be greatly appreciated. Thanks in advance,

-Jay
