Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-14 Thread webshark27

Hi Alexander,

1. I optimized the index using Luke 0.6 a couple of days ago, so there is
now 1 segment (183 MB).

2. The search takes 5 seconds before I display any results, just this line:

$hits  = $index->find($query);

And it returns a ton of data, not just the Document's ID.

Here: http://www.articlesbase.com/test-search2.php?q=business+consulting

Is there a way to limit the number of results returned or a minimum score?

PS. I also need to set

ini_set("memory_limit","300M");

for the script to even run.

Thanks,

Simon


Alexander Veremyev wrote:
> 
> 1) The index should be optimized (have only one segment) to make searching
> faster.
> 
> 2) A large search result set is one cause of slow searching.
> Do you retrieve any stored fields of the returned hits?
> 
> Note:
> The search itself only collects document IDs, but retrieving any stored 
> field causes the full document to be loaded. This greatly increases the 
> time needed to retrieve a large result set.
> So splitting the returned results into pages and retrieving stored info 
> _only_for_the_current_page_ makes searching much faster.
> 
> It's also a good idea to store the returned results (IDs and scores, or only 
> IDs) in an array and cache it between requests.
> Documents can then be retrieved with an $index->getDocument($id) call.
> 
> With best regards,
> Alexander Veremyev.
> 
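
The paging advice quoted above could be sketched roughly as follows. This is only a minimal illustration, not code from the thread: pageOfIds() is a hypothetical helper, and the commented usage assumes an already opened $index, a stored 'title' field, and a $page number taken from the request.

```php
<?php
/**
 * Return the slice of hit IDs that belongs to one page of results.
 * Hypothetical helper: it only slices an array and never touches the index.
 */
function pageOfIds(array $ids, $page, $perPage = 10)
{
    $page   = max(1, (int) $page);       // treat page 0 or negative as page 1
    $offset = ($page - 1) * $perPage;
    return array_slice($ids, $offset, $perPage);
}

// With Zend_Search_Lucene on the include path, usage might look like this
// (assumes $index, $query and $page already exist):
//
//   $ids = array();
//   foreach ($index->find($query) as $hit) {
//       $ids[] = $hit->id;               // only the ID, no stored fields
//   }
//   // cache $ids between requests, then load documents for one page only:
//   foreach (pageOfIds($ids, $page, 10) as $id) {
//       $doc = $index->getDocument($id);
//       echo $doc->title, "\n";
//   }
```

Only the documents on the current page are loaded this way, so the cost of retrieving stored fields no longer grows with the total number of hits.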

-- 
View this message in context: 
http://www.nabble.com/Zend_Search_Lucene---Best-Practices-for-Indexing-100k%2B-articles-tf3712199s16154.html#a10606551
Sent from the Zend Framework mailing list archive at Nabble.com.



Re: [fw-general] Zend_Search_Lucene

2007-05-11 Thread webshark27

Hi Andries,

That is normal behavior on a Windows machine.

You don't need the dot; just remember that the path is relative to the drive
you are running the script on.

Simon
 

Andries Seutens wrote:
> 
> 
> Hello,
> 
> There is no broken line or space
> 
> Best,
> 
> Andriesss
> 
> Patrycjusz Szydło wrote:
>> Look at the last line of your code; there could be a broken line or a space.
>>
>> Best,
>> patS
>>
>> Andries Seutens wrote:
>>>
>>> Hello,
>>>
>>> I am not sure if this is a bug or a feature ;) :
>>>
>>> My PHP version: 5.2.0
>>> Operating system: Windows XP - Home edition SP2
>>>
>>> My code:
>>>
>>> ---
>>> <?php
>>> require_once 'Zend/Search/Lucene.php';
>>>
>>> // mind the '.' in front of the path
>>> $index = Zend_Search_Lucene::create('./data/my-index');
>>>
>>> $doc = new Zend_Search_Lucene_Document();
>>> $doc->addField(Zend_Search_Lucene_Field::UnIndexed('created', time()));
>>> $doc->addField(Zend_Search_Lucene_Field::Text('annotation', 'Document annotation text'));
>>>
>>> $index->addDocument($doc);
>>> ---
>>>
>>> This throws: Fatal error: Exception thrown without a stack frame in 
>>> Unknown on line 0
>>>
>>> Removing the dot in front of the path resolves the issue. I'm not 
>>> sure if this is normal behaviour?
>>>
>>> Best regards,
>>>
> 
> 
> -- 
> Andries Seutens
> http://andries.systray.be
> 




Re: [fw-general] Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-11 Thread webshark27

I ran the Lucene Index Toolbox (Luke) as suggested on
http://framework.zend.com/manual/en/zend.search.index-creation.html#zend.search.index-creation.document-updating

After running an optimization there is no longer a .cfs file or a segments
file.

When I try to search I get
"object(Zend_Search_Lucene_Exception)#5 (6) { ["message:protected"]=> 
string(127) "fopen(c:/tmp/ab_index/segments) [function.fopen]: failed to
open stream: No such file or directory" ["string:private"]=>  string(0) ""
["code:protected"]=>  int(0) ["file:protected"]=>  string(73)
"C:\Inetpub\ArticlesBaseNew\Zend\Search\Lucene\Storage\File\Filesystem.php"
["line:protected"]=>  int(63) ["trace:private"]=>  array(4) {"

This is because there is no file with that name. Instead there are files
named segments.gen and segments_f.

Any ideas?




Re: [fw-general] Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-09 Thread webshark27

Hi,

I am running with MaxMergeDocs set to 1500 and the rest at their defaults
(10 and 10).

The script online runs out of memory when I try to force an optimize, but
it still indexes the articles.

I currently have around 90 .cfs files.

But now, whenever I try to search, I always get this (a var_dump of the
exception):

Object(Zend_Search_Lucene_Exception)#4 (6) {
  ["message:protected"]=>
  string(156) "fopen(/path to the file/ab_index/index.lock) [function.fopen]:
failed to open stream: Permission denied"
  ["string:private"]=>
  string(0) ""
  ["code:protected"]=>
  int(0)
  ["file:protected"]=>
  string(100) "/path to the
file/public_html/Zend/Search/Lucene/Storage/File/Filesystem.php"
  ["line:protected"]=>

Thanks,

Simon


Alexander Veremyev wrote:
> 
> Hi,
> 
> Zend_Search_Lucene uses memory for:
> 1. The preloaded term dictionary index for each index segment.
> So a large number of segments increases memory usage.
> Segments may be merged into one with the Zend_Search_Lucene::optimize()
> method.
> Segments are also partially auto-merged by the auto-optimization process. 
> Auto-optimization behavior depends on the MergeFactor and MaxMergeDocs 
> parameters.
> 
> 2. Buffered docs (documents which are indexed, but not yet written to a new 
> segment).
> When the number of buffered docs reaches the MaxBufferedDocs parameter, a 
> new segment is written to disk. This frees the memory used for buffered docs.
> 
> Did you change the MergeFactor, MaxMergeDocs or MaxBufferedDocs parameters? 
> Or did you use the default settings?
> 
> How many segments (the number of .cfs files in the index directory) do 
> you have when the script crashes?
> 
> With best regards,
> Alexander Veremyev.
> 
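
For reference, the three parameters discussed above are set on the index object before indexing. The following is only a sketch: the index path is a placeholder, and the values simply mirror the ones mentioned in this thread (MaxMergeDocs 1500, the other two at 10) rather than being recommendations. It assumes Zend Framework is on the include path and that the index already exists.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Placeholder path, not taken from the thread.
$index = Zend_Search_Lucene::open('/path/to/ab_index');

// Number of documents buffered in memory before a new segment is written.
$index->setMaxBufferedDocs(10);

// Largest segment (in documents) that auto-optimization will merge.
$index->setMaxMergeDocs(1500);

// How often segments are merged by auto-optimization.
$index->setMergeFactor(10);

// ... add documents here ...

// Merge all segments into one; this can itself need a lot of memory
// on a large index.
$index->optimize();
```

A smaller MaxBufferedDocs keeps fewer documents in memory at the cost of more segments, which in turn raises the memory needed for the preloaded term dictionaries until the segments are merged.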




[fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-08 Thread webshark27

Hi Chris,

Thanks for the quick response.

Doesn't the "$doc = new Zend_Search_Lucene_Document();" just overwrite the
old one?

Also, I think the $index->addDocument($doc) call is filling up the memory
fast, and I don't know exactly how playing with MergeFactor, MaxMergeDocs
and MaxBufferedDocs affects this issue.

I am running 10,000 each time and then commit changes - load the script
again and running 


Chris Blaise wrote:
> 
> 
>   It's been a few months since I worked with this, but I had some weird
> errors, and I never figured out whether they were due to running out of
> memory or to some weird corruption I was seeing that caused the script to
> exit.
> 
>   The fix to my problem was to free memory.  In your case, try setting
> $doc to null when you're finished with it in the loop, right after the
> $index->addDocument($doc).
> 
>  Chris
> 
> 
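
A rough sketch of how Chris's suggestion might be combined with committing in batches. None of this is from the thread: indexInBatches(), $toDocument and the batch size are hypothetical, and whether it actually lowers peak memory depends on where the memory is really going. The $index argument only needs addDocument() and commit() methods, which Zend_Search_Lucene provides.

```php
<?php
/**
 * Add documents one by one, releasing each after use and committing
 * every $batchSize documents. Hypothetical helper for illustration.
 *
 * @param array|Traversable $rows       source rows (e.g. database results)
 * @param object            $index      anything with addDocument() and commit()
 * @param callable          $toDocument turns one row into a document
 * @param int               $batchSize  documents per commit
 * @return int number of documents added
 */
function indexInBatches($rows, $index, $toDocument, $batchSize = 1000)
{
    $count = 0;
    foreach ($rows as $row) {
        $doc = call_user_func($toDocument, $row);
        $index->addDocument($doc);
        $doc = null;                      // drop the reference right away
        $count++;
        if ($count % $batchSize === 0) {
            $index->commit();             // flush this batch to disk
        }
    }
    if ($count % $batchSize !== 0) {
        $index->commit();                 // flush the final partial batch
    }
    return $count;
}
```

Passing the rows as an iterator over an unbuffered query, rather than as a fully loaded array, keeps the source data itself from occupying memory.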




[fw-general] Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-08 Thread webshark27

Hello,

I am trying to create a new index.

I have over 130,000 full-text articles in a MySQL database, ranging from 300
to 1,000 words each.

I am trying to figure out the best practice for creating the index, as I am
running into memory exhausted errors when I get to around 15,000 articles.

I read http://framework.zend.com/manual/en/zend.search.index-creation.html

So I am basically doing this:

<?php
[...]
  $doc = new Zend_Search_Lucene_Document();
  $doc->addField(Zend_Search_Lucene_Field::Text('title', sanitize($title)));
  $doc->addField(Zend_Search_Lucene_Field::Text('author', sanitize($pname)));
  $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', sanitize($content)));
  $index->addDocument($doc);
}
$index->commit();
?>

This crashes with a memory exhausted error (I have the PHP memory limit set
at 40MB).

I am trying to figure out what is being loaded into memory, and what would
be the best way to run a script for a few hours that will index my whole DB
from start to end.

Any help would be appreciated.
