Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-14 Thread Alexander Veremyev

1) The index should be optimized (have only one segment) to make searching faster.

2) A large result set is a cause of slow searching.
Do you retrieve any stored fields of the returned hits?

Note:
The search itself only collects document IDs, but retrieving any stored
field causes the full document to be loaded. This greatly increases
retrieval time for large result sets.
So splitting the returned result into pages and retrieving stored info
_only_for_the_current_page_ makes searching much faster.


It's also a good idea to store the returned result (IDs and scores, or only
IDs) in an array and cache it between requests.

Documents can then be retrieved with an $index->getDocument($id) call.
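
For example, a minimal sketch of that approach (the index path, the 'title'
field, and the session-based cache are placeholders for illustration; any
cache backend would do):

require_once 'Zend/Search/Lucene.php';

session_start();

$index    = Zend_Search_Lucene::open('/path/to/index');  // assumed path
$query    = $_GET['q'];
$page     = isset($_GET['page']) ? max(1, (int)$_GET['page']) : 1;
$pageSize = 10;

$cacheKey = 'search_' . md5($query);
if (!isset($_SESSION[$cacheKey])) {
    $idsAndScores = array();
    foreach ($index->find($query) as $hit) {
        // Touching only id and score does NOT load the stored document.
        $idsAndScores[] = array('id' => $hit->id, 'score' => $hit->score);
    }
    $_SESSION[$cacheKey] = $idsAndScores;
}

$pageHits = array_slice($_SESSION[$cacheKey], ($page - 1) * $pageSize, $pageSize);

foreach ($pageHits as $entry) {
    // Full document retrieval happens here, for the current page only.
    $doc = $index->getDocument($entry['id']);
    echo $doc->title, "\n";   // 'title' is an assumed stored field
}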

With best regards,
   Alexander Veremyev.

Simon Gelfand wrote:

Hi Craig,

You can see a test here with 130,000 articles indexed. I am getting
slow searching - 5-6 seconds.

I have added paging + max 250 hits displayed + memory caching to speed up
browsing after an initial search.

Here is an example:
http://www.articlesbase.com/test-search.php?q=business+consulting+firms 
(cached)

http://www.articlesbase.com/test-search.php?q=business+ (not cached)

Any ideas for speeding up the search itself?

Like limiting the number of results while searching (right now it returns
something like 30,000 results, which I slice), setting a minimum score for a
result, and so on?

Simon

On 5/9/07, Craig Slusher [EMAIL PROTECTED] wrote:

webshark27,

When you get your articles indexed, it would be really great if you
can share your experience with searching against it. I would love to
know how well the Zend implementation of Lucene handles the load.

On 5/8/07, webshark27 [EMAIL PROTECTED] wrote:

 Hi Chris,

 Thanks for the quick response.

 Doesn't the $doc = new Zend_Search_Lucene_Document(); just overwrite the
 old one?

 Also I think the $index->addDocument($doc) is filling up the memory fast; I
 don't know exactly how playing with MergeFactor, MaxMergeDocs and
 MaxBufferedDocs affects this issue.

 I am running 10,000 each time, committing the changes, then loading the
 script again and running ...

 Chris Blaise wrote:
 
 
   It's been a few months since I worked with this, but I had some weird
  errors that I'm not sure were due to running out of memory or to some
  weird corruption I was seeing that caused the script to exit.
 
   The fix to my problem was to free memory.  In your case, try setting
  $doc to null when you're finished with it in the loop, right after the
  $index->addDocument($doc).
 
    Chris
 
 





--
Craig Slusher
[EMAIL PROTECTED]








Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-14 Thread webshark27

Hi Alexander,

1. I optimized using Luke 0.6 a couple of days ago - so there is one
segment (183 MB).

2. The search takes 5 seconds before I display any results, just from this line:

$hits = $index->find($query);

And it returns a ton of data, not just the documents' IDs.

Here: http://www.articlesbase.com/test-search2.php?q=business+consulting

Is there a way to limit the number of results returned, or to set a minimum score?

PS. I also need to set

ini_set('memory_limit', '300M');

for the script to even run.

Thanks,

Simon


Alexander Veremyev wrote:
 
 1) The index should be optimized (have only one segment) to make searching
 faster.
 
 2) A large result set is a cause of slow searching.
 Do you retrieve any stored fields of the returned hits?
 
 Note:
 The search itself only collects document IDs, but retrieving any stored
 field causes the full document to be loaded. This greatly increases
 retrieval time for large result sets.
 So splitting the returned result into pages and retrieving stored info
 _only_for_the_current_page_ makes searching much faster.
 
 It's also a good idea to store the returned result (IDs and scores, or only
 IDs) in an array and cache it between requests.
 Documents can then be retrieved with an $index->getDocument($id) call.
 
 With best regards,
 Alexander Veremyev.
 




Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-14 Thread Alexander Veremyev

webshark27 wrote:

Hi Alexander,

1. I optimized using Luke 0.6 a couple of days ago - so there is one
segment (183 MB).

2. The search takes 5 seconds before I display any results, just from this line:

$hits = $index->find($query);

And it returns a ton of data, not just the documents' IDs.

Here: http://www.articlesbase.com/test-search2.php?q=business+consulting


Just a note:
$index->find($query) actually returns only IDs and scores. It's an array
of QueryHit objects.

A QueryHit object initially contains only the ID and Score fields, but it
automatically retrieves the document from the index when any stored field
is accessed via a QueryHit property.
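
For example (a short sketch; 'title' is an assumed stored field name):

$hits = $index->find($query);

foreach ($hits as $hit) {
    $id    = $hit->id;     // cheap: already in the QueryHit
    $score = $hit->score;  // cheap: already in the QueryHit

    // Expensive: accessing any stored field triggers a full document
    // load from the index.
    $title = $hit->title;
}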



Is there a way to limit the number of results returned, or to set a minimum score?


Zend_Search_Lucene needs to calculate all scores before it can limit search
results by score, so that doesn't help.

Apache Lucene has a special weight implementation which returns results in
document-ID order. It may help to limit search results, but it's not
implemented in Zend_Search_Lucene yet.


PS. I also need to set

ini_set('memory_limit', '300M');


Zend_Search_Lucene preloads the terms dictionary index (usually every 128th
term) and stores it in memory.
It looks like you have a very large terms dictionary, which may be produced
by large or non-tokenized unique indexed fields.
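
For illustration, a sketch of how the field type affects the terms
dictionary (the field names are assumptions, not taken from your index):

$doc = new Zend_Search_Lucene_Document();

// UnIndexed: stored for display only; adds nothing to the terms dictionary.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('url', $url));

// Keyword: stored AND indexed as a single non-tokenized term; a unique
// value like a URL adds one dictionary entry per document.
// $doc->addField(Zend_Search_Lucene_Field::Keyword('url', $url));

// Text: stored and tokenized; common words are shared across documents.
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title));

// UnStored: tokenized and searchable, but not stored.
$doc->addField(Zend_Search_Lucene_Field::UnStored('body', $body));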


Could I ask you to put your index (tarball or zip) somewhere for download,
so I can play with it?



With best regards,
   Alexander Veremyev.



For the script to even run.

Thanks,

Simon


Alexander Veremyev wrote:

1) The index should be optimized (have only one segment) to make searching
faster.

2) A large result set is a cause of slow searching.
Do you retrieve any stored fields of the returned hits?

Note:
The search itself only collects document IDs, but retrieving any stored
field causes the full document to be loaded. This greatly increases
retrieval time for large result sets.
So splitting the returned result into pages and retrieving stored info
_only_for_the_current_page_ makes searching much faster.


It's also a good idea to store the returned result (IDs and scores, or only
IDs) in an array and cache it between requests.

Documents can then be retrieved with an $index->getDocument($id) call.

With best regards,
Alexander Veremyev.







Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-13 Thread Simon Gelfand

Hi Craig,

You can see a test here with 130,000 articles indexed. I am getting
slow searching - 5-6 seconds.

I have added paging + max 250 hits displayed + memory caching to speed up
browsing after an initial search.

Here is an example:
http://www.articlesbase.com/test-search.php?q=business+consulting+firms (cached)
http://www.articlesbase.com/test-search.php?q=business+ (not cached)

Any ideas for speeding up the search itself?

Like limiting the number of results while searching (right now it returns
something like 30,000 results, which I slice), setting a minimum score for a
result, and so on?

Simon

On 5/9/07, Craig Slusher [EMAIL PROTECTED] wrote:

webshark27,

When you get your articles indexed, it would be really great if you
can share your experience with searching against it. I would love to
know how well the Zend implementation of Lucene handles the load.

On 5/8/07, webshark27 [EMAIL PROTECTED] wrote:

 Hi Chris,

 Thanks for the quick response.

 Doesn't the $doc = new Zend_Search_Lucene_Document(); just overwrite the
 old one?

 Also I think the $index->addDocument($doc) is filling up the memory fast; I
 don't know exactly how playing with MergeFactor, MaxMergeDocs and
 MaxBufferedDocs affects this issue.

 I am running 10,000 each time, committing the changes, then loading the
 script again and running ...


 Chris Blaise wrote:
 
 
   It's been a few months since I worked with this, but I had some weird
  errors that I'm not sure were due to running out of memory or to some
  weird corruption I was seeing that caused the script to exit.
 
   The fix to my problem was to free memory.  In your case, try setting
  $doc to null when you're finished with it in the loop, right after the
  $index->addDocument($doc).
 
    Chris
 
 





--
Craig Slusher
[EMAIL PROTECTED]




--
Simon Gelfand
-
http://www.articlesbase.com
http://www.reader.co.il
http://www.articuloz.com
http://www.rusarticles.com
http://www.tripslog.com
http://www.simongelfand.com


Re: [fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-09 Thread Craig Slusher

webshark27,

When you get your articles indexed, it would be really great if you
can share your experience with searching against it. I would love to
know how well the Zend implementation of Lucene handles the load.

On 5/8/07, webshark27 [EMAIL PROTECTED] wrote:


Hi Chris,

Thanks for the quick response.

Doesn't the $doc = new Zend_Search_Lucene_Document(); just overwrite the
old one?

Also I think the $index->addDocument($doc) is filling up the memory fast; I
don't know exactly how playing with MergeFactor, MaxMergeDocs and
MaxBufferedDocs affects this issue.

I am running 10,000 each time, committing the changes, then loading the
script again and running ...


Chris Blaise wrote:


   It's been a few months since I worked with this, but I had some weird
 errors that I'm not sure were due to running out of memory or to some
 weird corruption I was seeing that caused the script to exit.

   The fix to my problem was to free memory.  In your case, try setting
 $doc to null when you're finished with it in the loop, right after the
 $index->addDocument($doc).

  Chris








--
Craig Slusher
[EMAIL PROTECTED]


[fw-general] RE: Zend_Search_Lucene - Best Practices for Indexing 100k+ articles

2007-05-08 Thread webshark27

Hi Chris,

Thanks for the quick response.

Doesn't the $doc = new Zend_Search_Lucene_Document(); just overwrite the
old one?

Also I think the $index->addDocument($doc) is filling up the memory fast; I
don't know exactly how playing with MergeFactor, MaxMergeDocs and
MaxBufferedDocs affects this issue.

I am running 10,000 each time, committing the changes, then loading the
script again and running ...


Chris Blaise wrote:
 
 
   It's been a few months since I worked with this, but I had some weird
  errors that I'm not sure were due to running out of memory or to some
  weird corruption I was seeing that caused the script to exit.
 
   The fix to my problem was to free memory.  In your case, try setting
  $doc to null when you're finished with it in the loop, right after the
  $index->addDocument($doc).
 
   Chris
 
 

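
For reference, a rough sketch of the batching approach discussed in this
thread, with Chris's fix applied (fetchArticles(), the index path, and the
field names are placeholders; the MaxBufferedDocs/MergeFactor setters may
not be available in every Zend Framework release):

require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/path/to/index');  // assumed path

// Tunables mentioned above: how many documents are buffered in RAM before
// a new segment is written, and how aggressively segments are merged.
// $index->setMaxBufferedDocs(100);
// $index->setMergeFactor(10);

// fetchArticles() is an assumed helper returning one batch of rows.
foreach (fetchArticles(0, 10000) as $article) {   // one 10,000-doc batch
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('title', $article['title']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $article['body']));

    $index->addDocument($doc);
    $doc = null;   // Chris's fix: release the document immediately
}

$index->commit();   // flush this batch to disk before the next run
// After the last batch, collapse everything into one segment:
// $index->optimize();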