Re: Search Performance with NRT

2015-05-27 Thread kiwi clive
Hi Mike,
Thanks for the very prompt and clear response. We look forward to using the new
(new for us) Lucene goodies :-)
Clive

  From: Michael McCandless 
 To: Lucene Users ; kiwi clive 
 
 Sent: Thursday, May 28, 2015 2:34 AM
 Subject: Re: Search Performance with NRT
   
As long as you call SM.maybeRefresh from a dedicated refresh thread
(not from a query's thread) it will work well.

You may want to use a warmer so that the new searcher is warmed before
becoming visible to incoming queries ... this ensures any lazy data
structures are initialized by the time a query sees them.

Mike McCandless

http://blog.mikemccandless.com




On Wed, May 27, 2015 at 7:16 AM, kiwi clive
 wrote:
> Hi Guys
>
> We are considering changing our Lucene indexer / search architecture from 2 
> separate JVMs to a single one to benefit from the very latest index views NRT 
> readers provide.
>
> In the past we cached our IndexSearchers to avoid cold searches every time 
> and reopened them periodically.  In the single-JVM model where we will be 
> keeping the IndexWriters open for long periods, will we still face the same 
> problem, or will calling searcherManager.maybeRefresh() periodically be 
> enough to guarantee fast searches (as well as near-real time views)?
>
> (We intend to instantiate our SearcherManager with the IndexWriter rather 
> than a Directory.)
>
>
> Thanks, Clive


Re: Search Performance with NRT

2015-05-27 Thread Michael McCandless
As long as you call SM.maybeRefresh from a dedicated refresh thread
(not from a query's thread) it will work well.

You may want to use a warmer so that the new searcher is warmed before
becoming visible to incoming queries ... this ensures any lazy data
structures are initialized by the time a query sees them.
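
A minimal sketch of that setup against the Lucene 4.x API (the warm-up query
and class names below are illustrative, not from the original mail):

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;

public class NrtSearchSetup {

    public static SearcherManager start(IndexWriter writer) throws IOException {
        // Warm each new searcher before SearcherManager publishes it, so lazy
        // data structures are loaded before the first real query arrives.
        SearcherFactory warmer = new SearcherFactory() {
            @Override
            public IndexSearcher newSearcher(IndexReader reader) throws IOException {
                IndexSearcher searcher = new IndexSearcher(reader);
                searcher.search(new MatchAllDocsQuery(), 10); // hypothetical warm-up query
                return searcher;
            }
        };
        final SearcherManager manager = new SearcherManager(writer, true, warmer);

        // Dedicated refresh thread: query threads never pay the reopen cost.
        ScheduledExecutorService refresher = Executors.newSingleThreadScheduledExecutor();
        refresher.scheduleWithFixedDelay(new Runnable() {
            @Override
            public void run() {
                try {
                    manager.maybeRefresh(); // never called from a query thread
                } catch (IOException e) {
                    // log and carry on; the next tick will retry
                }
            }
        }, 0, 1, TimeUnit.SECONDS);
        return manager; // a real application also keeps the executor to shut it down
    }
}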

Mike McCandless

http://blog.mikemccandless.com


On Wed, May 27, 2015 at 7:16 AM, kiwi clive
 wrote:
> Hi Guys
>
> We are considering changing our Lucene indexer / search architecture from 2 
> separate JVMs to a single one to benefit from the very latest index views NRT 
> readers provide.
>
> In the past we cached our IndexSearchers to avoid cold searches every time 
> and reopened them periodically.  In the single-JVM model where we will be 
> keeping the IndexWriters open for long periods, will we still face the same 
> problem, or will calling searcherManager.maybeRefresh() periodically be 
> enough to guarantee fast searches (as well as near-real time views)?
>
> (We intend to instantiate our SearcherManager with the IndexWriter rather 
> than a Directory.)
>
>
> Thanks, Clive




Re: search performance

2014-06-20 Thread Vitaly Funstein
If you are using stored fields in your index, consider playing with
compression settings, or perhaps turning stored field compression off
altogether. Ways to do this have been discussed in this forum on numerous
occasions. This is highly use case dependent though, as your indexing
performance may or may not suffer, as a tradeoff.
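
For illustration only, a hedged sketch of tuning (rather than fully disabling)
stored-field compression through the Lucene 4.x codec API; the class and
format names exist in 4.6, but the codec name and chunk size are arbitrary
choices, and fully disabling compression would need a custom StoredFieldsFormat
that is not shown here:

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene46.Lucene46Codec;

// Trades compression ratio against (de)compression cost on stored-field loads.
public final class FastStoredFieldsCodec extends FilterCodec {

    private final StoredFieldsFormat storedFields =
            new CompressingStoredFieldsFormat("FastStored", CompressionMode.FAST, 1 << 14);

    public FastStoredFieldsCodec() {
        super("FastStoredFieldsCodec", new Lucene46Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        return storedFields;
    }
}
// usage: indexWriterConfig.setCodec(new FastStoredFieldsCodec());
// caveat: a named codec must also be registered via SPI
// (META-INF/services/org.apache.lucene.codecs.Codec) so readers can resolve it.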


On Fri, Jun 20, 2014 at 1:19 AM, Uwe Schindler  wrote:

> Hi,
>
> > Am I correct that SearcherManager can't be used with a MultiReader and
> > NRT? I would appreciate all suggestions on how to optimize our search
> > performance further. Search time has become a usability issue.
>
> Just have a SearcherManager for every index. MultiReader construction is
> cheap (it is just a wrapper, there is no overhead), so you can ask all
> SearcherManagers for the actual IndexReader and build the MultiReader on
> every search request.
>
> Uwe
>
>


RE: search performance

2014-06-20 Thread Uwe Schindler
Hi,

> Am I correct that SearcherManager can't be used with a MultiReader and
> NRT? I would appreciate all suggestions on how to optimize our search
> performance further. Search time has become a usability issue.

Just have a SearcherManager for every index. MultiReader construction is cheap
(it is just a wrapper, there is no overhead), so you can ask all
SearcherManagers for the actual IndexReader and build the MultiReader on every
search request.
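
A minimal sketch of this pattern, assuming the Lucene 4.x API (the wrapper
class and field names are illustrative):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

public class MultiIndexSearch {

    // One SearcherManager per index, kept open for the life of the application.
    private final List<SearcherManager> managers;

    public MultiIndexSearch(List<SearcherManager> managers) {
        this.managers = managers;
    }

    public TopDocs search(Query query, int n) throws IOException {
        List<IndexSearcher> acquired = new ArrayList<IndexSearcher>();
        try {
            List<IndexReader> subReaders = new ArrayList<IndexReader>();
            for (SearcherManager manager : managers) {
                IndexSearcher searcher = manager.acquire();
                acquired.add(searcher);
                subReaders.add(searcher.getIndexReader());
            }
            // Cheap per-request wrapper; closeSubReaders=false because the
            // managers own the underlying readers.
            MultiReader multi = new MultiReader(
                    subReaders.toArray(new IndexReader[subReaders.size()]), false);
            return new IndexSearcher(multi).search(query, n);
        } finally {
            for (int i = 0; i < acquired.size(); i++) {
                managers.get(i).release(acquired.get(i)); // always release acquired searchers
            }
        }
    }
}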

Uwe





Re: search performance

2014-06-20 Thread Jamie

Greetings Lucene Users

As a follow-up to my earlier mail:

We are also using Lucene segment warmers, as recommended. Segments per tier is
now set to five, and buffer memory is set to
(Runtime.getRuntime().totalMemory()*.08)/1024/1024.


See below for the code used to instantiate the writer:

LimitTokenCountAnalyzer limitAnalyzer = new LimitTokenCountAnalyzer(
        application.getAnalyzerFactory().getAnalyzer(language,
                AnalyzerFactory.Operation.INDEX), maxPerFieldTokens);
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_46, limitAnalyzer);
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setSegmentsPerTier(5);
conf.setMergePolicy(mergePolicy);
conf.setRAMBufferSizeMB(bufferMemoryMB);
writer = new IndexWriter(fsDirectory, conf);
writer.getConfig().setMergedSegmentWarmer(readerWarmer);


This particular monster 24-core machine has 110GB of RAM. I suppose one
possibility is to load the indexes that aren't being changed into RAM on
startup. However, the indexes already reside on fast SSD drives.


We're using the following JRE parameters:

-XX:+UseG1GC -XX:MaxGCPauseMillis=100 -XX:SurvivorRatio=3 -XX:+AggressiveOpts


Let me know if there is anything else we can try to obtain performance gains.


Much appreciated,

Jamie
On 2014/06/20, 9:51 AM, Jamie wrote:

Hi All

Thank you for all your suggestions. Some of the recommendations hadn't yet
been implemented, as our code base was using older versions of Lucene with
reduced capabilities. Thus far, all the recommendations for fast search have
been implemented (e.g. using pagination with searchAfter,
DirectoryReader.openIfChanged, avoiding wrapping Lucene ScoreDoc results, an
option to disable sorting, etc.).


While search performance has improved significantly in some environments, in
other larger ones we are unfortunately still seeing 1-5 minute search times.
For instance, at one site, the total index size is 500GB with 190 million
documents indexed. They are running a machine with 24 cores and 4 SSD drives
to house the indexes. New emails are being added to the indexes at a rate of
10 messages/sec.


One possible area for improvement: searching is conducted across several
indexes. To accomplish this, on each search a MultiReader is constructed that
consists of several subreaders created by the DirectoryReader.openIfChanged
method. Only one of the indexes is updated frequently; the others are never
updated. For each search, a new IndexSearcher is created, passing the
MultiReader to the constructor. From what I've read, MultiReader and
IndexSearcher are relatively lightweight and should not impact search
performance. Is this correct? Is there a faster way to handle searching
across multiple indexes? What is the performance impact of searching across
multiple indexes?


Am I correct that SearcherManager can't be used with a MultiReader and NRT? I
would appreciate all suggestions on how to optimize our search performance
further. Search time has become a usability issue.


Much appreciated,

Jamie






Re: search performance

2014-06-20 Thread Jamie

Hi All

Thank you for all your suggestions. Some of the recommendations hadn't yet
been implemented, as our code base was using older versions of Lucene with
reduced capabilities. Thus far, all the recommendations for fast search have
been implemented (e.g. using pagination with searchAfter,
DirectoryReader.openIfChanged, avoiding wrapping Lucene ScoreDoc results, an
option to disable sorting, etc.).


While search performance has improved significantly in some environments, in
other larger ones we are unfortunately still seeing 1-5 minute search times.
For instance, at one site, the total index size is 500GB with 190 million
documents indexed. They are running a machine with 24 cores and 4 SSD drives
to house the indexes. New emails are being added to the indexes at a rate of
10 messages/sec.


One possible area for improvement: searching is conducted across several
indexes. To accomplish this, on each search a MultiReader is constructed that
consists of several subreaders created by the DirectoryReader.openIfChanged
method. Only one of the indexes is updated frequently; the others are never
updated. For each search, a new IndexSearcher is created, passing the
MultiReader to the constructor. From what I've read, MultiReader and
IndexSearcher are relatively lightweight and should not impact search
performance. Is this correct? Is there a faster way to handle searching
across multiple indexes? What is the performance impact of searching across
multiple indexes?


Am I correct that SearcherManager can't be used with a MultiReader and NRT? I
would appreciate all suggestions on how to optimize our search performance
further. Search time has become a usability issue.


Much appreciated,

Jamie




Re: search performance

2014-06-06 Thread Jamie

Jon

I ended up adapting your approach. The solution involves keeping an LRU cache
of page-boundary ScoreDocs and their respective positions. New positions are
added to the cache as new pages are discovered. To cut down on searches, when
scrolling backwards and forwards, the search begins from the nearest cached
position (see the sketch below).
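
A sketch of that boundary cache (hypothetical names; the LRU eviction policy
is elided):

import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.search.ScoreDoc;

// Maps a result-set offset to the ScoreDoc that ends the page before it, so
// a jump to an arbitrary page can resume with searchAfter from the nearest
// known position instead of re-collecting from offset zero.
public class PageBoundaryCache {

    private final TreeMap<Integer, ScoreDoc> boundaries = new TreeMap<Integer, ScoreDoc>();

    public void put(int offset, ScoreDoc lastDocOfPreviousPage) {
        boundaries.put(offset, lastDocOfPreviousPage); // real code would also evict LRU-style
    }

    // Nearest cached boundary at or before the wanted offset, or null to
    // indicate the search must start from the top of the result set.
    public Map.Entry<Integer, ScoreDoc> nearest(int wantedOffset) {
        return boundaries.floorEntry(wantedOffset);
    }
}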


Cheers

Jamie

On 2014/06/03, 3:24 PM, Jon Stewart wrote:

With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:

void ensureTopDocs(final int rank) throws IOException {
    if (StartDocIndex > rank) {
        Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
        StartDocIndex = 0;
    }
    int len = Docs.scoreDocs.length;
    while (StartDocIndex + len <= rank) {
        StartDocIndex += len;
        Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1], SearchQuery, TOP_DOCS_WINDOW);
        len = Docs.scoreDocs.length;
    }
}







RE: search performance

2014-06-03 Thread Toke Eskildsen
Jamie [ja...@mailarchiva.com] wrote:
> It would be nice if, in future, the Lucene API could provide a
> searchAfter that takes a position (int).

It would not really help with large result sets. At least not with the current 
underlying implementations. This is tied into your current performance problem, 
if I understand it correctly.

We seem to have isolated your performance problems to large (10M+) result sets, 
right?

Requesting the top X results in Lucene works internally by adding to a
priority queue. The problem with PQs is that they work really well for small
result sets and really badly for large result sets (note that result set
refers to the collected documents, not to the number of matching documents).
A PQ rearranges its internal structure each time a hit is entered that has a
score >= the lowest known score. With millions of documents in the result
set, this happens all the time. Abstractly there is little difference between
small result sets and large: O(n * log n) is fine scaling. In reality, the
rearrangement of the internal heap structure only works well while the heap
fits in CPU cache.
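
For concreteness, a standalone sketch of this bounded-heap collection pattern
(plain illustrative Java, not Lucene's internal collector):

import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.Random;

public class PqCollectSketch {
    public static void main(String[] args) {
        final int topN = 1_000_000;       // size of the collected result set
        final int totalHits = 10_000_000; // matching documents streamed through
        // Min-heap of [docId, score] pairs, ordered by score.
        PriorityQueue<float[]> heap = new PriorityQueue<float[]>(topN,
                new Comparator<float[]>() {
                    public int compare(float[] a, float[] b) {
                        return Float.compare(a[1], b[1]);
                    }
                });
        Random random = new Random(42);
        long start = System.nanoTime();
        for (int docId = 0; docId < totalHits; docId++) {
            float score = random.nextFloat();
            if (heap.size() < topN) {
                heap.offer(new float[] { docId, score });
            } else if (score > heap.peek()[1]) {
                heap.poll();              // evict the current lowest score;
                heap.offer(new float[] { docId, score }); // each insert sifts the heap
            }
        }
        System.out.printf("kept %d of %d hits in %d ms%n",
                heap.size(), totalHits, (System.nanoTime() - start) / 1_000_000);
    }
}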


To test this, I created the tiny project https://github.com/tokee/luso 

It simulates the workflow (for an extremely loose value of 'simulates') you 
described with extraction of a large result set by filling a PQ of a given size 
with docIDs (ints) and scores (floats) and then extracting the ordered docIDs. 
Running it with different sizes shows how the PQ deteriorates on a 4 core i7 
with 8MB level 2 cache:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 1 1000 1 100 
100 500 1000 2000 3000 4000"

Starting 1 threads with extraction method pq
     1,000 docs in mean      15 ms,    66 docs/ms.
    10,000 docs in mean      47 ms,   212 docs/ms.
   100,000 docs in mean      65 ms, 1,538 docs/ms.
   500,000 docs in mean     385 ms, 1,298 docs/ms.
 1,000,000 docs in mean     832 ms, 1,201 docs/ms.
 5,000,000 docs in mean   7,566 ms,   660 docs/ms.
10,000,000 docs in mean  16,482 ms,   606 docs/ms.
20,000,000 docs in mean  39,481 ms,   506 docs/ms.
30,000,000 docs in mean  80,293 ms,   373 docs/ms.
40,000,000 docs in mean 109,537 ms,   365 docs/ms.

As can be seen, relative performance (docs/ms) drops significantly as the
document count increases. To add insult to injury, this deterioration pattern
is optimistic, as the test was the only heavy job on my computer. Running 4
of these tests in parallel (1 per core), we would ideally expect about the
same speed, but instead we get

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 4 1000 1 10 50 
100 500 1000 2000 3000 4000"

Starting 4 threads with extraction method pq
     1,000 docs in mean      34 ms,    29 docs/ms.
    10,000 docs in mean      70 ms,   142 docs/ms.
   100,000 docs in mean     102 ms,   980 docs/ms.
   500,000 docs in mean   1,340 ms,   373 docs/ms.
 1,000,000 docs in mean   2,564 ms,   390 docs/ms.
 5,000,000 docs in mean  19,464 ms,   256 docs/ms.
10,000,000 docs in mean  49,985 ms,   200 docs/ms.
20,000,000 docs in mean 112,321 ms,   178 docs/ms.
(I got tired of waiting and stopped after 20M docs)

The conclusion seems clear enough: Using PQ for millions of results will take a 
long time.


So what can be done? I added an alternative implementation where all the
docIDs and scores are collected in two parallel arrays, then merge-sorted
after collection. That gave the results

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 1 1000 1 10 50 
100 500 1000 2000 3000 4000"
Starting 1 threads with extraction method ip
     1,000 docs in mean      15 ms,    66 docs/ms.
    10,000 docs in mean      52 ms,   192 docs/ms.
   100,000 docs in mean      73 ms, 1,369 docs/ms.
   500,000 docs in mean     363 ms, 1,377 docs/ms.
 1,000,000 docs in mean     780 ms, 1,282 docs/ms.
 5,000,000 docs in mean   4,634 ms, 1,078 docs/ms.
10,000,000 docs in mean   9,708 ms, 1,030 docs/ms.
20,000,000 docs in mean  20,818 ms,   960 docs/ms.
30,000,000 docs in mean  32,413 ms,   925 docs/ms.
40,000,000 docs in mean  44,235 ms,   904 docs/ms.

Notice how the deterioration of relative speed is a lot less than for PQ. 
Running this with 4 threads gets us

 MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 4 1000 1 10 50 
100 500 1000 2000 3000 4000"
Starting 4 threads with extraction method ip
     1,000 docs in mean      35 ms,    28 docs/ms.
    10,000 docs in mean     221 ms,    45 docs/ms.
   100,000 docs in mean     162 ms,   617 docs/ms.
   500,000 docs in mean     639 ms,   782 docs/ms.
 1,000,000 docs in mean   1,388 ms,   720 docs/ms.
 5,000,000 docs in mean   8,372 ms,   597 docs/ms.
10,000,000 docs in mean  17,933 ms,   557 docs/ms.
20,000,000 docs in mean  36,031 ms,   555 docs/ms.
30,000,000 docs in mean  58,257 ms,   514 docs/ms.
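
A compact sketch of the collect-then-sort alternative (an assumed shape, not
the actual luso code): it packs score and docID into one long so a single
Arrays.sort orders by score, which assumes non-negative scores, whose float
bits sort correctly as integers.

import java.util.Arrays;
import java.util.Random;

public class ArrayCollectSketch {
    public static void main(String[] args) {
        final int totalHits = 10_000_000;
        long[] packed = new long[totalHits];
        Random random = new Random(42);
        long start = System.nanoTime();
        for (int docId = 0; docId < totalHits; docId++) {
            float score = random.nextFloat();
            // Score in the high 32 bits, docId in the low 32 bits:
            // an O(1), cache-friendly append per hit, no heap sifting.
            packed[docId] = ((long) Float.floatToIntBits(score) << 32)
                    | (docId & 0xFFFFFFFFL);
        }
        Arrays.sort(packed); // one sort after collection
        long best = packed[totalHits - 1]; // highest score is last (ascending sort)
        System.out.printf("sorted %d hits in %d ms; top doc=%d score=%f%n",
                totalHits, (System.nanoTime() - start) / 1_000_000,
                (int) best, Float.intBitsToFloat((int) (best >>> 32)));
    }
}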

Re: search performance

2014-06-03 Thread Jamie

Thanks Jon

I'll investigate your idea further.

It would be nice if, in future, the Lucene API could provide a 
searchAfter that takes a position (int).


Regards

Jamie

On 2014/06/03, 3:24 PM, Jon Stewart wrote:

With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:

void ensureTopDocs(final int rank) throws IOException {
    if (StartDocIndex > rank) {
        Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
        StartDocIndex = 0;
    }
    int len = Docs.scoreDocs.length;
    while (StartDocIndex + len <= rank) {
        StartDocIndex += len;
        Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1], SearchQuery, TOP_DOCS_WINDOW);
        len = Docs.scoreDocs.length;
    }
}

StartDocIndex is a member variable denoting the current rank of the
first item in TopDocs ("Docs") window. I call this function before
each Document retrieval. The common case--of the user looking at the
first page of results or the user advancing to the next page--is quite
fast. But it still supports random access, albeit not in constant
time. OTOH, if your app is concurrent, most search queries will
probably be returned very quickly so the odd query that wants to jump
deep into the result set will have more of the server's resources
available to it.

Also, given the size of your result sets, you have to allocate a lot
of memory upfront which will then get gc'd after some time. From query
to query, you will have a decent amount of memory churn. This isn't
free. My guess is using Lucene's linear (search() & searchAfter())
pagination will perform faster than your current approach just based
upon not having to create such large arrays.

I'm not the Lucene expert that Robert is, but this has worked alright for me.







Re: search performance

2014-06-03 Thread Jon Stewart
With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:

void ensureTopDocs(final int rank) throws IOException {
    if (StartDocIndex > rank) {
        Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
        StartDocIndex = 0;
    }
    int len = Docs.scoreDocs.length;
    while (StartDocIndex + len <= rank) {
        StartDocIndex += len;
        Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1], SearchQuery, TOP_DOCS_WINDOW);
        len = Docs.scoreDocs.length;
    }
}

StartDocIndex is a member variable denoting the current rank of the
first item in TopDocs ("Docs") window. I call this function before
each Document retrieval. The common case--of the user looking at the
first page of results or the user advancing to the next page--is quite
fast. But it still supports random access, albeit not in constant
time. OTOH, if your app is concurrent, most search queries will
probably be returned very quickly so the odd query that wants to jump
deep into the result set will have more of the server's resources
available to it.

Also, given the size of your result sets, you have to allocate a lot
of memory upfront which will then get gc'd after some time. From query
to query, you will have a decent amount of memory churn. This isn't
free. My guess is using Lucene's linear (search() & searchAfter())
pagination will perform faster than your current approach just based
upon not having to create such large arrays.

I'm not the Lucene expert that Robert is, but this has worked alright for me.

cheers,

Jon


On Tue, Jun 3, 2014 at 8:47 AM, Jamie  wrote:
> Robert. Thanks, I've already done a similar thing. Results on my test
> platform are encouraging..
>
>
> On 2014/06/03, 2:41 PM, Robert Muir wrote:
>>
>> Reopening for every search is not a good idea. This will have an
>> extremely high cost (not as high as what you are doing with "paging",
>> but still not good).
>>
>> Instead consider making it near-realtime, by doing this every second
>> or so instead. Look at SearcherManager for code that helps you do
>> this.
>>
>
>



-- 
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA




Re: search performance

2014-06-03 Thread Jamie
Robert, thanks. I've already done a similar thing. Results on my test
platform are encouraging.


On 2014/06/03, 2:41 PM, Robert Muir wrote:

Reopening for every search is not a good idea. This will have an
extremely high cost (not as high as what you are doing with "paging",
but still not good).

Instead consider making it near-realtime, by doing this every second
or so instead. Look at SearcherManager for code that helps you do
this.







Re: search performance

2014-06-03 Thread Robert Muir
Reopening for every search is not a good idea. This will have an
extremely high cost (not as high as what you are doing with "paging",
but still not good).

Instead consider making it near-realtime, by doing this every second
or so instead. Look at SearcherManager for code that helps you do
this.
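
A sketch of the per-query discipline that goes with SearcherManager
(Lucene 4.x API; the wrapper class is illustrative):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

public final class QueryPath {
    // Borrow a warm searcher; never reopen or close readers on the query path.
    public static TopDocs run(SearcherManager manager, Query query) throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            return searcher.search(query, 1000);
        } finally {
            manager.release(searcher); // never close() an acquired searcher
        }
    }
}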

On Tue, Jun 3, 2014 at 7:25 AM, Jamie  wrote:
> Robert
>
> Hmmm. Why did Mike go to all the trouble of implementing NRT search, if
> we are not supposed to be using it?
>
> The user simply wants the latest result set. To me, this doesn't appear out
> of scope for the Lucene project.
>
> Jamie
>
>
> On 2014/06/03, 1:17 PM, Robert Muir wrote:
>>
>> No, you are incorrect. The point of a search engine is to return top-N
>> most relevant.
>>
>> If you insist you need to open an indexreader on every single search,
>> and then return huge amounts of docs, maybe you should use a database
>> instead.
>>
>> On Tue, Jun 3, 2014 at 6:42 AM, Jamie  wrote:
>>>
>>> Vitaly / Robert
>>>
>>> I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
>>> Unless I am mistaken, the Lucene library's pagination mechanism makes the
>>> assumption that you will cache the ScoreDocs for the entire result set.
>>> This is not practical when you have a result set that exceeds 60M. As
>>> stated earlier, in any case, it is the first query that is slow.
>>>
>>> We do open index readers, since we are using NRT search and documents are
>>> being added to the indexes on a continuous basis. When the user clicks the
>>> Search button, they expect to see the latest result set. With regard to
>>> NRT search, my understanding is that we do need to open the index readers
>>> on each search operation to see the latest changes.
>>>
>>> Thus, on each search, we combine the IndexReaders into a MultiReader, and
>>> open each reader based on its corresponding writer.
>>>
>>> protected IndexReader initIndexReader() throws IOException {
>>>     List<IndexReader> readers = new LinkedList<>();
>>>     for (Writer writer : writers) {
>>>         readers.add(DirectoryReader.open(writer, true));
>>>     }
>>>     return new MultiReader(readers.toArray(new IndexReader[0]), true);
>>> }
>>>
>>> Thank you for your ideas/suggestions.
>>>
>>> Regards
>>>
>>> Jamie
>
>
>




Re: search performance

2014-06-03 Thread Jamie

Robert

FYI: I've modified the code to utilize the experimental function:

DirectoryReader dirReader =
        DirectoryReader.openIfChanged(cachedDirectoryReader, writer, true);


In this case, the IndexReader won't be opened on each search, unless 
absolutely necessary.
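
Sketched out, that caching pattern looks roughly like this (Lucene 4.x API;
the class is hypothetical and reference counting is elided):

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

public class CachedNrtReader {

    private final IndexWriter writer;
    private DirectoryReader cachedReader; // current NRT view

    public CachedNrtReader(IndexWriter writer) {
        this.writer = writer;
    }

    // Returns an up-to-date reader, reopening only when the index changed.
    public synchronized DirectoryReader current() throws IOException {
        if (cachedReader == null) {
            cachedReader = DirectoryReader.open(writer, true); // applyAllDeletes
        } else {
            DirectoryReader newer = DirectoryReader.openIfChanged(cachedReader, writer, true);
            if (newer != null) {      // null means nothing changed; keep the old reader
                cachedReader.close(); // production code must ref-count instead, or
                cachedReader = newer; // in-flight searches may hit a closed reader
            }
        }
        return cachedReader;
    }
}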


Regards

Jamie

On 2014/06/03, 1:25 PM, Jamie wrote:






Re: search performance

2014-06-03 Thread Jamie

Robert

Hmmm. Why did Mike go to all the trouble of implementing NRT search,
if we are not supposed to be using it?


The user simply wants the latest result set. To me, this doesn't appear 
out of scope for the Lucene project.


Jamie

On 2014/06/03, 1:17 PM, Robert Muir wrote:

No, you are incorrect. The point of a search engine is to return top-N
most relevant.

If you insist you need to open an indexreader on every single search,
and then return huge amounts of docs, maybe you should use a database
instead.

On Tue, Jun 3, 2014 at 6:42 AM, Jamie  wrote:

Vitaly / Robert

I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
Unless I am mistaken, the Lucene library's pagination mechanism makes the
assumption that you will cache the ScoreDocs for the entire result set. This
is not practical when you have a result set that exceeds 60M. As stated
earlier, in any case, it is the first query that is slow.

We do open index readers, since we are using NRT search and documents are
being added to the indexes on a continuous basis. When the user clicks the
Search button, they expect to see the latest result set. With regard to NRT
search, my understanding is that we do need to open the index readers on
each search operation to see the latest changes.

Thus, on each search, we combine the IndexReaders into a MultiReader, and
open each reader based on its corresponding writer.

protected IndexReader initIndexReader() throws IOException {
    List<IndexReader> readers = new LinkedList<>();
    for (Writer writer : writers) {
        readers.add(DirectoryReader.open(writer, true));
    }
    return new MultiReader(readers.toArray(new IndexReader[0]), true);
}

Thank you for your ideas/suggestions.

Regards

Jamie






Re: search performance

2014-06-03 Thread Robert Muir
No, you are incorrect. The point of a search engine is to return top-N
most relevant.

If you insist you need to open an indexreader on every single search,
and then return huge amounts of docs, maybe you should use a database
instead.

On Tue, Jun 3, 2014 at 6:42 AM, Jamie  wrote:
> Vitaly / Robert
>
> I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
> Unless I am mistaken, the Lucene library's pagination mechanism makes the
> assumption that you will cache the ScoreDocs for the entire result set. This
> is not practical when you have a result set that exceeds 60M. As stated
> earlier, in any case, it is the first query that is slow.
>
> We do open index readers, since we are using NRT search and documents are
> being added to the indexes on a continuous basis. When the user clicks the
> Search button, they expect to see the latest result set. With regard to NRT
> search, my understanding is that we do need to open the index readers on
> each search operation to see the latest changes.
>
> Thus, on each search, we combine the IndexReaders into a MultiReader, and
> open each reader based on its corresponding writer.
>
> protected IndexReader initIndexReader() throws IOException {
>     List<IndexReader> readers = new LinkedList<>();
>     for (Writer writer : writers) {
>         readers.add(DirectoryReader.open(writer, true));
>     }
>     return new MultiReader(readers.toArray(new IndexReader[0]), true);
> }
>
> Thank you for your ideas/suggestions.
>
> Regards
>
> Jamie
>
> On 2014/06/03, 12:29 PM, Vitaly Funstein wrote:
>>
>> Jamie,
>>
>> What if you were to forget for a moment the whole pagination idea, and
>> always capped your search at 1000 results for testing purposes only? This
>> is just to try and pinpoint the bottleneck here; if, regardless of the
>> query parameters, the search latency stays roughly the same and well below
>> 5 min, you now have the answer - the problem is your naive implementation
>> of pagination which results in snowballing result numbers and search
>> times,
>> the closer you get to the end of the results range. Otherwise, I would
>> focus on your query and filter next.
>>
>
>




Re: search performance

2014-06-03 Thread Jamie

Vitaly / Robert

I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
Unless I am mistaken, the Lucene library's pagination mechanism makes the
assumption that you will cache the ScoreDocs for the entire result set.
This is not practical when you have a result set that exceeds 60M. As
stated earlier, in any case, it is the first query that is slow.


We do open index readers, since we are using NRT search and documents are
being added to the indexes on a continuous basis. When the user clicks the
Search button, they expect to see the latest result set. With regard to NRT
search, my understanding is that we do need to open the index readers on
each search operation to see the latest changes.

Thus, on each search, we combine the IndexReaders into a MultiReader, and
open each reader based on its corresponding writer.


protected IndexReader initIndexReader() throws IOException {
    List<IndexReader> readers = new LinkedList<>();
    for (Writer writer : writers) {
        readers.add(DirectoryReader.open(writer, true));
    }
    return new MultiReader(readers.toArray(new IndexReader[0]), true);
}

Thank you for your ideas/suggestions.

Regards

Jamie
On 2014/06/03, 12:29 PM, Vitaly Funstein wrote:

Jamie,

What if you were to forget for a moment the whole pagination idea, and
always capped your search at 1000 results for testing purposes only? This
is just to try and pinpoint the bottleneck here; if, regardless of the
query parameters, the search latency stays roughly the same and well below
5 min, you now have the answer - the problem is your naive implementation
of pagination which results in snowballing result numbers and search times,
the closer you get to the end of the results range. Otherwise, I would
focus on your query and filter next.







Re: search performance

2014-06-03 Thread Vitaly Funstein
Jamie,

What if you were to forget for a moment the whole pagination idea, and
always capped your search at 1000 results for testing purposes only? This
is just to try and pinpoint the bottleneck here; if, regardless of the
query parameters, the search latency stays roughly the same and well below
5 min, you now have the answer - the problem is your naive implementation
of pagination which results in snowballing result numbers and search times,
the closer you get to the end of the results range. Otherwise, I would
focus on your query and filter next.


On Tue, Jun 3, 2014 at 3:21 AM, Jamie  wrote:

> Vitaly
>
> See below:
>
>
> On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:
>
>> A couple of questions.
>>
>> 1. What are you trying to achieve by setting the current thread's priority
>> to max possible value? Is it grabbing as much CPU time as possible? In my
>> experience, mucking with thread priorities like this is at best futile,
>> and
>> at worst quite detrimental to responsiveness and overall performance of
>> the
>> system as a whole. I would remove that line.
>>
> Yes,  you are right to be worried about this, especially since thread
> priorities behave differently on different platforms.
>
>
>
>> 2. This seems suspicious:
>>
>> if (getPagination()) {
>>     max = start + length;
>> } else {
>>     max = getMaxResults();
>> }
>>
>> If start is at 100M, and length is 1000 - what do you think Lucene will
>> try
>> and do when you pass this max to the collector?
>>
> I don't see the problem here. The collector will start from zero to max
> results. I agree that from a performance perspective, it's not ideal to
> return all results from the beginning of the search, but the Lucene API
> leaves us with no choice. I simply do not know the ScoreDoc to start from.
> If I did keep a record of it, then I would need to store all ScoreDocs for
> the entire result set. When there are 60M+ results, this can be problematic
> in terms of memory consumption. It would be far nicer if there were a
> searchAfter function that took a position as an integer.
>
> Regards
>
> Jamie
>
>
>>
>>
>
>


Re: search performance

2014-06-03 Thread Robert Muir
Check and make sure you are not opening an IndexReader for every
search. Be sure you don't do that.

On Mon, Jun 2, 2014 at 2:51 AM, Jamie  wrote:
> Greetings
>
> Despite following all the recommended optimizations (as described at
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of our
> installations, search performance has reached the point where it is
> unacceptably slow. For instance, in one environment, the total index size is
> 200GB, with 150 million documents indexed. With NRT enabled, search speed is
> roughly 5 minutes on average. The server resources are: 2x6 Core Intel CPU,
> 128GB, 2 SSD for index and RAID 0, with Linux.
>
> The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to
> 4.8.x. Is this likely to make any noticeable difference in performance?
>
> Clearly, longer term, we need to move to a distributed search model. We
> thought to take advantage of the distributed search features offered in
> Solr, however, our solution is very tightly integrated into Lucene directly
> (since Solr didn't exist when we started out). Moving to Solr now seems like
> a daunting prospect. We've also been following the Katta project with
> interest, but it doesn't appear to support distributed indexing, and
> development on it seems to have stalled. It would be nice if there were a
> distributed search project on the Lucene level that we could use.
>
> I realize this is a rather vague question, but are there any further
> suggestions on ways to improve search performance? We need cheap and dirty
> ideas, as well as longer term advice on a possible path forward.
>
> Much appreciated,
>
> Jamie
>




Re: search performance

2014-06-03 Thread Jamie

Vitaly

See below:

On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:

A couple of questions.

1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and overall performance of the
system as a whole. I would remove that line.
Yes,  you are right to be worried about this, especially since thread 
priorities behave differently on different platforms.




2. This seems suspicious:

if (getPagination()) {
    max = start + length;
} else {
    max = getMaxResults();
}

If start is at 100M, and length is 1000 - what do you think Lucene will try
and do when you pass this max to the collector?
I don't see the problem here. The collector will start from zero to max
results. I agree that from a performance perspective, it's not ideal to
return all results from the beginning of the search, but the Lucene API
leaves us with no choice. I simply do not know the ScoreDoc to start from.
If I did keep a record of it, then I would need to store all ScoreDocs for
the entire result set. When there are 60M+ results, this can be problematic
in terms of memory consumption. It would be far nicer if there were a
searchAfter function that took a position as an integer.


Regards

Jamie









Re: search performance

2014-06-03 Thread Vitaly Funstein
A couple of questions.

1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and overall performance of the
system as a whole. I would remove that line.

2. This seems suspicious:

if (getPagination()) {
    max = start + length;
} else {
    max = getMaxResults();
}

If start is at 100M, and length is 1000 - what do you think Lucene will try
and do when you pass this max to the collector?


On Tue, Jun 3, 2014 at 2:55 AM, Jamie  wrote:

> FYI: We are also using a MultiReader to search over multiple index readers.
>
> Search under a million documents yields good response times. When you get
> into the 60M territory, search slows to a crawl.
>
> On 2014/06/03, 11:47 AM, Jamie wrote:
>
>> Sure... see below:
>>
>
>
>


Re: search performance

2014-06-03 Thread Jamie

FYI: We are also using a MultiReader to search over multiple index readers.

Search under a million documents yields good response times. When you 
get into the 60M territory, search slows to a crawl.


On 2014/06/03, 11:47 AM, Jamie wrote:

Sure... see below:






Re: search performance

2014-06-03 Thread Jamie

Sure... see below:

protected void search(Query query, Filter queryFilter, Sort sort)
        throws BlobSearchException {
    try {
        logger.debug("start search {searchquery='" + getSearchQuery() +
                "',query='" + query.toString() + "',filterQuery='" + queryFilter +
                "',sort='" + sort + "'}");

        Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
        results.clear();

        int max;
        if (getPagination()) {
            max = start + length;
        } else {
            max = getMaxResults();
        }

        // release the old volume searchers
        IndexReader indexReader = initIndexReader();
        searcher = new IndexSearcher(indexReader, executor);
        TopFieldCollector fieldCollector =
                TopFieldCollector.create(sort, max, true, false, false, true);

        searcher.search(query, queryFilter, fieldCollector);

        TopDocs topDocs;
        if (getPagination()) {
            topDocs = fieldCollector.topDocs(start, length);
        } else {
            topDocs = fieldCollector.topDocs();
        }

        int count = 0;
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            if ((getMaxResults() > 0 && count > getMaxResults()) ||
                    (getPagination() && count++ >= length)) {
                break;
            }
            results.add(topDocs.scoreDocs[i]);
        }

        totalHits = fieldCollector.getTotalHits();

        logger.debug("search executed successfully {query='" + getSearchQuery() +
                "',returnedresults='" + results.size() + "'}");
    } catch (Exception io) {
        throw new BlobSearchException("failed to execute search query {searchquery='" +
                getSearchQuery() + "}", io, logger, ChainedException.Level.DEBUG);
    }
}
On 2014/06/03, 11:41 AM, Rob Audenaerde wrote:

Hi Jamie,

What is included in the 5 minutes?

Just the call to the searcher?

searcher.search(...)?

Can you show a bit more of the code you use?



On Tue, Jun 3, 2014 at 11:32 AM, Jamie  wrote:


Vitaly

Thanks for the contribution. Unfortunately, we cannot use Lucene's
pagination function, because in reality the user can skip pages to start
the search at any point, not just from the end of the previous search. Even
the
first search (without any pagination), with a max of 1000 hits, takes 5
minutes to complete.

Regards

Jamie

On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:


Something doesn't quite add up.

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true,
        false, false, true);

We use pagination, so only returning 1000 documents or so at a time.

You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in "pagination"
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning the top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)











Re: search performance

2014-06-03 Thread Rob Audenaerde
Hi Jamie,

What is included in the 5 minutes?

Just the call to the searcher?

searcher.search(...)?

Can you show a bit more of the code you use?



On Tue, Jun 3, 2014 at 11:32 AM, Jamie  wrote:

> Vitaly
>
> Thanks for the contribution. Unfortunately, we cannot use Lucene's
> pagination function, because in reality the user can skip pages to start
> the search at any point, not just from the end of the previous search. Even
> the
> first search (without any pagination), with a max of 1000 hits, takes 5
> minutes to complete.
>
> Regards
>
> Jamie
>
> On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:
>
>> Something doesn't quite add up.
>>
>> TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true,
>>         false, false, true);
>>
>> We use pagination, so only returning 1000 documents or so at a time.
>>
>> You say you are using pagination, yet the API you are using to create your
>> collector isn't how you would utilize Lucene's built-in "pagination"
>> feature (unless I misunderstand the API). If the max in the snippet above is
>> 1000, then you're simply returning the top 1000 docs every time you execute
>> your search. Otherwise... well, could you actually post a bit more of your
>> code that runs the search here, in particular?
>>
>> Assuming that the max is much larger than 1000, however, you could call
>> fieldCollector.topDocs(int, int) after accumulating hits using this
>> collector, but this won't work multiple times per query execution,
>> according to the javadoc. So you either have to re-execute the full search,
>> and then get the next chunk of ScoreDocs, or use the proper API for this,
>> one that accepts as a parameter the end of the previous page of results,
>> i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
>>
>>
>
>


Re: search performance

2014-06-03 Thread Jamie

Vitaly

Thanks for the contribution. Unfortunately, we cannot use Lucene's 
pagination function, because in reality the user can skip pages to start 
the search at any point, not just from the end of the previous search. 
Even the
first search (without any pagination), with a max of 1000 hits, takes 5 
minutes to complete.


Regards

Jamie
On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:

Something doesn't quite add up.

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true,
        false, false, true);

We use pagination, so only returning 1000 documents or so at a time.

You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in "pagination"
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning the top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)







Re: search performance

2014-06-03 Thread Vitaly Funstein
Something doesn't quite add up.

> TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true,
> false, false, true);
>
> We use pagination, so only returning 1000 documents or so at a time.

You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in "pagination"
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning the top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
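
A sketch of that searchAfter-based paging loop (Lucene 4.x API; the wrapper
class is illustrative):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public final class DeepPaging {
    // Walks the result set page by page; each page resumes from the last
    // ScoreDoc of the previous one, so no page re-collects earlier hits.
    public static void walk(IndexSearcher searcher, Query query, int pageSize)
            throws IOException {
        ScoreDoc after = null;
        while (true) {
            TopDocs page = (after == null)
                    ? searcher.search(query, pageSize)
                    : searcher.searchAfter(after, query, pageSize);
            if (page.scoreDocs.length == 0) {
                break; // past the end of the result set
            }
            // ... render page.scoreDocs ...
            after = page.scoreDocs[page.scoreDocs.length - 1];
        }
    }
}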


Re: search performance

2014-06-03 Thread Jamie

Toke

Thanks for the comment. See below:

On 2014/06/03, 9:17 AM, Toke Eskildsen wrote:

On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:

Unfortunately, in this instance, it is a live production system, so we
cannot conduct experiments. The number is definitely accurate.

We have many different systems with a similar load that observe the same
performance issue. To my knowledge, the Lucene integration code is
fairly well optimized.

It is possible that the extreme slowness is a combination of factors,
but with a bit of luck it will boil down to a single thing. Standard
procedure is to disable features until it performs well, so

- Disable running updates

No can do.

- Limit page size

Done this.

- Limit lookup of returned fields

Done this.

- Disable highlighting

No highlighting.

- Simpler queries

They are as simple as possible.

- Whatever else you might think of
Our application has been using Lucene for seven years. It has been 
constantly optimized over that period.


I'll conduct further testing...


At some point along the way I would expect a sharp increase in
performance.


I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark









Re: search performance

2014-06-03 Thread Toke Eskildsen
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
> Unfortunately, in this instance, it is a live production system, so we 
> cannot conduct experiments. The number is definitely accurate.
> 
> We have many different systems with a similar load that observe the same 
> performance issue. To my knowledge, the Lucene integration code is 
> fairly well optimized.

It is possible that the extreme slowness is a combination of factors,
but with a bit of luck it will boil down to a single thing. Standard
procedure is to disable features until it performs well, so

- Disable running updates
- Limit page size
- Limit lookup of returned fields
- Disable highlighting
- Simpler queries
- Whatever else you might think of

At some point along the way I would expect a sharp increase in
performance.

> I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark






Re: search performance

2014-06-02 Thread Christoph Kaser
Can you take thread stacktraces (repeatedly) during those 5-minute
searches? That might give you (or someone on the mailing list) a clue
where all that time is spent.
You could try using jstack for that: 
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html


Regards
Christoph

On 03.06.2014 08:17, Jamie wrote:

Toke

Thanks for the comment.

Unfortunately, in this instance, it is a live production system, so we 
cannot conduct experiments. The number is definitely accurate.


We have many different systems with a similar load that observe the 
same performance issue. To my knowledge, the Lucene integration code 
is fairly well optimized.


I've requested access to the indexes so that we can perform further 
testing.


Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:

On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]


With NRT enabled, search speed is roughly 5 minutes on average.
The server resources are:
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark













Re: search performance

2014-06-02 Thread Jamie

Toke

Thanks for the comment.

Unfortunately, in this instance, it is a live production system, so we 
cannot conduct experiments. The number is definitely accurate.


We have many different systems with a similar load that observe the same 
performance issue. To my knowledge, the Lucene integration code is 
fairly well optimized.


I've requested access to the indexes so that we can perform further testing.

Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:

On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]


With NRT enabled, search speed is roughly 5 minutes on average.
The server resources are:
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark










Re: search performance

2014-06-02 Thread Toke Eskildsen
On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]

> With NRT enabled, search speed is roughly 5 minutes on average.
> The server resources are: 
> 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark






Re: search performance

2014-06-02 Thread Tri Cao

This is an interesting performance problem and I think there is probably not
a single answer here, so I'll just lay out the steps I would take to tackle this:

1. What is the variance of the query latency? You said the average is 5 minutes,
but is it due to some really bad queries, or do most queries have the same perf?

2. We kind of assume that index size and number of docs is the issue here.
Can you validate that assumption by trying to index with 10M, 50M, … docs
and seeing how much worse the performance gets as a function of size?

3. What is the average hit count for the bad queries? If your queries match
a lot of hits, scoring will be very expensive. While you only ask for the top
1000 scored docs, Lucene still needs to score all the hits to get those 1000
docs. If this is the case, there are some possible workarounds, but let's make
sure that it's indeed the situation we are dealing with here (a quick way to
measure this is sketched below).
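
One way to measure point 3 with stock Lucene is TotalHitCountCollector, which
counts matches without ranking them (the wrapper class below is illustrative):

import java.io.IOException;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TotalHitCountCollector;

public final class HitCountProbe {
    // Counts matching docs without collecting or scoring a top-N list,
    // which isolates matching cost from scoring/collection cost.
    public static int countHits(IndexSearcher searcher, Query query) throws IOException {
        TotalHitCountCollector collector = new TotalHitCountCollector();
        searcher.search(query, collector);
        return collector.getTotalHits();
    }
}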

Hope this helps,
Tri

On Jun 01, 2014, at 11:50 PM, Jamie  wrote:

Greetings

Despite following all the recommended optimizations (as described at 
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of
our installations, search performance has reached the point where it is
unacceptably slow. For instance, in one environment, the total index
size is 200GB, with 150 million documents indexed. With NRT enabled, 
search speed is roughly 5 minutes on average. The server resources are: 
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.


The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to 
4.8.x. Is this likely to make any noticeable difference in performance?


Clearly, longer term, we need to move to a distributed search model. We 
thought to take advantage of the distributed search features offered in 
Solr, however, our solution is very tightly integrated into Lucene 
directly (since Solr didn't exist when we started out). Moving to Solr 
now seems like a daunting prospect. We've also been following the Katta
project with interest, but it doesn't appear to support distributed
indexing, and development on it seems to have stalled. It would be nice
if there were a distributed search project on the Lucene level that we 
could use.


I realize this is a rather vague question, but are there any further 
suggestions on ways to improve search performance? We need cheap and 
dirty ideas, as well as longer term advice on a possible path forward.


Much appreciated,

Jamie




Re: search performance

2014-06-02 Thread Jamie
I assume you meant 1000 documents. Yes, the page size is in fact
configurable. However, it only obtains the page size * 3: it preloads the
following and previous pages too. The point is, it only obtains the
documents that are needed.



On 2014/06/02, 3:03 PM, Tincu Gabriel wrote:

My bad, it's using the RAMDirectory as a cache and a delegate directory
that you pass in the constructor to do the disk operations, limiting the
use of the RAMDirectory to files that fit a certain size. So I guess the
underlying Directory implementation will be whatever you choose it to be.
I'd still try using an MMapDirectory and see if that improves performance.
Also, regarding the pagination, you said you're retrieving 1000 documents
at a time. Does that mean that if a query matches 1 documents you want
all of them retrieved?


On Mon, Jun 2, 2014 at 12:51 PM, Jamie  wrote:


I was under the impression that NRTCachingDirectory will instantiate an
MMapDirectory if a 64 bit platform is detected? Is this not the case?


On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:


>> MMapDirectory will do the job for you. RAMDirectory has a big warning in
>> the class description stating that the performance will get killed by an
>> index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
>> for RAMDirectory and suitable for low update rates. MMap will use the
>> system RAM to cache as much of the index as it can and only hit disk when
>> the portion of the index you're trying to access isn't cached. I'd put my
>> money on switching directory implementations and see what kind of
>> performance gains that brings to the table.





-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-02 Thread Tincu Gabriel
My bad. It's using the RamDirectory as a cache and a delegate directory
that you pass in the constructor to do the disk operations, limiting the
use of the RamDirectory to files that fit a certain size. So I guess the
underlying Directory implementation will be whatever you choose it to be.
I'd still try using an MMapDirectory and see if that improves performance.
Also, regarding the pagination, you said you're retrieving 1000 documents
at a time. Does that mean that if a query matches 1 documents you want
all of them retrieved?


On Mon, Jun 2, 2014 at 12:51 PM, Jamie  wrote:

> I was under the impression that NRTCachingDirectory will instantiate an
> MMapDirectory if a 64 bit platform is detected? Is this not the case?
>
>
> On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:
>
>> MMapDirectory will do the job for you. RamDirectory has a big warning in
>> the class description stating that the performance will get killed by an
>> index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
>> for RamDirectory and suitable for low update rates. MMap will use the
>> system RAM to cache as much of the index it can and only hit disk when the
>> portion of the index you're trying to access isn't cached. I'd put my
>> money
>> on switching directory implementations and see what kind of performance
>> gains that brings to the table.
>>
>>
>


Re: search performance

2014-06-02 Thread Jamie
I was under the impression that NRTCachingDirectory will instantiate an 
MMapDirectory if a 64 bit platform is detected? Is this not the case?


On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:

MMapDirectory will do the job for you. RamDirectory has a big warning in
the class description stating that the performance will get killed by an
index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
for RamDirectory and suitable for low update rates. MMap will use the
system RAM to cache as much of the index it can and only hit disk when the
portion of the index you're trying to access isn't cached. I'd put my money
on switching directory implementations and see what kind of performance
gains that brings to the table.





Re: search performance

2014-06-02 Thread Tincu Gabriel
MMapDirectory will do the job for you. RamDirectory has a big warning in
the class description stating that the performance will get killed by an
index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
for RamDirectory and suitable for low update rates. MMap will use the
system RAM to cache as much of the index it can and only hit disk when the
portion of the index you're trying to access isn't cached. I'd put my money
on switching directory implementations and see what kind of performance
gains that brings to the table.
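
A minimal sketch of that switch against the 4.x API (the index path is an
assumption):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

// hedged sketch: serve the on-disk index through the OS page cache via mmap
static IndexSearcher openMmapSearcher(File indexDir) throws IOException {
    MMapDirectory dir = new MMapDirectory(indexDir);
    DirectoryReader reader = DirectoryReader.open(dir);
    return new IndexSearcher(reader);
}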


On Mon, Jun 2, 2014 at 11:50 AM, Jamie  wrote:

> Jack
>
> First off, thanks for applying your mind to our performance problem.
>
>
> On 2014/06/02, 1:34 PM, Jack Krupansky wrote:
>
>> Do you have enough system memory to fit the entire index in OS system
>> memory so that the OS can fully cache it instead of thrashing with I/O? Do
>> you see a lot of I/O or are the queries compute-bound?
>>
> Nice idea. The index is 200GB, the machine currently has 128GB RAM. We are
> using SSDs, but disappointingly, installing them didn't reduce search times
> to acceptable levels. I'll have to check your last question regarding
> I/O... I assume it is I/O bound, though will double check...
>
> Currently, we are using
>
> fsDirectory = new NRTCachingDirectory(fsDir, 5.0, 60.0);
>
> Are you proposing we increase maxCachedMB or use the RAMDirectory? With
> the latter, we will still need to persist the index data to disk, as it
> is undergoing constant updates.
>
>
>> You said you have a 128GB machine, so that sounds small for your index.
>> Have you tried a 256GB machine?
>>
> Nope... didn't think it would make much of a difference. I suppose, assuming
> we could store the entire index in RAM it would be helpful. How does one do
> this with Lucene, while still persisting the data?
>
>
>> How frequent are your commits for updates while doing queries?
>>
> Around ten to fifteen documents are being constantly added per second.
>
> Thanks again
>
>
> Jamie
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: search performance

2014-06-02 Thread Jamie

Jack

First off, thanks for applying your mind to our performance problem.

On 2014/06/02, 1:34 PM, Jack Krupansky wrote:
Do you have enough system memory to fit the entire index in OS system 
memory so that the OS can fully cache it instead of thrashing with 
I/O? Do you see a lot of I/O or are the queries compute-bound?
Nice idea. The index is 200GB, the machine currently has 128GB RAM. We 
are using SSDs, but disappointingly, installing them didn't reduce 
search times to acceptable levels. I'll have to check your last question 
regarding I/O... I assume it is I/O bound, though will double check...


Currently, we are using

fsDirectory = new NRTCachingDirectory(fsDir, 5.0, 60.0);

Are you proposing we increase maxCachedMB or use the RAMDirectory? With 
the latter, we will still need to persist the index data to disk, as 
it is undergoing constant updates.
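
For illustration, a sketch of the first option with MMapDirectory kept
underneath (the 64/256 MB figures are placeholders, not recommendations):

import java.io.File;
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

// hedged sketch: a larger NRT cache over an mmap'd on-disk index
static Directory openNrtCachedDir(File indexDir) throws IOException {
    Directory fsDir = new MMapDirectory(indexDir);
    // cache newly flushed segments up to 64MB each, 256MB in total, in RAM
    return new NRTCachingDirectory(fsDir, 64.0, 256.0);
}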


You said you have a 128GB machine, so that sounds small for your 
index. Have you tried a 256GB machine?
Nope... didn't think it would make much of a difference. I suppose, 
assuming we could store the entire index in RAM it would be helpful. How 
does one do this with Lucene, while still persisting the data?


How frequent are your commits for updates while doing queries?

Around ten to fifteen documents are being constantly added per second.

Thanks again

Jamie


-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-02 Thread Jack Krupansky
Do you have enough system memory to fit the entire index in OS system memory 
so that the OS can fully cache it instead of thrashing with I/O? Do you see 
a lot of I/O or are the queries compute-bound?


You said you have a 128GB machine, so that sounds small for your index. Have 
you tried a 256GB machine?


How frequent are your commits for updates while doing queries?

-- Jack Krupansky

-Original Message- 
From: Jamie

Sent: Monday, June 2, 2014 2:51 AM
To: java-user@lucene.apache.org
Subject: search performance

Greetings

Despite following all the recommended optimizations (as described at
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of
our installations, search performance has reached the point where it is
unacceptably slow. For instance, in one environment, the total index
size is 200GB, with 150 million documents indexed. With NRT enabled,
search speed is roughly 5 minutes on average. The server resources are:
2x6-core Intel CPUs, 128GB RAM, 2 SSDs in RAID 0 for the index, running Linux.

The only thing we haven't yet done is to upgrade Lucene from 4.7.x to
4.8.x. Is this likely to make any noticeable difference in performance?

Clearly, longer term, we need to move to a distributed search model. We
thought to take advantage of the distributed search features offered in
Solr, however, our solution is very tightly integrated into Lucene
directly (since Solr didn't exist when we started out). Moving to Solr
now seems like a daunting prospect. We've also been following the Katta
project with interest, but it doesn't appear to support distributed
indexing, and development on it seems to have stalled. It would be nice
if there were a distributed search project on the Lucene level that we
could use.

I realize this is a rather vague question, but are there any further
suggestions on ways to improve search performance? We need cheap and
dirty ideas, as well as longer term advice on a possible path forward.

Much appreciated

Jamie

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-02 Thread Jamie

Tom

Thanks for the offer of assistance.

On 2014/06/02, 12:02 PM, Tincu Gabriel wrote:

What kind of queries are you pushing into the index.

We are indexing regular emails + attachments.

Typical query is something like:
filter: to:mbox08 from:mbox08 cc:mbox08 bcc:mbox08 
deliveredto:mbox08 sender:mbox08 recipient:mbox08

combined with filter query "cat:email"

We also use range queries based on date.

Do they match a lot of documents ?

Yes, although we are using a collector...

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max,
    true, false, false, true);


We use pagination, so only returning 1000 documents or so at a time.
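
For context, a sketch of that pattern; the "sid" sort field and the page
arithmetic are assumptions based on the snippet above:

// hedged sketch (Lucene 4.x): collect enough hits for the nearby pages,
// then pull out just the requested slice
Sort sort = new Sort(new SortField("sid", SortField.Type.LONG, true));
int pageSize = 1000;
int needed = (page + 2) * pageSize;  // current page plus the preloaded next one
TopFieldCollector collector =
        TopFieldCollector.create(sort, needed, true, false, false, true);
searcher.search(query, collector);
TopDocs slice = collector.topDocs(page * pageSize, pageSize);  // this page only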


  Do you do any sorting on the result set?

Yes

  What is the average
document size ?

approx 100KB, We are indexing email body + attachment content.

Do you have a lot of update traffic ?
Yes, we have a lot of update traffic, particularly in the environment I 
referred to. Is there a way to prioritize searching as opposed to updating?


I suppose we could block all indexing while searching is on the go? Is 
there such an option in Lucene, or should we implement this?

What kind of schema
does your index use ?
Not sure exactly what you are referring to here. We do have a lot of 
stored fields (to, from, bcc, cc, etc.). The body and attachments are 
analyzed.


Regards

Jamie






-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: search performance

2014-06-02 Thread Tincu Gabriel
What kind of queries are you pushing into the index? Do they match a lot of
documents? Do you do any sorting on the result set? What is the average
document size? Do you have a lot of update traffic? What kind of schema
does your index use?


On Mon, Jun 2, 2014 at 6:51 AM, Jamie  wrote:

> Greetings
>
> Despite following all the recommended optimizations (as described at
> http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of
> our installations, search performance has reached the point where it is
> unacceptably slow. For instance, in one environment, the total index size
> is 200GB, with 150 million documents indexed. With NRT enabled, search
> speed is roughly 5 minutes on average. The server resources are: 2x6-core
> Intel CPUs, 128GB RAM, 2 SSDs in RAID 0 for the index, running Linux.
>
> The only thing we haven't yet done is to upgrade Lucene from 4.7.x to
> 4.8.x. Is this likely to make any noticeable difference in performance?
>
> Clearly, longer term, we need to move to a distributed search model. We
> thought to take advantage of the distributed search features offered in
> Solr, however, our solution is very tightly integrated into Lucene directly
> (since Solr didn't exist when we started out). Moving to Solr now seems
> like a daunting prospect. We've also been following the Katta project with
> interest, but it doesn't appear to support distributed indexing, and
> development on it seems to have stalled. It would be nice if there were a
> distributed search project on the Lucene level that we could use.
>
> I realize this is a rather vague question, but are there any further
> suggestions on ways to improve search performance? We need cheap and dirty
> ideas, as well as longer term advice on a possible path forward.
>
> Much appreciated
>
> Jamie
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Mike Klaas

On 6-Nov-07, at 3:02 PM, Paul Elschot wrote:


On Tuesday 06 November 2007 23:14:01 Mike Klaas wrote:



Wait--shouldn't the outer-most BooleanQuery provide most of this
speedup already (since it should be skipTo'ing between the nested
BooleanQueries and the outermost).  Is it the indirection and sub-
query management that is causing the performance difference, or
differences in skipTo behaviour?


The usual Lucene answer to performance questions: it depends.

After every hit, next() needs to be called on a subquery before
skipTo() can be used to find the next hit. It is currently not  
defined which

subquery will be used for this first next().

The structure of the scorers normally follows the structure of
the BooleanQueries, so the indirection over the deep subquery
scores could well  be relevant to performance, too.

Which of these factors actually dominates performance is hard
to predict in advance. The point of skipTo() is that it tries to avoid
disk I/O as much as possible for the first time that the query is
executed. Later executions are much more likely to hit the OS cache,
and then the indirections will be more relevant to performance.

I'd like to have a good way to do a performance test on a first
query execution, in the sense that it does not hit the OS cache
for its skipTo() executions, but I have not found a good way yet.


Interesting--thanks for the thoughtful answer.

-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Paul Elschot
On Tuesday 06 November 2007 23:14:01 Mike Klaas wrote:
> On 29-Oct-07, at 9:43 AM, Paul Elschot wrote:
> > On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:
> >> +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e
> >>
> >> is much faster than
> >>
> >> (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e)
> >>
> >> where the second one is a result from BooleanQuery in
> >> BooleanQuery, and
> >> all have Occur.MUST.
> >
> > Simplifying boolean queries like this is not available in Lucene,
> > but it
> > would have a positive effect on search performance, especially when
> > prop1:a and prop2:b have a high document frequency.
>
> Wait--shouldn't the outer-most BooleanQuery provide most of this
> speedup already (since it should be skipTo'ing between the nested
> BooleanQueries and the outermost).  Is it the indirection and sub-
> query management that is causing the performance difference, or
> differences in skipTo behaviour?

The usual Lucene answer to performance questions: it depends.

After every hit, next() needs to be called on a subquery before
skipTo() can be used to find the next hit. It is currently not defined which 
subquery will be used for this first next().

The structure of the scorers normally follows the structure of
the BooleanQueries, so the indirection over the deep subquery
scores could well  be relevant to performance, too.

Which of these factors actually dominates performance is hard
to predict in advance. The point of skipTo() is that it tries to avoid
disk I/O as much as possible for the first time that the query is
executed. Later executions are much more likely to hit the OS cache,
and then the indirections will be more relevant to performance.

I'd like to have a good way to do a performance test on a first
query execution, in the sense that it does not hit the OS cache
for its skipTo() executions, but I have not found a good way yet.
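
To make that concrete, here is a rough sketch of the leapfrog a conjunction
performs, in the Scorer API of this era (an illustration, not Lucene's
actual code):

import java.io.IOException;
import org.apache.lucene.search.Scorer;

// hedged sketch: zig-zag intersection of two required subscorers
static boolean nextHit(Scorer a, Scorer b) throws IOException {
    // after a hit, next() must be called before skipTo() is usable
    if (!a.next() || !b.next()) return false;
    while (a.doc() != b.doc()) {
        // advance whichever scorer is behind to the other's document
        if (a.doc() < b.doc()) {
            if (!a.skipTo(b.doc())) return false;
        } else {
            if (!b.skipTo(a.doc())) return false;
        }
    }
    return true;  // both scorers sit on the same doc: a hit
}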

Regards,
Paul Elschot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance using BooleanQueries in BooleanQueries

2007-11-06 Thread Mike Klaas

On 29-Oct-07, at 9:43 AM, Paul Elschot wrote:


On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:

+prop1:a +prop2:b +prop3:c +prop4:d +prop5:e

is much faster than

(+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e)

where the second one is a result from BooleanQuery in  
BooleanQuery, and

all have Occur.MUST.



Simplifying boolean queries like this is not available in Lucene,  
but it

would have a positive effect on search performance, especially when
prop1:a and prop2:b have a high document frequency.


Wait--shouldn't the outer-most BooleanQuery provide most of this  
speedup already (since it should be skipTo'ing between the nested  
BooleanQueries and the outermost).  Is it the indirection and sub- 
query management that is causing the performance difference, or  
differences in skipTo behaviour?


-Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Search performance using BooleanQueries in BooleanQueries

2007-10-30 Thread Ard Schrijvers


> On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:
> > Hello,
> >
> > I am seeing that a query with boolean queries in boolean 
> queries takes 
> > much longer than just a single boolean query when the 
> number of hits 
> > is fairly large. For example
> >
> > +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e
> >
> > is much faster than
> >
> > (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e)
> >
> > where the second one is a result from BooleanQuery in BooleanQuery, 
> > and all have Occur.MUST.
> >
> > Is there a way to detect and rewrite the second inefficient query?
> > query.rewrite() does not change the query AFAICS.
> 
> Simplifying boolean queries like this is not available in 
> Lucene, but it would have a positive effect on search 
> performance, especially when prop1:a and prop2:b have a high 
> document frequency.
> 
> You could write this yourself, for example by overriding 
> BooleanQuery.rewrite(). Take care about query weights, though.

Thanks for the pointer!

Regards Ard

> 
> Regards,
> Paul Elschot
> 
> 
> >
> > thanks for any help,
> >
> > Regards Ard
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance using BooleanQueries in BooleanQueries

2007-10-29 Thread Paul Elschot
On Friday 26 October 2007 09:36:58 Ard Schrijvers wrote:
> Hello,
>
> I am seeing that a query with boolean queries in boolean queries takes
> much longer than just a single boolean query when the number of hits is
> fairly large. For example
>
> +prop1:a +prop2:b +prop3:c +prop4:d +prop5:e
>
> is much faster than
>
> (+(+(+(+prop1:a +prop2:b) +prop3:c) +prop4:d) +prop5:e)
>
> where the second one is a result from BooleanQuery in BooleanQuery, and
> all have Occur.MUST.
>
> Is there a way to detect and rewrite the second inefficient query?
> query.rewrite() does not change the query AFAICS.

Simplifying boolean queries like this is not available in Lucene, but it
would have a positive effect on search performance, especially when
prop1:a and prop2:b have a high document frequency.

You could write this yourself, for example by overriding 
BooleanQuery.rewrite(). Take care about query weights, though.
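
A rough sketch of such a flattening pass, against the 2.x BooleanQuery API
(it deliberately skips subqueries with boosts and ignores
minimumNumberShouldMatch, which a real version would have to handle):

import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

// hedged sketch: pull all-MUST nested BooleanQueries up one level
static Query flatten(Query q) {
    if (!(q instanceof BooleanQuery)) return q;
    BooleanQuery out = new BooleanQuery();
    BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
    for (int i = 0; i < clauses.length; i++) {
        Query sub = flatten(clauses[i].getQuery());
        if (clauses[i].getOccur() == BooleanClause.Occur.MUST
                && sub instanceof BooleanQuery
                && sub.getBoost() == 1.0f
                && allMust((BooleanQuery) sub)) {
            // inline the nested conjunction's clauses directly
            BooleanClause[] inner = ((BooleanQuery) sub).getClauses();
            for (int j = 0; j < inner.length; j++) out.add(inner[j]);
        } else {
            out.add(sub, clauses[i].getOccur());
        }
    }
    return out;
}

static boolean allMust(BooleanQuery bq) {
    BooleanClause[] cs = bq.getClauses();
    for (int i = 0; i < cs.length; i++)
        if (cs[i].getOccur() != BooleanClause.Occur.MUST) return false;
    return true;
}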

Regards,
Paul Elschot


>
> thanks for any help,
>
> Regards Ard



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance question

2007-09-06 Thread Mike Klaas


On 6-Sep-07, at 4:41 AM, makkhar wrote:



Hi,

   I have an index which contains more than 20K documents. Each  
document has

the following structure :

field : ID (Index and store)  typical value  
- "1000"

field : parameterName(index and store)  typical value -
"/mcp/data/parent1/parent2/child1/child2/status"
field : parameterValue(index and not store)typical value - "draft"

When I search for a term which results in "all" the documents getting
returned, the search time is more than 1 sec. I have still not done
hits.doc(), which I understand, would be even worse.

My problem is, I am expecting the search itself to happen in the  
order of a
few milliseconds irrespective of the number of documents it  
matched. Am I

expecting too much ?


20K docs is not very many.  I would expect a simple TermQuery to be  
on the order of milliseconds, _after_ the OS has cached the index in  
memory.  Does the time improve after some warmup?


-Mike


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance question

2007-09-06 Thread Grant Ingersoll

Have a look at http://wiki.apache.org/lucene-java/BasicsOfPerformance

Are you opening the IndexSearcher every time you search, even when no  
documents have changed?
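
A minimal sketch of the reuse pattern (the version check and synchronization
are one way to do it, not the only one):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;

// hedged sketch: share one IndexSearcher; reopen only when the index changed
class SearcherCache {
    private IndexSearcher searcher;
    private long version = -1;

    synchronized IndexSearcher get(Directory dir) throws IOException {
        long current = IndexReader.getCurrentVersion(dir);
        if (searcher == null || current != version) {
            // the old searcher should be closed once in-flight queries finish
            searcher = new IndexSearcher(IndexReader.open(dir));
            version = current;
        }
        return searcher;
    }
}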


-Grant

On Sep 6, 2007, at 7:41 AM, makkhar wrote:



Hi,

   I have an index which contains more than 20K documents. Each  
document has

the following structure :

field : ID (Index and store)  typical value  
- "1000"

field : parameterName(index and store)  typical value -
"/mcp/data/parent1/parent2/child1/child2/status"
field : parameterValue(index and not store)typical value - "draft"

When I search for a term which results in "all" the documents getting
returned, the search time is more than 1 sec. I have still not done
hits.doc(), which I understand, would be even worse.

My problem is, I am expecting the search itself to happen in the  
order of a
few milliseconds irrespective of the number of documents it  
matched. Am I

expecting too much ?
--
View this message in context:
http://www.nabble.com/Search-performance-question-tf4391551.html#a12520740

Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance question

2007-09-06 Thread Mark Miller
You're not expecting too much. On cheap hardware I watch searches over 
5 million+ docs that match every doc come back in under a second. Are you 
able to post your search code?


makkhar wrote:

Hi,

   I have an index which contains more than 20K documents. Each document has
the following structure :

field : ID (Index and store)  typical value - "1000"
field : parameterName(index and store)  typical value -
"/mcp/data/parent1/parent2/child1/child2/status"
field : parameterValue(index and not store)typical value - "draft"

When I search for a term which results in "all" the documents getting
returned, the search time is more than 1 sec. I have still not done
hits.doc(), which I understand, would be even worse.

My problem is, I am expecting the search itself to happen in the order of a
few milliseconds irrespective of the number of documents it matched. Am I
expecting too much ?
  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread M A

It is indeed a lot faster ...

Will use that one now ..

hits = searcher.search(query, new Sort(new
SortField(null,SortField.DOC,true)));


That is completing in under a sec for pretty much all the queries ..




On 8/22/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 8/21/06, M A <[EMAIL PROTECTED]> wrote:
> I still don't get this. How would I do this, so I can try it out ..


http://lucene.apache.org/java/docs/api/org/apache/lucene/search/SortField.html#SortField(java.lang.String,%20int,%20boolean)

new Sort(new SortField(null, SortField.DOC, true))


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search
server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread Yonik Seeley

On 8/21/06, M A <[EMAIL PROTECTED]> wrote:

I still don't get this. How would I do this, so I can try it out ..


http://lucene.apache.org/java/docs/api/org/apache/lucene/search/SortField.html#SortField(java.lang.String,%20int,%20boolean)

new Sort(new SortField(null, SortField.DOC, true))


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread M A

I still don't get this. How would I do this, so I can try it out ..

is

searcher.search(query, new Sort(SortField.DOC))

.. correct? This would return results in the order of the documents, so how
would I reverse this, i.e. the later documents appearing first ..

searcher.search(query, new Sort()

How do you get document number descending for the sort, that is?



On 8/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 8/21/06, M A <[EMAIL PROTECTED]> wrote:
> Yeah I tried looking this up,
>
> If I wanted to do it by document id (highest docs first), does this
mean
> doing something like
>
> hits = searcher.search(query, new Sort(new SortField(DOC, true))); // or
> something like that,
>
> is this way of sorting any different performance-wise to what I was
doing
> before ..

Definitely a lot faster if you don't warm up and re-use your searchers.
Sorting by docid doesn't require the FieldCache, so you don't get the
first-search penalty.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search
server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread Yonik Seeley

On 8/21/06, M A <[EMAIL PROTECTED]> wrote:

Yeah I tried looking this up,

If I wanted to do it by document id (highest docs first), does this mean
doing something like

hits = searcher.search(query, new Sort(new SortField(DOC, true))); // or
something like that,

is this way of sorting any different performance-wise to what I was doing
before ..


Definitely a lot faster if you don't warm up and re-use your searchers.
Sorting by docid doesn't require the FieldCache, so you don't get the
first-search penalty.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread M A

Yeah I tried looking this up,

If I wanted to do it by document id (highest docs first), does this mean
doing something like

hits = searcher.search(query, new Sort(new SortField(DOC, true))); // or
something like that,

is this way of sorting any different performance-wise to what I was doing
before ..






On 8/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:


On 8/20/06, M A <[EMAIL PROTECTED]> wrote:
> The index is already built in date order, i.e. the older documents appear
> first in the index; what I am trying to achieve, however, is the latest
> documents appearing first in the search results .. without the sort .. I
> think they appear by relevance .. well that's what it looked like ..

You can specify a Sort by internal lucene docid (forward or reverse).
That's your fastest and least memory intensive option if the docs are
indexed in date order.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search
server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread Yonik Seeley

On 8/20/06, M A <[EMAIL PROTECTED]> wrote:

The index is already built in date order, i.e. the older documents appear
first in the index; what I am trying to achieve, however, is the latest
documents appearing first in the search results .. without the sort .. I
think they appear by relevance .. well that's what it looked like ..


You can specify a Sort by internal lucene docid (forward or reverse).
That's your fastest and least memory intensive option if the docs are
indexed in date order.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread Yonik Seeley

 public void search(Weight weight,
org.apache.lucene.search.Filter filter, final HitCollector results)
throws IOException {
 HitCollector collector = new HitCollector() {
 public final void collect(int doc, float score) {
 try {
 String str = reader.document(doc).get("sid");
 results.collect(doc, Float.parseFloat(str));
 } catch(Exception e) {


Ahhh... that explains things.
Retrieving documents is much slower than using Lucene's indices.
If you want to do something like this, use FunctionQuery or use the FieldCache.
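
A sketch of the FieldCache route, mirroring the Float.parseFloat call in the
code above: the "sid" values are parsed once per reader instead of loading a
stored document for every hit.

// hedged sketch: rank by cached "sid" values; no stored-field reads per hit
final float[] sids = FieldCache.DEFAULT.getFloats(reader, "sid");
HitCollector collector = new HitCollector() {
    public final void collect(int doc, float score) {
        results.collect(doc, sids[doc]);  // plain array lookup
    }
};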

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-21 Thread M A

Ok this is what I have done so far ->

static class MyIndexSearcher extends IndexSearcher {
    IndexReader reader = null;

    public MyIndexSearcher(IndexReader r) {
        super(r);
        reader = r;
    }

    public void search(Weight weight, org.apache.lucene.search.Filter filter,
                       final HitCollector results) throws IOException {
        HitCollector collector = new HitCollector() {
            public final void collect(int doc, float score) {
                try {
                    // System.err.println(" doc " + doc + " score " + score);
                    // rank by the stored "sid" (epoch millis) instead of relevance
                    String str = reader.document(doc).get("sid");
                    results.collect(doc, Float.parseFloat(str));
                } catch (Exception e) {
                    // skip docs with a missing or unparsable sid
                }
            }
        };

        Scorer scorer = weight.scorer(reader);
        if (scorer == null)
            return;
        scorer.score(collector);
    }
};


Which is essentially an overridden method; although it's not fully optimized
(I'm sure there is a way to make it quicker), my timing has gone down to sub-5
secs a query. Not ideal, but definitely better than what I was getting before
..

In fact some searches now complete in under a sec .. which is a definite
result ..

The reason for doing it this way is simple .. the field sid stores a long
value that is the epoch, therefore the larger this value the more recent the
story and hence .. the higher it should be in the ranking ..

I guess the only bottleneck now is reading the value from the field .. since
for the multifield queries that method (collect(int doc, float score)) gets
called a hell of a lot of times ..

now just have to find a way to eliminate low-scoring ones .. and I am set ..


Thanx




On 8/20/06, M A <[EMAIL PROTECTED]> wrote:


 The index is already built in date order, i.e. the older documents appear
first in the index; what I am trying to achieve, however, is the latest
documents appearing first in the search results .. without the sort .. I
think they appear by relevance .. well that's what it looked like ..

I am looking at the scoring as we speak,



On 8/20/06, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> About Luke... I don't know about command-line interfaces, but you can
> copy
> your index to a different machine and use Luke there. I do this between
> Linux and Windows boxes all the time. Or, if you can mount the remote
> drive
> so you can see it, you can just use Luke to browse to it and open it up.
> You
> may have some latency though.
>
> See below...
>
> On 8/20/06, M A <[EMAIL PROTECTED]> wrote:
> >
> > Ok I get your point, this still however means the first search on the
> new
> > searcher will take a huge amount of time .. given that this is
> happening
> > now
> > ..
>
>
> You can fire one or several canned queries at the searcher whenever you
> open
> a new one. That way the first time a *user* hits the box, the warm-up
> will
> already have happened. Note that the same searcher can be used by
> multiple
> threads...
>
>
> i.e. new search -> new query -> get hits ->20+ secs ..  this happens
> every 5
> > mins or so ..
> >
> > although subsequent searches may be quicker ..
> >
> > Am I to assume for a first search the amount of time is ok -> ..
> seems
> > like
> > a long time to me ..?
> >
> > The other thing is the sorting is fixed .. it never changes .. it is
> > always
> > sorted by the same field ..
>
>
> Assuming that you still have performance issues, you could think about
> building your index in pre-sorted order and just avoiding the sorting
> altogether. The internal Lucene document IDs are then your sort order (a
> newly
> added doc has an ID that is always greater than any existing doc ID). I
>
> don't know details of your problem space, but this might be relatively
> easy. You won't want to return things in relevance order in that
> case. In
> fact, you probably don't want relevance in place at all since your
> sorting
> doesn't change. I think a ConstantScoreQuery might work for you
> here.
>
> But I wouldn't go there unless you have evidence that your sort is
> slowing
> you down, which is easy enough to verify by just taking it out. Don't
> bother
> with any of this until you re-use your reader though
>
> I just built the entire index and it still takes ages ...
>
>
> The search took ages? Or building the index? If the former, then
> rebuilding
> the index is irrelevant, it's the first time you use a searcher that
> counts.
>
> On 8/20/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > : This is because the index is updated every 5 mins or so, due to
> the
> > > incoming
> > > : feed of stories ..
> > > :
> > > : When you say iteration, i take it you mean, search request, well
> for
> > > each
> > > : search that is conducted I create a new one .. search reader that
> is
> > ..
> > >
> > > yeah ... I meant iteration of your test.  don't do that.
> > >
> 

Re: Search Performance Problem 16 sec for 250K docs

2006-08-20 Thread M A

 The index is already built in date order, i.e. the older documents appear
first in the index; what I am trying to achieve, however, is the latest
documents appearing first in the search results .. without the sort .. I
think they appear by relevance .. well that's what it looked like ..

I am looking at the scoring as we speak,



On 8/20/06, Erick Erickson <[EMAIL PROTECTED]> wrote:


About Luke... I don't know about command-line interfaces, but you can copy
your index to a different machine and use Luke there. I do this between
Linux and Windows boxes all the time. Or, if you can mount the remote
drive
so you can see it, you can just use Luke to browse to it and open it up.
You
may have some latency though.

See below...

On 8/20/06, M A <[EMAIL PROTECTED]> wrote:
>
> Ok I get your point, this still however means the first search on the
new
> searcher will take a huge amount of time .. given that this is happening
> now
> ..


You can fire one or several canned queries at the searcher whenever you
open
a new one. That way the first time a *user* hits the box, the warm-up will
already have happened. Note that the same searcher can be used by multiple
threads...


i.e. new search -> new query -> get hits ->20+ secs ..  this happens every
5
> mins or so ..
>
> although subsequent searches may be quicker ..
>
> Am I to assume for a first search the amount of time is ok -> .. seems
> like
> a long time to me ..?
>
> The other thing is the sorting is fixed .. it never changes .. it is
> always
> sorted by the same field ..


Assuming that you still have performance issues, you could think about
building your index in pre-sorted order and just avoiding the sorting
altogether. The internal Lucene document IDs are then your sort order (a
newly
added doc has an ID that is always greater than any existing doc ID). I
don't know details of your problem space, but this might be relatively
easy. You won't want to return things in relevance order in that case.
In
fact, you probably don't want relevance in place at all since your sorting
doesn't change. I think a ConstantScoreQuery might work for you here.

But I wouldn't go there unless you have evidence that your sort is slowing
you down, which is easy enough to verify by just taking it out. Don't
bother
with any of this until you re-use your reader though

I just built the entire index and it still takes ages ...


The search took ages? Or building the index? If the former, then
rebuilding
the index is irrelevant, it's the first time you use a searcher that
counts.

On 8/20/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> >
> > : This is because the index is updated every 5 mins or so, due to the
> > incoming
> > : feed of stories ..
> > :
> > : When you say iteration, i take it you mean, search request, well for
> > each
> > : search that is conducted I create a new one .. search reader that is
> ..
> >
> > yeah ... I meant iteration of your test.  don't do that.
> >
> > if the index is updated every 5 minutes, then open a new searcher
every
> 5
> > minutes -- and reuse it for the entire 5 minutes.  if it's updated
> > "sporadically throughout the day" then open a search, and keep using
it
> > until the index is updated, then open a new one.
> >
> > reusing an IndexSearcher as long as possible is one of the biggest factors
> of
> > Lucene applications.
> >
> > :
> > :
> > :
> > : On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> > : >
> > : >
> > : > : hits = searcher.search(query, new Sort("sid", true));
> > : >
> > : > you don't show where searcher is initialized, and you don't
clarify
> > how
> > : > you are timing your multiple iterations -- i'm going to guess that
> you
> > are
> > : > opening a new searcher every iteration right?
> > : >
> > : > sorting on a field requires pre-computing an array of information
> for
> > that
> > : > field -- this is both time and space expensive, and is cached per
> > : > IndexReader/IndexSearcher -- so if you reuse the same searcher and
> > time
> > : > multiple iterations you'll find that the first iteration might be
> > somewhat
> > : > slow, but the rest should be very fast.
> > : >
> > : >
> > : >
> > : > -Hoss
> > : >
> > : >
> > : >
> -
> > : > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > : > For additional commands, e-mail: [EMAIL PROTECTED]
> > : >
> > : >
> > :
> >
> >
> >
> > -Hoss
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>




Re: Search Performance Problem 16 sec for 250K docs

2006-08-20 Thread Erick Erickson

Talk about mails crossing in the aether.. wrote my resonse before seeing
the last two...

Sounds like you're on track.

Erick

On 8/20/06, Erick Erickson <[EMAIL PROTECTED]> wrote:


About Luke... I don't know about command-line interfaces, but you can copy
your index to a different machine and use Luke there. I do this between
Linux and Windows boxes all the time. Or, if you can mount the remote drive
so you can see it, you can just use Luke to browse to it and open it up. You
may have some latency though.

See below...

On 8/20/06, M A <[EMAIL PROTECTED]> wrote:
>
> Ok I get your point, this still however means the first search on the
> new
> searcher will take a huge amount of time .. given that this is happening
> now
> ..


You can fire one or several canned queries at the searcher whenever you
open a new one. That way the first time a *user* hits the box, the warm-up
will already have happened. Note that the same searcher can be used by
multiple threads...


i.e. new search -> new query -> get hits ->20+ secs ..  this happens every
> 5
> mins or so ..
>
> although subsequent searches may be quicker ..
>
> Am I to assume for a first search the amount of time is ok -> .. seems
> like
> a long time to me ..?
>
> The other thing is the sorting is fixed .. it never changes .. it is
> always
> sorted by the same field ..


Assuming that you still have performance issues, you could think about
building your index in pre-sorted order and just avoiding the sorting
altogether. The internal Lucene document IDs are then your sort order (a newly
added doc has an ID that is always greater than any existing doc ID). I
don't know details of your problem space, but this might be relatively
easy. You won't want to return things in relevance order in that case. In
fact, you probably don't want relevance in place at all since your sorting
doesn't change. I think a ConstantScoreQuery might work for you here.

But I wouldn't go there unless you have evidence that your sort is slowing
you down, which is easy enough to verify by just taking it out. Don't bother
with any of this until you re-use your reader though


I just built the entire index and it still takes ages ...


The search took ages? Or building the index? If the former, then
rebuilding the index is irrelevant, it's the first time you use a searcher
that counts.

On 8/20/06, Chris Hostetter <[EMAIL PROTECTED] > wrote:
> >
> >
> > : This is because the index is updated every 5 mins or so, due to the
> > incoming
> > : feed of stories ..
> > :
> > : When you say iteration, i take it you mean, search request, well for
>
> > each
> > : search that is conducted I create a new one .. search reader that is
> ..
> >
> > yeah ... I meant iteration of your test.  don't do that.
> >
> > if the index is updated every 5 minutes, then open a new searcher
> every 5
> > minutes -- and reuse it for the entire 5 minutes.  if it's updated
> > "sporadically throughout the day" then open a search, and keep using
> it
> > until the index is updated, then open a new one.
> >
> > reusing an IndexSearcher as long as possible is one of the biggest factors
> of
> > Lucene applications.
> >
> > :
> > :
> > :
> > : On 8/19/06, Chris Hostetter < [EMAIL PROTECTED]> wrote:
> > : >
> > : >
> > : > : hits = searcher.search(query, new Sort("sid", true));
> > : >
> > : > you don't show where searcher is initialized, and you don't
> clarify
> > how
> > : > you are timing your multiple iterations -- i'm going to guess that
> you
> > are
> > : > opening a new searcher every iteration right?
> > : >
> > : > sorting on a field requires pre-computing an array of information
> for
> > that
> > : > field -- this is both time and space expensive, and is cached per
> > : > IndexReader/IndexSearcher -- so if you reuse the same searcher and
> > time
> > : > multiple iterations you'll find that the first iteration might be
> > somewhat
> > : > slow, but the rest should be very fast.
> > : >
> > : >
> > : >
> > : > -Hoss
> > : >
> > : >
> > : >
> -
> > : > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > : > For additional commands, e-mail: [EMAIL PROTECTED]
> > : >
> > : >
> > :
> >
> >
> >
> > -Hoss
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
>
>



Re: Search Performance Problem 16 sec for 250K docs

2006-08-20 Thread Erick Erickson

About Luke... I don't know about command-line interfaces, but you can copy
your index to a different machine and use Luke there. I do this between
Linux and Windows boxes all the time. Or, if you can mount the remote drive
so you can see it, you can just use Luke to browse to it and open it up. You
may have some latency though.

See below...

On 8/20/06, M A <[EMAIL PROTECTED]> wrote:


Ok I get your point, this still however means the first search on the new
searcher will take a huge amount of time .. given that this is happening
now
..



You can fire one or several canned queries at the searcher whenever you open
a new one. That way the first time a *user* hits the box, the warm-up will
already have happened. Note that the same searcher can be used by multiple
threads...


i.e. new search -> new query -> get hits ->20+ secs ..  this happens every 5

mins or so ..

although subsequent searches may be quicker ..

Am I to assume for a first search the amount of time is ok -> .. seems
like
a long time to me ..?

The other thing is the sorting is fixed .. it never changes .. it is
always
sorted by the same field ..



Assuming that you still have performance issues, you could think about
building your index in pre-sorted order and just avoiding the sorting
altogether. The internal Lucene document IDs are then your sort order (a newly
added doc has an ID that is always greater than any existing doc ID). I
don't know details of your problem space, but this might be relatively
easy. You won't want to return things in relevance order in that case. In
fact, you probably don't want relevance in place at all since your sorting
doesn't change. I think a ConstantScoreQuery might work for you here.

But I wouldn't go there unless you have evidence that your sort is slowing
you down, which is easy enough to verify by just taking it out. Don't bother
with any of this until you re-use your reader though

I just built the entire index and it still takes ages ...


The search took ages? Or building the index? If the former, then rebuilding
the index is irrelevant, it's the first time you use a searcher that counts.

On 8/20/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:

>
>
> : This is because the index is updated every 5 mins or so, due to the
> incoming
> : feed of stories ..
> :
> : When you say iteration, i take it you mean, search request, well for
> each
> : search that is conducted I create a new one .. search reader that is
..
>
> yeah ... I meant iteration of your test.  don't do that.
>
> if the index is updated every 5 minutes, then open a new searcher every
5
> minutes -- and reuse it for the entire 5 minutes.  if it's updated
> "sporadically throughout the day" then open a search, and keep using it
> until the index is updated, then open a new one.
>
> reusing an IndexSearcher as long as possible is one of the biggest factors
of
> Lucene applications.
>
> :
> :
> :
> : On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : >
> : >
> : > : hits = searcher.search(query, new Sort("sid", true));
> : >
> : > you don't show where searcher is initialized, and you don't clarify
> how
> : > you are timing your multiple iterations -- i'm going to guess that
you
> are
> : > opening a new searcher every iteration right?
> : >
> : > sorting on a field requires pre-computing an array of information
for
> that
> : > field -- this is both time and space expensive, and is cached per
> : > IndexReader/IndexSearcher -- so if you reuse the same searcher and
> time
> : > multiple iterations you'll find that the first iteration might be
> somewhat
> : > slow, but the rest should be very fast.
> : >
> : >
> : >
> : > -Hoss
> : >
> : >
> : >
-
> : > To unsubscribe, e-mail: [EMAIL PROTECTED]
> : > For additional commands, e-mail: [EMAIL PROTECTED]
> : >
> : >
> :
>
>
>
> -Hoss
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>




Re: Search Performance Problem 16 sec for 250K docs

2006-08-20 Thread M A

Just ran some tests .. it appears that the problem is in the sorting ..

i.e.

//hits = searcher.search(query, new Sort("sid", true));-> 17 secs
//hits = searcher.search(query, new Sort("sid", false)); -> 17 secs
hits = searcher.search(query);-> less than 1 sec ..

am trying something out .. .. will keep you posted





On 8/20/06, Erik Hatcher <[EMAIL PROTECTED]> wrote:


This is why a warming strategy like the one Solr uses is very valuable.  The
searchable index is always serving up requests as fast as Lucene
works, which is achieved by warming a new IndexSearcher with searches/
sorts/filter creating/etc before it is swapped into use.

   Erik


On Aug 20, 2006, at 5:35 AM, M A wrote:

> Ok I get your point, this still however means the first search on
> the new
> searcher will take a huge amount of time .. given that this is
> happening now
> ..
>
> i.e. new search -> new query -> get hits ->20+ secs ..  this
> happens every 5
> mins or so ..
>
> although subsequent searches may be quicker ..
>
> Am I to assume for a first search the amount of time is ok -> ..
> seems like
> a long time to me ..?
>
> The other thing is the sorting is fixed .. it never changes .. it
> is always
> sorted by the same field ..
>
> I just built the entire index and it still takes ages ...
>
>
>
>
>
>
>
>
> On 8/20/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>>
>>
>> : This is because the index is updated every 5 mins or so, due to the
>> incoming
>> : feed of stories ..
>> :
>> : When you say iteration, i take it you mean, search request, well
>> for
>> each
>> : search that is conducted I create a new one .. search reader
>> that is ..
>>
>> yeah ... I meant iteration of your test.  don't do that.
>>
>> if the index is updated every 5 minutes, then open a new searcher
>> every 5
>> minutes -- and reuse it for the entire 5 minutes.  if it's updated
>> "sporadically throughout the day" then open a search, and keep
>> using it
>> until the index is updated, then open a new one.
>>
>> reusing an IndexSearcher as long as possible is one of the biggest
>> factors of
>> Lucene applications.
>>
>> :
>> :
>> :
>> : On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>> : >
>> : >
>> : > : hits = searcher.search(query, new Sort("sid", true));
>> : >
>> : > you don't show where searcher is initialized, and you don't
>> clarify
>> how
>> : > you are timing your multiple iterations -- i'm going to guess
>> that you
>> are
>> : > opening a new searcher every iteration right?
>> : >
>> : > sorting on a field requires pre-computing an array of
>> information for
>> that
>> : > field -- this is both time and space expensive, and is cached per
>> : > IndexReader/IndexSearcher -- so if you reuse the same searcher
>> and
>> time
>> : > multiple iterations you'll find that the first iteration might be
>> somewhat
>> : > slow, but the rest should be very fast.
>> : >
>> : >
>> : >
>> : > -Hoss
>> : >
>> : >
>> : >
>> -
>> : > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> : > For additional commands, e-mail: [EMAIL PROTECTED]
>> : >
>> : >
>> :
>>
>>
>>
>> -Hoss
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Search Performance Problem 16 sec for 250K docs

2006-08-20 Thread Erik Hatcher
This is why a warming strategy like the one Solr uses is very valuable.  The  
searchable index is always serving up requests as fast as Lucene  
works, which is achieved by warming a new IndexSearcher with searches/ 
sorts/filter creating/etc before it is swapped into use.
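
A rough sketch of that hand-off; the path, canned query and sort field are
assumptions taken from this thread:

// hedged sketch: warm a new searcher before it replaces the live one
IndexSearcher newSearcher = new IndexSearcher(IndexReader.open(INDEX_PATH));
// one canned sorted search populates the "sid" sort cache up front
newSearcher.search(new TermQuery(new Term("tags", "apple")),
                   new Sort("sid", true));
searcher = newSearcher;  // only now expose it; close the old one when idle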


Erik


On Aug 20, 2006, at 5:35 AM, M A wrote:

Ok I get your point, this still however means the first search on  
the new
searcher will take a huge amount of time .. given that this is  
happening now

..

i.e. new search -> new query -> get hits ->20+ secs ..  this  
happens every 5

mins or so ..

although subsequent searches may be quicker ..

Am I to assume for a first search the amount of time is ok -> ..
seems like

a long time to me ..?

The other thing is the sorting is fixed .. it never changes .. it  
is always

sorted by the same field ..

I just built the entire index and it still takes ages ...








On 8/20/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: This is because the index is updated every 5 mins or so, due to the
incoming
: feed of stories ..
:
: When you say iteration, i take it you mean, search request, well  
for

each
: search that is conducted I create a new one .. search reader  
that is ..


yeah ... I meant iteration of your test.  don't do that.

if the index is updated every 5 minutes, then open a new searcher  
every 5

minutes -- and reuse it for the entire 5 minutes.  if it's updated
"sporadically throughout the day" then open a search, and keep
using it

until the index is updated, then open a new one.

reusing an IndexSearcher as long as possible is one of the biggest
factors of

Lucene applications.

:
:
:
: On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: >
: >
: > : hits = searcher.search(query, new Sort("sid", true));
: >
: > you don't show where searcher is initialized, and you don't  
clarify

how
: > you are timing your multiple iterations -- i'm going to guess  
that you

are
: > opening a new searcher every iteration right?
: >
: > sorting on a field requires pre-computing an array of  
information for

that
: > field -- this is both time and space expensive, and is cached per
: > IndexReader/IndexSearcher -- so if you reuse the same searcher  
and

time
: > multiple iterations you'll find that the first iteration might be
somewhat
: > slow, but the rest should be very fast.
: >
: >
: >
: > -Hoss
: >
: >
: >  
-

: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
: >
: >
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-20 Thread M A

Ok I get your point, this still however means the first search on the new
searcher will take a huge amount of time .. given that this is happening now
..

i.e. new search -> new query -> get hits ->20+ secs ..  this happens every 5
mins or so ..

although subsequent searches may be quicker ..

Am I to assume for a first search the amount of time is ok -> .. seems like
a long time to me ..?

The other thing is the sorting is fixed .. it never changes .. it is always
sorted by the same field ..

I just built the entire index and it still takes ages ...








On 8/20/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: This is because the index is updated every 5 mins or so, due to the
incoming
: feed of stories ..
:
: When you say iteration, i take it you mean, search request, well for
each
: search that is conducted I create a new one .. search reader that is ..

yeah ... I meant iteration of your test.  don't do that.

if the index is updated every 5 minutes, then open a new searcher every 5
minutes -- and reuse it for the entire 5 minutes.  if it's updated
"sporadically throughout the day" then open a search, and keep using it
until the index is updated, then open a new one.

reusing an IndexSearcher as long as possible is one of the biggest factors of
Lucene applications.

:
:
:
: On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: >
: >
: > : hits = searcher.search(query, new Sort("sid", true));
: >
: > you don't show where searcher is initialized, and you don't clarify
how
: > you are timing your multiple iterations -- i'm going to guess that you
are
: > opening a new searcher every iteration right?
: >
: > sorting on a field requires pre-computing an array of information for
that
: > field -- this is both time and space expensive, and is cached per
: > IndexReader/IndexSearcher -- so if you reuse the same searcher and
time
: > multiple iterations you'll find that the first iteration might be
somewhat
: > slow, but the rest should be very fast.
: >
: >
: >
: > -Hoss
: >
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
: >
: >
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Search Performance Problem 16 sec for 250K docs

2006-08-19 Thread Chris Hostetter

: This is because the index is updated every 5 mins or so, due to the incoming
: feed of stories ..
:
: When you say iteration, i take it you mean, search request, well for each
: search that is conducted I create a new one .. search reader that is ..

yeah ... I meant iteration of your test.  don't do that.

if the index is updated every 5 minutes, then open a new searcher every 5
minutes -- and reuse it for the entire 5 minutes.  if it's updated
"sporadically throughout the day" then open a search, and keep using it
until the index is updated, then open a new one.

reusing an IndexSearcher as long as possible is one of the biggest factors of
Lucene applications.
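
A rough sketch of that schedule (the timer period, index path and warm-up
query are assumptions from this thread):

// hedged sketch: reopen on a fixed schedule; query threads reuse the field
class SearcherHolder {
    volatile IndexSearcher searcher;

    void start(final String indexPath) {
        java.util.Timer timer = new java.util.Timer(true);  // daemon thread
        timer.schedule(new java.util.TimerTask() {
            public void run() {
                try {
                    IndexSearcher fresh =
                            new IndexSearcher(IndexReader.open(indexPath));
                    // warm the "sid" sort cache before swapping the searcher in
                    fresh.search(new TermQuery(new Term("tags", "apple")),
                                 new Sort("sid", true));
                    searcher = fresh;  // close the old one once it is idle
                } catch (Exception e) {
                    // keep the previous searcher if the reopen fails
                }
            }
        }, 0, 5 * 60 * 1000);  // every 5 minutes
    }
}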

:
:
:
: On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
: >
: >
: > : hits = searcher.search(query, new Sort("sid", true));
: >
: > you don't show where searcher is initialized, and you don't clarify how
: > you are timing your multiple iterations -- i'm going to guess that you are
: > opening a new searcher every iteration right?
: >
: > sorting on a field requires pre-computing an array of information for that
: > field -- this is both time and space expensive, and is cached per
: > IndexReader/IndexSearcher -- so if you reuse the same searcher and time
: > multiple iterations you'll find that the first iteration might be somewhat
: > slow, but the rest should be very fast.
: >
: >
: >
: > -Hoss
: >
: >
: > -
: > To unsubscribe, e-mail: [EMAIL PROTECTED]
: > For additional commands, e-mail: [EMAIL PROTECTED]
: >
: >
:



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-19 Thread M A

yes, there is a new searcher opened each time a search is conducted,

This is because the index is updated every 5 mins or so, due to the incoming
feed of stories ..

When you say iteration, I take it you mean search request .. well, for each
search that is conducted I create a new one .. search reader that is ..



On 8/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: hits = searcher.search(query, new Sort("sid", true));

you don't show where searcher is initialized, and you don't clarify how
you are timing your multiple iterations -- i'm going to guess that you are
opening a new searcher every iteration right?

sorting on a field requires pre-computing an array of information for that
field -- this is both time and space expensive, and is cached per
IndexReader/IndexSearcher -- so if you reuse the same searcher and time
multiple iterations you'll find that the first iteration might be somewhat
slow, but the rest should be very fast.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




Re: Search Performance Problem 16 sec for 250K docs

2006-08-19 Thread Chris Hostetter

: hits = searcher.search(query, new Sort("sid", true));

you don't show where searcher is initialized, and you don't clarify how
you are timing your multiple iterations -- i'm going to guess that you are
opening a new searcher every iteration right?

sorting on a field requires pre-computing an array of information for that
field -- this is both time and space expensive, and is cached per
IndexReader/IndexSearcher -- so if you reuse the same searcher and time
multiple iterations you'll find that the first iteration might be somewhat
slow, but the rest should be very fast.
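
One cheap way to pay that cost up front is to run a throwaway sorted query
right after opening the searcher, before it serves real traffic. A hedged
sketch, reusing the "sid" sort field that appears elsewhere in this thread
(the warm-up term is made up; any query that uses the sort works):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TermQuery;

IndexSearcher searcher = new IndexSearcher(INDEX_PATH);
// forces the FieldCache array for "sid" to be built now,
// not on the first user query
searcher.search(new TermQuery(new Term("tags", "warmup")), new Sort("sid", true));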



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Performance Problem 16 sec for 250K docs

2006-08-19 Thread M A

What I am measuring is this:

Analyzer analyzer = new StandardAnalyzer(new String[]{});

   if(fldArray.length > 1)
   {
 BooleanClause.Occur[] flags = {BooleanClause.Occur.SHOULD,
BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD,
BooleanClause.Occur.SHOULD};
 query = MultiFieldQueryParser.parse(queryString, fldArray, flags,
analyzer); //parse the
   }
   else
   {
 query = new QueryParser("tags", analyzer).parse(queryString);
 System.err.println("QUERY IS " + query.toString());

   }

long ts = System.currentTimeMillis();

   // the timed block covers the sorted search plus term extraction
   hits = searcher.search(query, new Sort("sid", true));
   java.util.Set terms = new java.util.HashSet();
   query.extractTerms(terms);
   Object[] strTerms = terms.toArray();

long ta = System.currentTimeMillis();

System.err.println("Retrieved Hits in " + (ta - ts));

recent figures for this ..

Retrieved Hits in 26974 (msec)
Retrieved Hits in 61415 (msec)

The query itself is just a word, eg "apple"

The index is constructed as follows ..

// this process runs as a daemon process ...

   iwriter = new IndexWriter(INDEX_PATH, new
StandardAnalyzer(new String[]{}), false);
   Document doc = new Document();
   doc.add(new Field("id", String.valueOf(story.getId()),
Field.Store.YES, Field.Index.NO));
   doc.add(new
Field("sid",String.valueOf(story.getDate().getTime()),
Field.Store.YES, Field.Index.UN_TOKENIZED));
   String tags = getTags(story.getId());
   doc.add(new Field("tags", tags, Field.Store.YES,
Field.Index.TOKENIZED));
   doc.add(new Field("headline", story.getHeadline(),
Field.Store.YES, Field.Index.TOKENIZED));
   doc.add(new Field("blurb", story.getBlurb(), Field.Store.YES,
Field.Index.TOKENIZED));
   doc.add(new Field("content", story.getContent(),
Field.Store.YES, Field.Index.TOKENIZED));
   doc.add(new Field("catid", String.valueOf(story.getCategory()),
Field.Store.YES, Field.Index.TOKENIZED));
   iwriter.addDocument(doc);

// then iwriter.close()

optimize just runs once a day, after some deletions.

The tags are selected words .. in total about 20K different ones in
combination ..

so story.getTags() -> Returns a string of the type "XXX YYY ZZZ YYY CCC DDD"
story.getId() -> returns a long
story.sid -> thats a long too
story.getContent() -> returns text in most cases sometimes its blank
story.getHeadline() -> returns text usually about 512 chars
story.getBlurb() -> returns text about 255 chars
story.getCatid() -> returns a long


that covers both sections i.e. the read and the write ..

I did look at Luke, but unfortunately the docs don't seem to refer to a
command-line interface to it (unless I missed something) .. This is running on
a headless box ..

Cheers

Mohammed.



On 8/19/06, Erick Erickson <[EMAIL PROTECTED]> wrote:


This is a looong time, I think you're right, it's excessive.

What are you timing? The time to complete the search (i.e. get a Hits object
back) or the total time to assemble the response? Why I ask is that the Hits
object is designed to return the first 100 or so docs efficiently. Every 100
docs or so, it re-executes the query. So, if you're returning a large result
set, then using the Hits object to iterate over them could account for
your time. Use a HitCollector instead... But do note this from the javadoc
for HitCollector


. For good search performance, implementations of this method should not
call Searcher.doc(int) or IndexReader.document(int) on every document
number encountered. Doing so can slow searches by an order of magnitude
or more.
-

FWIW, I have indexes larger than 1G that return in far less time than you
are indicating, through three layers and constructing web pages in the
meantime. It contains over 800K documents and the response time is around
a
second (haven't timed it lately). This includes 5-way sorts.

You might also either get a copy of Luke and have it explain exactly what
the parse does or use one of the query explain calls (sorry, don't remember
them off the top of my head) to see what query is *actually* submitted and
whether it's what you expect.

Are you using wildcards? They also have an effect on query speed.

If none of this applies, perhaps you could post the query and how the
index is constructed. If you haven't already gotten a copy of Luke, I
heartily recommend it

Hope this helps
Erick

On 8/19/06, M A <[EMAIL PROTECTED]> wrote:
>
> Hi there,
>
> I have an index with about 250K documents, to be indexed full text.
>
> there are 2 types of searches carried out, 1. using 1 field, the other
> using
> 4 .. for a query string ...
>
> given the nature of the queries required, all stop words are maintained
in
> the index, thereby allowing for phrasal queries, (this i

Re: Search Performance Problem 16 sec for 250K docs

2006-08-19 Thread Erick Erickson

This is a looong time, I think you're right, it's excessive.

What are you timing? The time to complete the search (i.e. get a Hits object
back) or the total time to assemble the response? Why I ask is that the Hits
object is designed to return the first 100 or so docs efficiently. Every 100
docs or so, it re-executes the query. So, if you're returning a large result
set, then using the Hits object to iterate over them could account for
your time. Use a HitCollector instead... But do note this from the javadoc
for HitCollector


. For good search performance, implementations of this method should not
call Searcher.doc(int) or IndexReader.document(int) on every document
number encountered. Doing so can slow searches by an order of magnitude
or more.
-
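
For illustration, here is what that looks like with the Lucene 1.9/2.0
HitCollector API: a collector that counts every match and remembers only the
best-scoring doc id, and never calls Searcher.doc(int) inside collect()
(variable names are made up):

final int[] count = { 0 };
final int[] bestDoc = { -1 };
final float[] bestScore = { Float.NEGATIVE_INFINITY };

searcher.search(query, new org.apache.lucene.search.HitCollector() {
    public void collect(int doc, float score) {
        count[0]++;                  // cheap per-hit work only
        if (score > bestScore[0]) {
            bestScore[0] = score;
            bestDoc[0] = doc;
        }
    }
});

// fetch stored fields only for the few docs you actually display:
// Document d = searcher.doc(bestDoc[0]);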

FWIW, I have indexes larger than 1G that return in far less time than you
are indicating, through three layers and constructing web pages in the
meantime. It contains over 800K documents and the response time is around a
second (haven't timed it lately). This includes 5-way sorts.

You might also either get a copy of Luke and have it explain exactly what
the parse does or use one of the query explain calls (sorry, don't remember
them off the top of my head) to see what query is *actually* submitted and
whether it's what you expect.

Are you using wildcards? They also have an effect on query speed.

If none of this applies, perhaps you could post the query and how the
index is constructed. If you haven't already gotten a copy of Luke, I
heartily recommend it

Hope this helps
Erick

On 8/19/06, M A <[EMAIL PROTECTED]> wrote:


Hi there,

I have an index with about 250K documents, to be indexed full text.

there are 2 types of searches carried out, 1. using 1 field, the other
using
4 .. for a query string ...

given the nature of the queries required, all stop words are maintained in
the index, thereby allowing for phrasal queries, (this is a requirement)
..

So search I am using the following ..

if(fldArray.length > 1)
{
  // use four fields
  BooleanClause.Occur[] flags = {BooleanClause.Occur.SHOULD,
BooleanClause.Occur.SHOULD, BooleanClause.Occur.SHOULD,
BooleanClause.Occur.SHOULD};
  query = MultiFieldQueryParser.parse(queryString, fldArray, flags,
analyzer); //parse the
}
else
{
  //use only 1 field
  query = new QueryParser("tags", analyzer).parse(queryString);
}


When i search on the 4 fields the average search time is 16 sec ..
When i search on the 1 field the average search time is 9 secs ...

The Analyzer used for both searching and indexing is
Analyzer analyzer = new StandardAnalyzer(new String[]{});

The index size is about a 1GB ..

The documents vary in size: some are less than 1K, max size is about 5K



Is there anything I can do to make this faster? 16 secs is just not
acceptable ..

Machine : 512MB, celeron 2600 ...  Lucene 2.0

I could go for a bigger machine but wanted to make sure that the problem
was
not something I was doing, given 250K is not that large a figure ..

Please Help

Thanx

Mo




Re: search performance benchmarks

2006-06-27 Thread heritrix . lucene

Hi,

Or Lucene is more like Google in this sense, meaning that the time
doesn't depend on the size of the matched result

I found that it takes a long time if the result set is bigger (up to 25 sec
for 29M results). But for a smaller result set of approx. 10,000 it takes
approx. 200 ms.





On 6/27/06, Vladimir Olenin <[EMAIL PROTECTED]> wrote:


Thanks, Mike. This info is actually quite helpful. What is the 'times 10
rule' you are referring to?

Also, I wonder how Lucene is handling the growth of the result set
returned by the query? In the various search engine implementations I
did myself for several projects, that was one of the things which made
response time grow with the size of the result set. E.g., does this
happen with Lucene or not?:

- query 1, returns 1000 ranked results, exec. time is 0.5s
- query 2, returns 10000 ranked results, exec. time is 0.7s
- query 3, returns 100000 ranked results, exec. time is 1.0s
- query 4, returns 1000000 ranked results, exec. time is 3.0s
- query 5, returns 10000000 ranked results, exec. time is 10.0s

By 'ranked results' I mean you can retrieve 'top X' 'best matched'
documents.

Or Lucene is more like Google in this sense, meaning that the time
doesn't depend on the size of the matched result set and the
implementation can statistically (or somehow else) deduce approximate
size of the full result set, while not actually counting every single
document in the set (eg, 'search query returned _approximately_ 54
million documents').

Yet another question would be what is the best book (if there are more
than one), that can be recommended as an introduction as well as
'in-depth' coverage of the latest version of Lucene?

Thanks everyone for answering this post - your feedback is very helpful!

Vlad

-Original Message-
From: Mike Streeton [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 27, 2006 2:59 AM
To: java-user@lucene.apache.org
Subject: RE: search performance benchmarks

We recently ran some benchmarks on Linux with 4 xeon cpus and 2gb of
heap (not that this was needed). We managed to easily get 1000 term
based queries a second, this including the query execution time and
retrieving the top 10 documents from the index. We did notice some
contention as adding more clients (threads) kept the same average
execution time but increased the max processing time for some queries.
So the addition of clients caused a queue to build up, but the results
were still sub second with 100 clients, simultaneously executing queries
and using the times 10 rule, this would represent 1000 connected users.

Mike

www.ardentia.com the home of NetSearch
-Original Message-
From: Wang, Jeff [mailto:[EMAIL PROTECTED]
Sent: 26 June 2006 19:50
To: java-user@lucene.apache.org
Subject: RE: search performance benchmarks

Performance varies a lot, and depends upon the number of indexes, the
number of fields, and the CPU/memory configuration.  For myself, a 65Gb
source indexed to 1Gb (or so) returns single term queries (oh yeah, the
query makeup also matters a lot) in sub-second time on an Intel dual
processor (each is 3.6GHz, I think).  I frankly haven't tested out
scalability yet.

Jeff
Emptoris, Inc.

-Original Message-
From: Vladimir Olenin [mailto:[EMAIL PROTECTED]
Sent: Monday, June 26, 2006 7:56 AM
To: java-user@lucene.apache.org
Subject: search performance benchmarks

Hi,

I'm evaluating Lucene right now to use as a base for one open source
project. I found some _indexing_ benchmarks on the lucene website
(http://lucene.apache.org/java/docs/benchmarks.html), but, after a short
browsing, couldn't find any 'runtime' performance benchmarks (Query
speed). Only one of the benchmarks contained some reference to the query
execution... Is there any other source of benchmarks I can refer to? Or
probably some heruistic rule that can help to estimate query execution
time?

Thanks.

Vlad

PS: let me know if details of the searched data will help in evaluation
- I'll be able to provide what I know at this point...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




RE: search performance benchmarks

2006-06-27 Thread Vladimir Olenin
Thanks, Mike. This info is actually quite helpful. What is the 'times 10
rule' you are referring to?

Also, I wonder how Lucene is handling the growth of the result set
returned by the query? In the various search engine implementations I
did myself for several projects, that was one of the things which made
response time grow with the size of the result set. E.g., does this
happen with Lucene or not?:

- query 1, returns 1000 ranked results, exec. time is 0.5s
- query 2, returns 10000 ranked results, exec. time is 0.7s
- query 3, returns 100000 ranked results, exec. time is 1.0s
- query 4, returns 1000000 ranked results, exec. time is 3.0s
- query 5, returns 10000000 ranked results, exec. time is 10.0s

By 'ranked results' I mean you can retrieve 'top X' 'best matched'
documents.

Or Lucene is more like Google in this sense, meaning that the time
doesn't depend on the size of the matched result set and the
implementation can statistically (or somehow else) deduce approximate
size of the full result set, while not actually counting every single
document in the set (eg, 'search query returned _approximately_ 54
million documents').

Yet another question would be what is the best book (if there are more
than one), that can be recommended as an introduction as well as
'in-depth' coverage of the latest version of Lucene?

Thanks everyone for answering this post - your feedback is very helpful!

Vlad

-Original Message-
From: Mike Streeton [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 27, 2006 2:59 AM
To: java-user@lucene.apache.org
Subject: RE: search performance benchmarks

We recently ran some benchmarks on Linux with 4 xeon cpus and 2gb of
heap (not that this was needed). We managed to easily get 1000 term
based queries a second, this including the query execution time and
retrieving the top 10 documents from the index. We did notice some
contention as adding more clients (threads) kept the same average
execution time but increased the max processing time for some queries.
So the addition of clients caused a queue to build up, but the results
were still sub second with 100 clients, simultaneously executing queries
and using the times 10 rule, this would represent 1000 connected users.

Mike

www.ardentia.com the home of NetSearch
-Original Message-
From: Wang, Jeff [mailto:[EMAIL PROTECTED]
Sent: 26 June 2006 19:50
To: java-user@lucene.apache.org
Subject: RE: search performance benchmarks

Performance varies a lot, and depends upon the number of indexes, the
number of fields, and the CPU/memory configuration.  For myself, a 65Gb
source indexed to 1Gb (or so) returns single term queries (oh yeah, the
query makeup also matters a lot) in sub-second time on an Intel dual
processor (each is 3.6GHz, I think).  I frankly haven't tested out
scalability yet.

Jeff
Emptoris, Inc.

-Original Message-
From: Vladimir Olenin [mailto:[EMAIL PROTECTED]
Sent: Monday, June 26, 2006 7:56 AM
To: java-user@lucene.apache.org
Subject: search performance benchmarks

Hi,
 
I'm evaluating Lucene right now to use as a base for one open source
project. I found some _indexing_ benchmarks on the lucene website
(http://lucene.apache.org/java/docs/benchmarks.html), but, after a short
browsing, couldn't find any 'runtime' performance benchmarks (Query
speed). Only one of the benchmarks contained some reference to the query
execution... Is there any other source of benchmarks I can refer to? Or
probably some heuristic rule that can help to estimate query execution
time?
 
Thanks.
 
Vlad
 
PS: let me know if details of the searched data will help in evaluation
- I'll be able to provide what I know at this point...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: search performance benchmarks

2006-06-26 Thread Mike Streeton
We recently ran some benchmarks on Linux with 4 xeon cpus and 2gb of
heap (not that this was needed). We managed to easily get 1000 term
based queries a second, this including the query execution time and
retrieving the top 10 documents from the index. We did notice some
contention as adding more clients (threads) kept the same average
execution time but increased the max processing time for some queries.
So the addition of clients caused a queue to build up, but the results
were still sub second with 100 clients, simultaneously executing queries
and using the times 10 rule, this would represent 1000 connected users.

Mike

www.ardentia.com the home of NetSearch
-Original Message-
From: Wang, Jeff [mailto:[EMAIL PROTECTED] 
Sent: 26 June 2006 19:50
To: java-user@lucene.apache.org
Subject: RE: search performance benchmarks

Performance varies a lot, and depends upon the number of indexes, the
number of fields, and the CPU/memory configuration.  For myself, a 65Gb
source indexed to 1Gb (or so) returns single term queries (oh yeah, the
query makeup also matters a lot) in sub-second time on an Intel dual
processor (each is 3.6GHz, I think).  I frankly haven't tested out
scalability yet.

Jeff
Emptoris, Inc.

-Original Message-
From: Vladimir Olenin [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 26, 2006 7:56 AM
To: java-user@lucene.apache.org
Subject: search performance benchmarks

Hi,
 
I'm evaluating Lucene right now to use as a base for one open source
project. I found some _indexing_ benchmarks on the lucene website
(http://lucene.apache.org/java/docs/benchmarks.html), but, after a short
browsing, couldn't find any 'runtime' performance benchmarks (Query
speed). Only one of the benchmarks contained some reference to the query
execution... Is there any other source of benchmarks I can refer to? Or
probably some heuristic rule that can help to estimate query execution
time?
 
Thanks.
 
Vlad
 
PS: let me know if details of the searched data will help in evaluation
- I'll be able to provide what I know at this point...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: search performance benchmarks

2006-06-26 Thread Wang, Jeff
Performance varies a lot, and depends upon the number of indexes, the
number of fields, and the CPU/memory configuration.  For myself, a 65Gb
source indexed to 1Gb (or so) returns single term queries (oh yeah, the
query makeup also matters a lot) in sub-second time on an Intel dual
processor (each is 3.6GHz, I think).  I frankly haven't tested out
scalability yet.

Jeff
Emptoris, Inc.

-Original Message-
From: Vladimir Olenin [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 26, 2006 7:56 AM
To: java-user@lucene.apache.org
Subject: search performance benchmarks

Hi,
 
I'm evaluating Lucene right now to use as a base for one open source
project. I found some _indexing_ benchmarks on the lucene website
(http://lucene.apache.org/java/docs/benchmarks.html), but, after a short
browsing, couldn't find any 'runtime' performance benchmarks (Query
speed). Only one of the benchmarks contained some reference to the query
execution... Is there any other source of benchmarks I can refer to? Or
probably some heuristic rule that can help to estimate query execution
time?
 
Thanks.
 
Vlad
 
PS: let me know if details of the searched data will help in evaluation
- I'll be able to provide what I know at this point...

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search performance degrades by order of magnitude when using SortField.

2006-05-29 Thread Chris Hostetter
: default Sort.RELEVANCE, query response time is ~6ms.  However, when I
: specify a sort, e.g. Searcher.search( query, new Sort( "mydatefield" )
:  ), the query response time gets multiplied by a factor of 10 or 20.
...
: do a top-K ranking over the same number of raw hits.   The performance
: gets disproportionately worse as I increase the number of parallel
: threads that query the same Searcher object.

How many sequential queries are you running against the same Searcher
instance? ... the performance drop you are seeing may be a result of each
of those threads trying to build the same FieldCache on your sort field in
parallel.

being 10x or 20x slower sounds like a lot .. but 10x 6ms is still only
60ms :) .. have you timed how long it takes just to build a FieldCache on
that field?

: Also, in my previous experience with sorting by a field in Lucene, I
: seem to remember there being a preload time when you first search with
: a sort by field, sometimes taking 30 seconds or so to load all of the
: field's values into the in-memory cache associated with the Searcher
: object.  This initial preload time doesn't seem to be happening in my
: case -- does that mean that for some reason Lucene is not caching the
: field values?

that's the FieldCache initialization i was referring to -- it's based on
reusing the same instance of an IndexReader (or IndexSearcher); as long as
you are using the same instance over and over you'll reuse the
FieldCache and only pay that cost once (or maybe N times if you have N
parallel query threads and they all try to hit the FieldCache
immediately).

30 seconds sounds extremely long though ... you may be remembering
incorrectly how significant the penalty was.

: I have an index of 1 million documents, taking up about 1.7G of
: diskspace.  I specify -Xmx2000m when running my java search
: application.

the big issue when sorting on a field is what type of data is in that
field: is it an int? a long? a String? .. if it is a String, how often does
the same String value appear for multiple documents? .. these all affect
how much RAM the FieldCache takes up.  you mentioned sorting by date, did
you store the date as a String? in what format? with what precision?
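
For what it's worth, a common way to keep that cache small is to index the
date with reduced precision as an untokenized keyword, so many documents
share each value. A hedged sketch using DateTools (available in Lucene 1.9+);
myDate and the stored/indexed flags are illustrative:

String day = DateTools.dateToString(myDate, DateTools.Resolution.DAY);
doc.add(new Field("mydatefield", day, Field.Store.NO, Field.Index.UN_TOKENIZED));

// at search time, sorting works as before:
// hits = searcher.search(query, new Sort("mydatefield"));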




-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene trunk update question. WAS RE: search performance enhancement

2005-09-26 Thread Erik Hatcher


On Sep 26, 2005, at 3:10 AM, Paul Elschot wrote:

I used my bug votes already. I hope more people will do that, hint:
http://issues.apache.org/jira/secure/BrowseProject.jspa?id=12310110

Is there a way to view the open issues sorted by number of votes?


There is the "Popular Issues" view:








-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene trunk update question. WAS RE: search performance enhancement

2005-09-26 Thread Paul Elschot
Otis,

On Monday 26 September 2005 00:37, Otis Gospodnetic wrote:
> As Erik Hatcher noted in another email (it might have been on the -dev
> list), we'll go through JIRA before making the next release and try to
> push the patches like this one into the core.  Personally, it has been

I used my bug votes already. I hope more people will do that, hint:
http://issues.apache.org/jira/secure/BrowseProject.jspa?id=12310110

Is there a way to view the open issues sorted by number of votes?

> bugging me to see all these nice contributions sitting outside the core
> all this time, and I know it doesn't make contributors feel good.  I
> may have more time to focus on Lucene in the near future, in which case
> I'll do some of what I described above.

In this case, the code requires skipTo() on all scorers. In the trunk, all
scorers have this. There may be a small query search performance hit
due to this, and the code has not been used widely.
To avoid having a dependence on a possible weak spot in performance,
it might be good to let this filter code wait until that is settled.

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Lucene trunk update question. WAS RE: search performance enhancement

2005-09-25 Thread Otis Gospodnetic
As Erik Hatcher noted in another email (it might have been on the -dev
list), we'll go through JIRA before making the next release and try to
push the patches like this one into the core.  Personally, it has been
bugging me to see all these nice contributions sitting outside the core
all this time, and I know it doesn't make contributors feel good.  I
may have more time to focus on Lucene in the near future, in which case
I'll do some of what I described above.

Otis

--- Paul Elschot <[EMAIL PROTECTED]> wrote:

> On Thursday 22 September 2005 11:45, Peter Gelderbloem wrote:
> > I noticed you posted links to bits of code below.
> > How and when will these be made part of the svn trunk?
> > It seems universally useful and I think should be the default
> behaviour.
> 
> The code has the same licence as the trunk, so you can basically
> use it as you like. Of course, more improvements would be
> appreciated.
> I don't know when/if it will become part of the trunk.
> 
> Regards,
> Paul Elschot
> 
> > Peter Gelderbloem 
> > -Original Message-
> > From: Paul Elschot [mailto:[EMAIL PROTECTED] 
> > Sent: 21 September 2005 19:16
> > To: java-user@lucene.apache.org
> > Subject: Re: search performance enhancement
> > 
> > On Wednesday 21 September 2005 03:29, John Wang wrote:
> > > Hi Paul and other gurus:
> > > 
> > > In a related topic, seems lucene is scoring documents that would
> hit
> > in a 
> > > "prohibited" boolean clause, e.g. NOT field:value. It doesn't
> seem to
> > make 
> > > sense to score a document that is to be excluded from the result.
> Is
> > this a 
> > > difficult thing to fix?
> > 
> > In the svn trunk (the development version) excluded clauses of
> > BooleanQuery are not scored.
> > The default scoring method is so cheap on CPU time that this is
> hardly
> > noticeable.
> > 
> > Also in Paul's earlier comment: "... unless you have large
> indexes,
> > this will 
> > > probably not make much difference", what is "large" in this case.
> 
> > 
> > A BitSet takes one bit per document in the index, and a
> SortedVIntList
> > takes about 1 byte per document passing the filter.
> > So when you need many filters, each passing less than 1/8 of the
> indexed
> > docs, a SortedVIntList takes less memory.
> >  
> > > In our case, say millions of documents match some query, but
> after
> > either a 
> > > Filter is applied or after a NOT query (e.g. query with a
> > NOT/prohibited 
> > > clause) is applied, the resulting hit list has only 10 documents.
> > Seems the 
> > > millions of calls to score() is wasted, some of the score() call
> can
> > be 
> > > computational intensive.
> > 
> > The FilteredQuery posted here:
> > http://issues.apache.org/jira/browse/LUCENE-330
> > will score only the documents passing the filter, for both a BitSet
> and
> > a SortedVIntList filter. Btw. this is the new place for the related
> > filter
> > implementations:
> > http://issues.apache.org/jira/browse/LUCENE-328
> > 
> > > 
> > > Am I on the right track?
> > 
> > That's easy: you're using Lucene.
> > 
> > Regards,
> > Paul Elschot
> > 
> > 
> > > Thanks
> > > 
> > > -John
> > > 
> > > On 8/19/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > > 
> > > > On Friday 19 August 2005 18:09, John Wang wrote:
> > > > > Hi Paul:
> > > > >
> > > > > Thanks for the pointer.
> > > > >
> > > > > How would I extend from the patch you submitted to filter out
> > > > > more documents not using a Filter. e.g.
> > > > >
> > > > > have a class to skip documents based on a docID: boolean
> > > > > isValid(int docID)
> > > > >
> > > > > My problem is I want to discard documents at query time
> without
> > > > > having to construct a BitSet via filter. I have my own memory
> > > > structure to help me skip documents based on the query and a
> docid.
> > > > 
> > > > Basically, you need to implement the next() and skipTo(int
> > targetDoc)
> > > > methods from Scorer on your memory structure. They are somewhat
> > > > redundant (skipTo(doc() + 1) has the same semantics as next(),
> > except
> > > > initially), but that is for a mix of hi

Re: Lucene trunk update question. WAS RE: search performance enhancement

2005-09-22 Thread Paul Elschot
On Thursday 22 September 2005 11:45, Peter Gelderbloem wrote:
> I noticed you posted links to bits of code below.
> How and when will these be made part of the svn trunk?
> It seems universally useful and I think should be the default behaviour.

The code has the same licence as the trunk, so you can basically
use it as you like. Of course, more improvements would be appreciated.
I don't know when/if it will become part of the trunk.

Regards,
Paul Elschot

> Peter Gelderbloem 
> -Original Message-
> From: Paul Elschot [mailto:[EMAIL PROTECTED] 
> Sent: 21 September 2005 19:16
> To: java-user@lucene.apache.org
> Subject: Re: search performance enhancement
> 
> On Wednesday 21 September 2005 03:29, John Wang wrote:
> > Hi Paul and other gurus:
> > 
> > In a related topic, seems lucene is scoring documents that would hit
> in a 
> > "prohibited" boolean clause, e.g. NOT field:value. It doesn't seem to
> make 
> > sense to score a document that is to be excluded from the result. Is
> this a 
> > difficult thing to fix?
> 
> In the svn trunk (the development version) excluded clauses of
> BooleanQuery are not scored.
> The default scoring method is so cheap on CPU time that this is hardly
> noticeable.
> 
> > Also in Paul's earlier comment: "... unless you have large indexes,
> this will 
> > probably not make much difference", what is "large" in this case. 
> 
> A BitSet takes one bit per document in the index, and a SortedVIntList
> takes about 1 byte per document passing the filter.
> So when you need many filters, each passing less than 1/8 of the indexed
> docs, a SortedVIntList takes less memory.
>  
> > In our case, say millions of documents match some query, but after
> either a 
> > Filter is applied or after a NOT query (e.g. query with a
> NOT/prohibited 
> > clause) is applied, the resulting hit list has only 10 documents.
> Seems the 
> > millions of calls to score() is wasted, some of the score() call can
> be 
> > computational intensive.
> 
> The FilteredQuery posted here:
> http://issues.apache.org/jira/browse/LUCENE-330
> will score only the documents passing the filter, for both a BitSet and
> a SortedVIntList filter. Btw. this is the new place for the related
> filter
> implementations:
> http://issues.apache.org/jira/browse/LUCENE-328
> 
> > 
> > Am I on the right track?
> 
> That's easy: you're using Lucene.
> 
> Regards,
> Paul Elschot
> 
> 
> > Thanks
> > 
> > -John
> > 
> > On 8/19/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > 
> > > On Friday 19 August 2005 18:09, John Wang wrote:
> > > > Hi Paul:
> > > >
> > > > Thanks for the pointer.
> > > >
> > > > How would I extend from the patch you submitted to filter out
> > > > more documents not using a Filter. e.g.
> > > >
> > > > have a class to skip documents based on a docID: boolean
> > > > isValid(int docID)
> > > >
> > > > My problem is I want to discard documents at query time without
> > > > having to construct a BitSet via filter. I have my own memory
> > > structure to help me skip documents based on the query and a docid.
> > > 
> > > Basically, you need to implement the next() and skipTo(int
> targetDoc)
> > > methods from Scorer on your memory structure. They are somewhat
> > > redundant (skipTo(doc() + 1) has the same semantics as next(),
> except
> > > initially), but that is for a mix of historical and performance
> reasons.
> > > Have a look at how this is done in the posted FilteredQuery class
> > > for SortedVIntList and BitSet.
> > > With only isValid(docId) these next() and skipTo() methods would
> > > have to count over the document numbers, which is less than ideal.
> > > 
> > > When you use the posted code, iirc it is only necessary to implement
> the
> > > SkipFilter interface on your memory structure. One can use that
> interface
> > > to build/cache such memory structures using an IndexReader, and
> > > from there the DocNrSkipper interface will do the rest (off the top
> > > of my head).
> > > One slight problem with the current Lucene implementation is that
> > > java.lang.BitSet is not an interface.
> > > 
> > > Regards,
> > > Paul Elschot.
> > > 
> > > > Thanks
> > > >
> > > > -John
> > > >
> > > > On 8/16/05, Paul Elschot <[EMAIL PRO

Lucene trunk update question. WAS RE: search performance enhancement

2005-09-22 Thread Peter Gelderbloem
I noticed you posted links to bits of code below.
How and when will these be made part of the svn trunk?
It seems universally useful and I think should be the default behaviour.

Peter Gelderbloem 
-Original Message-
From: Paul Elschot [mailto:[EMAIL PROTECTED] 
Sent: 21 September 2005 19:16
To: java-user@lucene.apache.org
Subject: Re: search performance enhancement

On Wednesday 21 September 2005 03:29, John Wang wrote:
> Hi Paul and other gurus:
> 
> In a related topic, seems lucene is scoring documents that would hit
in a 
> "prohibited" boolean clause, e.g. NOT field:value. It doesn't seem to
make 
> sense to score a document that is to be excluded from the result. Is
this a 
> difficult thing to fix?

In the svn trunk (the development version) excluded clauses of
BooleanQuery are not scored.
The default scoring method is so cheap on CPU time that this is hardly
noticeable.

> Also in Paul's earlier comment: "... unless you have large indexes,
this will 
> probably not make much difference", what is "large" in this case. 

A BitSet takes one bit per document in the index, and a SortedVIntList
takes about 1 byte per document passing the filter.
So when you need many filters, each passing less than 1/8 of the indexed
docs, a SortedVIntList takes less memory.
 
> In our case, say millions of documents match some query, but after
either a 
> Filter is applied or after a NOT query (e.g. query with a
NOT/prohibited 
> clause) is applied, the resulting hit list has only 10 documents.
Seems the 
> millions of calls to score() is wasted, some of the score() call can
be 
> computational intensive.

The FilteredQuery posted here:
http://issues.apache.org/jira/browse/LUCENE-330
will score only the documents passing the filter, for both a BitSet and
a SortedVIntList filter. Btw. this is the new place for the related
filter
implementations:
http://issues.apache.org/jira/browse/LUCENE-328

> 
> Am I on the right track?

That's easy: you're using Lucene.

Regards,
Paul Elschot


> Thanks
> 
> -John
> 
> On 8/19/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > 
> > On Friday 19 August 2005 18:09, John Wang wrote:
> > > Hi Paul:
> > >
> > > Thanks for the pointer.
> > >
> > > How would I extend from the patch you submitted to filter out
> > > more documents not using a Filter. e.g.
> > >
> > > have a class to skip documents based on a docID: boolean
> > > isValid(int docID)
> > >
> > > My problem is I want to discard documents at query time without
> > > having to construct a BitSet via filter. I have my own memory
> > > structure to help me skip documents based on the query and a docid.
> > 
> > Basically, you need to implement the next() and skipTo(int
targetDoc)
> > methods from Scorer on your memory structure. They are somewhat
> > redundant (skipTo(doc() + 1) has the same semantics as next(),
except
> > initially), but that is for a mix of historical and performance
reasons.
> > Have a look at how this is done in the posted FilteredQuery class
> > for SortedVIntList and BitSet.
> > With only isValid(docId) these next() and skipTo() methods would
> > have to count over the document numbers, which is less than ideal.
> > 
> > When you use the posted code, iirc it is only necessary to implement
the
> > SkipFilter interface on your memory structure. One can use that
interface
> > to build/cache such memory structures using an IndexReader, and
> > from there the DocNrSkipper interface will do the rest (off the top
> > of my head).
> > One slight problem with the current Lucene implementation is that
> > java.lang.BitSet is not an interface.
> > 
> > Regards,
> > Paul Elschot.
> > 
> > > Thanks
> > >
> > > -John
> > >
> > > On 8/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > > Hi John,
> > > >
> > > > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > > > Hi:
> > > > >
> > > > > I posted a bug (36147) a few days ago and didn't hear
anything, so
> > > > > I thought I'd try my luck on this list.
> > > > >
> > > > > The idea is to avoid score calculations on documents to be
filtered
> > > > > out anyway. (e.g. via Filter object passed to the searcher
class)
> > > > >
> > > > > This seems to be an easy change.
> > > >
> > > > Have a look here:
> > > > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > > >
> > > > > Also it w

Re: search performance enhancement

2005-09-21 Thread Paul Elschot
On Wednesday 21 September 2005 03:29, John Wang wrote:
> Hi Paul and other gurus:
> 
> In a related topic, seems lucene is scoring documents that would hit in a 
> "prohibited" boolean clause, e.g. NOT field:value. It doesn't seem to make 
> sense to score a document that is to be excluded from the result. Is this a 
> difficult thing to fix?

In the svn trunk (the development version) excluded clauses of
BooleanQuery are not scored.
The default scoring method is so cheap on CPU time that this is hardly
noticeable.

> Also in Paul's earlier comment: "... unless you have large indexes, this will 
> probably not make much difference", what is "large" in this case. 

A BitSet takes one bit per document in the index, and a SortedVIntList
takes about 1 byte per document passing the filter.
So when you need many filters, each passing less than 1/8 of the indexed
docs, a SortedVIntList takes less memory.
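
A quick worked example of that 1/8 break-even, with made-up sizes:

// 8 million docs in the index, 500K of them passing the filter
int maxDoc   = 8000000;
int matching = 500000;

int bitSetBytes   = maxDoc / 8;   // 1000000 bytes, independent of matches
int vIntListBytes = matching;     // ~500000 bytes, grows with matches

// the SortedVIntList wins exactly when matching < maxDoc / 8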
 
> In our case, say millions of documents match some query, but after either a 
> Filter is applied or after a NOT query (e.g. query with a NOT/prohibited 
> clause) is applied, the resulting hit list has only 10 documents. Seems the 
> millions of calls to score() is wasted, some of the score() call can be 
> computational intensive.

The FilteredQuery posted here:
http://issues.apache.org/jira/browse/LUCENE-330
will score only the documents passing the filter, for both a BitSet and
a SortedVIntList filter. Btw. this is the new place for the related filter
implementations:
http://issues.apache.org/jira/browse/LUCENE-328

> 
> Am I on the right track?

That's easy: you're using Lucene.

Regards,
Paul Elschot


> Thanks
> 
> -John
> 
> On 8/19/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > 
> > On Friday 19 August 2005 18:09, John Wang wrote:
> > > Hi Paul:
> > >
> > > Thanks for the pointer.
> > >
> > > How would I extend from the patch you submitted to filter out
> > > more documents not using a Filter. e.g.
> > >
> > > have a class to skip documents based on a docID: boolean
> > > isValid(int docID)
> > >
> > > My problem is I want to discard documents at query time without
> > > having to construct a BitSet via filter. I have my own memory
> > > structure to help me skip documents based on the query and a docid.
> > 
> > Basically, you need to implement the next() and skipTo(int targetDoc)
> > methods from Scorer on your memory structure. They are somewhat
> > redundant (skipTo(doc() + 1) has the same semantics as next(), except
> > initially), but that is for a mix of historical and performance reasons.
> > Have a look at how this is done in the posted FilteredQuery class
> > for SortedVIntList and BitSet.
> > With only isValid(docId) these next() and skipTo() methods would
> > have to count over the document numbers, which is less than ideal.
> > 
> > When you use the posted code, iirc it is only necessary to implement the
> > SkipFilter interface on your memory structure. One can use that interface
> > to build/cache such memory structures using an IndexReader, and
> > from there the DocNrSkipper interface will do the rest (off the top of
> > my head).
> > One slight problem with the current Lucene implementation is that
> > java.lang.BitSet is not an interface.
> > 
> > Regards,
> > Paul Elschot.
> > 
> > > Thanks
> > >
> > > -John
> > >
> > > On 8/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > > Hi John,
> > > >
> > > > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > > > Hi:
> > > > >
> > > > > I posted a bug (36147) a few days ago and didn't hear anything, so
> > > > > I thought I'd try my luck on this list.
> > > > >
> > > > > The idea is to avoid score calculations on documents to be filtered
> > > > > out anyway. (e.g. via Filter object passed to the searcher class)
> > > > >
> > > > > This seems to be an easy change.
> > > >
> > > > Have a look here:
> > > > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > > >
> > > > > Also it would be nice to expose a method to return a score given a
> > > > > docid, e.g.
> > > > >
> > > > > float getScore(int docid)
> > > > >
> > > > > on the Scorer class.
> > > >
> > > > skipTo(int docid) and score() will do that.
> > > >
> > > > > I am gonna make the change locally and do some performance analysis
> > > > > on it and will post some numbers later.
> > > >
> > > > The default score computations are mostly table lookups, and pretty 
> > fast.
> > > > So, unless you have large indexes, this will probably not make
> > > > much difference, but any performance improvement is welcome.
> > > > In larger indexes, it helps to use skipTo() while searching.
> > > >
> > > > Regards,
> > > > Paul Elschot
> > > >
> > > >
> > > > -
> > > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > > For additional commands, e-mail: [EMAIL PROTECTED]
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > >

Re: search performance enhancement

2005-09-20 Thread John Wang
Hi Paul and other gurus:

In a related topic, seems lucene is scoring documents that would hit in a 
"prohibited" boolean clause, e.g. NOT field:value. It doesn't seem to make 
sense to score a document that is to be excluded from the result. Is this a 
difficult thing to fix?

Also in Paul's earlier comment: "... unless you have large indexes, this will 
probably not make much difference", what is "large" in this case. 

In our case, say millions of documents match some query, but after either a 
Filter is applied or after a NOT query (e.g. query with a NOT/prohibited 
clause) is applied, the resulting hit list has only 10 documents. Seems the 
millions of calls to score() is wasted, some of the score() call can be 
computational intensive.

Am I on the right track?

Thanks

-John

On 8/19/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> 
> On Friday 19 August 2005 18:09, John Wang wrote:
> > Hi Paul:
> >
> > Thanks for the pointer.
> >
> > How would I extend from the patch you submitted to filter out
> > more documents not using a Filter. e.g.
> >
> > have a class to skip documents based on a docID: boolean
> > isValid(int docID)
> >
> > My problem is I want to discard documents at query time without
> > having to construct a BitSet via filter. I have my own memory
> structure to help me skip documents based on the query and a docid.
> 
> Basically, you need to implement the next() and skipTo(int targetDoc)
> methods from Scorer on your memory structure. They are somewhat
> redundant (skipTo(doc() + 1) has the same semantics as next(), except
> initially), but that is for a mix of historical and performance reasons.
> Have a look at how this is done in the posted FilteredQuery class
> for SortedVIntList and BitSet.
> With only isValid(docId) these next() and skipTo() methods would
> have to count over the document numbers, which is less than ideal.
> 
> When you use the posted code, iirc it is only necessary to implement the
> SkipFilter interface on your memory structure. One can use that interface
> to build/cache such memory structures using an IndexReader, and
> from there the DocNrSkipper interface will do the rest (off the top of
> my head).
> One slight problem with the current Lucene implementation is that
> java.lang.BitSet is not an interface.
> 
> Regards,
> Paul Elschot.
> 
> > Thanks
> >
> > -John
> >
> > On 8/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > > Hi John,
> > >
> > > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > > Hi:
> > > >
> > > > I posted a bug (36147) a few days ago and didn't hear anything, so
> > > > I thought I'd try my luck on this list.
> > > >
> > > > The idea is to avoid score calculations on documents to be filtered
> > > > out anyway. (e.g. via Filter object passed to the searcher class)
> > > >
> > > > This seems to be an easy change.
> > >
> > > Have a look here:
> > > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > >
> > > > Also it would be nice to expose a method to return a score given a
> > > > docid, e.g.
> > > >
> > > > float getScore(int docid)
> > > >
> > > > on the Scorer class.
> > >
> > > skipTo(int docid) and score() will do that.
> > >
> > > > I am gonna make the change locally and do some performance analysis
> > > > on it and will post some numbers later.
> > >
> > > The default score computations are mostly table lookups, and pretty 
> fast.
> > > So, unless you have large indexes, this will probably not make
> > > much difference, but any performance improvement is welcome.
> > > In larger indexes, it helps to use skipTo() while searching.
> > >
> > > Regards,
> > > Paul Elschot
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> >
> >
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>


Re: search performance enhancement

2005-08-19 Thread Paul Elschot
On Friday 19 August 2005 18:09, John Wang wrote:
> Hi Paul:
> 
>   Thanks for the pointer.
> 
>How would I extend from the patch you submitted to filter out
> more documents not using a Filter. e.g.
> 
>   have a class to skip documents based on a docID: boolean
> isValid(int docID)
> 
>   My problem is I want to discard documents at query time without
> having to construct a BitSet via filter. I have my own memory
> structure to help me skip documents based on the query and a docid.

Basically, you need to implement the next() and skipTo(int targetDoc)
methods from Scorer on your memory structure. They are somewhat
redundant (skipTo(doc() + 1) has the same semantics as next(), except
initially), but that is for a mix of historical and performance reasons.
Have a look at how this is done in the posted FilteredQuery class
for SortedVIntList and BitSet.
With only isValid(docId) these next() and skipTo() methods would
have to count over the document numbers, which is less than ideal.
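
To make that contract concrete, a toy illustration (not the posted patch, and
not a real Scorer subclass) of next() and skipTo() over a sorted array of
matching doc ids:

class SortedDocIdIterator {
    private final int[] docs; // matching doc ids, sorted ascending
    private int pos = -1;

    SortedDocIdIterator(int[] sortedDocs) { this.docs = sortedDocs; }

    int doc() { return docs[pos]; }

    boolean next() { return ++pos < docs.length; }

    // advance past the current doc to the first doc id >= target
    boolean skipTo(int target) {
        while (++pos < docs.length) {
            if (docs[pos] >= target) return true;
        }
        return false;
    }
}

With only isValid(docId) a caller would have to step through every document
number; skipTo() lets it jump straight to the next candidate.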

When you use the posted code, iirc it is only necessary to implement the
SkipFilter interface on your memory structure. One can use that interface
to build/cache such memory structures using an IndexReader, and
from there the DocNrSkipper interface will do the rest (off the top of
my head).
One slight problem with the current Lucene implementation is that
java.lang.BitSet is not an interface.

Regards,
Paul Elschot.

> Thanks
> 
> -John
> 
> On 8/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> > Hi John,
> > 
> > On Wednesday 17 August 2005 04:46, John Wang wrote:
> > > Hi:
> > >
> > >I posted a bug (36147) a few days ago and didn't hear anything, so
> > > I thought I'd try my luck on this list.
> > >
> > >The idea is to avoid score calculations on documents to be filtered
> > > out anyway. (e.g. via Filter object passed to the searcher class)
> > >
> > >This seems to be an easy change.
> > 
> > Have a look here:
> > http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> > 
> > >Also it would be nice to expose a method to return a score given a
> > > docid, e.g.
> > >
> > >float getScore(int docid)
> > >
> > >on the Scorer class.
> > 
> > skipTo(int docid) and score() will do that.
> > 
> > >I am gonna make the change locally and do some performance analysis
> > > on it and will post some numbers later.
> > 
> > The default score computations are mostly table lookups, and pretty fast.
> > So, unless you have large indexes, this will probably not make
> > much difference, but any performance improvement is welcome.
> > In larger indexes, it helps to use skipTo() while searching.
> > 
> > Regards,
> > Paul Elschot
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> >
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search performance enhancement

2005-08-19 Thread John Wang
Hi Paul:

  Thanks for the pointer.

   How would I extend from the patch you submitted to filter out
more documents not using a Filter. e.g.

  have a class to skip documents based on a docID: boolean
isValid(int docID)

  My problem is I want to discard documents at query time without
having to construct a BitSet via filter. I have my own memory
structure to help me skip documents based on the query and a docid.

Thanks

-John

On 8/16/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> Hi John,
> 
> On Wednesday 17 August 2005 04:46, John Wang wrote:
> > Hi:
> >
> >I posted a bug (36147) a few days ago and didn't hear anything, so
> > I thought I'd try my luck on this list.
> >
> >The idea is to avoid score calculations on documents to be filtered
> > out anyway. (e.g. via Filter object passed to the searcher class)
> >
> >This seems to be an easy change.
> 
> Have a look here:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=32965
> 
> >Also it would be nice to expose a method to return a score given a
> > docid, e.g.
> >
> >float getScore(int docid)
> >
> >on the Scorer class.
> 
> skipTo(int docid) and score() will do that.
> 
> >I am gonna make the change locally and do some performance analysis
> > on it and will post some numbers later.
> 
> The default score computations are mostly table lookups, and pretty fast.
> So, unless you have large indexes, this will probably not make
> much difference, but any performance improvement is welcome.
> In larger indexes, it helps to use skipTo() while searching.
> 
> Regards,
> Paul Elschot
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: search performance enhancement

2005-08-16 Thread Paul Elschot
Hi John,

On Wednesday 17 August 2005 04:46, John Wang wrote:
> Hi:
> 
>I posted a bug (36147) a few days ago and didn't hear anything, so
> I thought I'd try my luck on this list.
> 
>The idea is to avoid score calculations on documents to be filtered
> out anyway. (e.g. via Filter object passed to the searcher class)
> 
>This seems to be an easy change.

Have a look here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=32965

>Also it would be nice to expose a method to return a score given a
> docid, e.g.
> 
>float getScore(int docid)
> 
>on the Scorer class.

skipTo(int docid) and score() will do that.
 
>I am gonna make the change locally and do some performance analysis
> on it and will post some numbers later.

The default score computations are mostly table lookups, and pretty fast.
So, unless you have large indexes, this will probably not make
much difference, but any performance improvement is welcome.
In larger indexes, it helps to use skipTo() while searching.
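
A rough illustration of why skipTo() helps on larger indexes: a conjunction
over two iterators with the doc()/next()/skipTo() contract sketched earlier
in this digest (a and b here are hypothetical Scorer-like objects) can
leapfrog between required terms instead of visiting every document either
term matches:

boolean more = a.next() && b.next();
while (more) {
    if (a.doc() == b.doc()) {
        // both required terms match a.doc(); score it here
        more = a.next() && b.next();
    } else if (a.doc() < b.doc()) {
        more = a.skipTo(b.doc()); // jump over docs that can't match
    } else {
        more = b.skipTo(a.doc());
    }
}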

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance under high load

2005-04-07 Thread David Spencer
Yura Smolsky wrote:
Hello, mark.
mh> 2) My app uses long queries, some of which include
mh> very common terms. Using the "MoreLikeThis" query to
mh> drop common terms drastically improved performance. If
mh> your "killer queries" are long ones you could spot
mh> them and service them with a MoreLikeThis or simply
mh> limit the number of allowed terms in the query string.
Can you please explain what is "MoreLikeThis" query in the Lucene?
Not sure exactly how Mark is using it, but several of us worked on the 
MoreLikeThis query similarity generator:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/
Various javadoc links etc here:
http://searchmorph.com/weblog/index.php?id=44
Yura Smolsky.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Search performance under high load

2005-04-07 Thread mark harwood
In addition to the comments already made, I recently
found these changes to be useful:

1) Swapping out Sun 1.4.2_05 JVM for BEA's JRockit JVM
halved my query times. (In both cases did not tweak
any default JVM settings other than -Xmx to ensure
adequate memory allocation). 

2) My app uses long queries, some of which include
very common terms. Using the "MoreLikeThis" query to
drop common terms drastically improved performance. If
your "killer queries" are long ones you could spot
them and service them with a MoreLikeThis or simply
limit the number of allowed terms in the query string.
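
For reference, a hedged sketch of the contrib MoreLikeThis usage Mark
describes; the class, field name, and tuning values below are made up for
illustration. MoreLikeThis keeps only the highest-scoring terms from the
input text, so very common (low-IDF) terms fall out of the generated query:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class CommonTermDropper {
    public static Query trim(IndexReader reader, String longQueryString)
            throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "content" }); // made-up field name
        mlt.setMaxQueryTerms(20); // cap how many terms survive
        return mlt.like(new StringReader(longQueryString));
    }
}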


Cheers
Mark

Send instant messages to your online friends http://uk.messenger.yahoo.com 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance under high load

2005-04-07 Thread Paul Elschot
Daniel,

On Thursday 07 April 2005 00:54, Chris Hostetter wrote:
> 
> : Queries: The query strings are of highly differing complexity, from
> : simple x:y to long queries involving conjunctions, disjunctions and
> : wildcard queries.
> :
> : 90% of the queries run brilliantly. Problem is that 10% of the queries
> : (simple or not) take a long time, on average more that 10 seconds,
> : sometimes several minutes.
> 
> without knowing the nature of the queries, these numbers are not outside
> the realm of possibility.  there have been examples on the list in the
> last few days of how BooleanQueries constructed with deep nesting have
> particularly bad performance.
> 
> I would suggest you timing logs to your Search code so that you get one
> log line per search executed telling you:
> 
> 1) the time of day the search was executed
> 2) the total time taken by the Searcher.search(Query) call
> 3) the Query.toString() of the search.
> 4) the Hits.length() of the result.
> 5) any tracking information to help you identify where the search came
>from (ie: canned search from a category listing page, user entered
>freeform text, your RSS feed generator, etc...)
> 
> This will help you determine:
> 
>  a) is there a common element to the structure of queries that take more
> than a certain amount of time?

In general, disjunctions (truncations, fuzzy queries) are slow, and
conjunctions (required terms, filters) are faster.

>  b) are the slow queries clustered by time of day? is anything else
> happening on that box during that time?
>  c) are the "slow" queries all resulting in a high number of Hits?
>  d) are the slow searches all originating from a single source? (ie: are
> the queries needed by category listing pages all really slow) can
> they be re-implemented differently?
>  e) is there anything else the slow queries have in common?

I think your case is CPU bound, so you have a few options:
- use more CPU's,
- get in touch with the 'power' users (via the logs as suggested by Chris)
  and find out if there are simple measures you can take to help performance
  for them. For example, replacing a range that is used repeatedly with a
  cached filter can be quite effective.
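
Something like the following, for instance (a sketch; the field and bounds
are made up, and depending on your Lucene version you can use
CachingWrapperFilter or a QueryFilter, which also caches its bits):

    // Build the filter once, keep it around, and pass it to every search
    // that needs the range; the bits are cached per IndexReader.
    Filter range = new RangeFilter("date", "20050101", "20051231", true, true);
    Filter cachedRange = new CachingWrapperFilter(range);
    Hits hits = searcher.search(userQuery, cachedRange);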

Regards,
Paul Elschot


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance under high load

2005-04-06 Thread David Spencer
Daniel Herlitz wrote:
Hi everybody,
We have been using Lucene for about one year now with great success. 
Recently, though, the index has grown noticeably and so has the number of 
searches. I was wondering if anyone would like to comment on these 
figures and say if it works for them?

Index size: ~2.5 GB, on disk
Number of fields: ~30
Number of indexed fields: ~10
Server: Linux, Intel(R) Xeon(TM) CPU 3.00GHz, 3GB, dedicated to Lucene 
searches.
Java: Sun 1.5, -Xmx1200m
For perf tuning on 1.4+ VMs I always try these flags too:
-server
-XX:CompileThreshold=100
-Xverify:none
And also worth considering is giving a -Xms value equal to -Xmx.
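Put together, an invocation might look like this (the heap figure is just
the -Xmx value from Daniel's post, and "SearchServer" stands in for
whatever your main class is):

    java -server -Xms1200m -Xmx1200m -XX:CompileThreshold=100 -Xverify:none SearchServer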

Load: Approaching 2000 requests / hour.
Queries: The query strings are of highly differing complexity, from 
simple x:y to long queries involving conjunctions, disjunctions and 
wildcard queries.

90% of the queries run brilliantly. Problem is that 10% of the queries 
(simple or not) take a long time, on average more than 10 seconds, 
sometimes several minutes.

We have managed to track down these figures to the calls to 
IndexSearcher.search(Query). We have seen up to about 10 searches 
concurrently executing.

We have tried to run the server on different machines and with different 
versions of Java. We have no OutOfMemoryErrors.

I am curious about what to expect from Lucene when it comes to 
searching. There are lots of figures about the indexing speed (no 
question about that, it's incredibly fast!). But what about searching? 
And searching with the kind of load we have. Anyone in the same 
situation as we are? Comments? Suggestions?
Well in a benchmark I was doing recently fuzzy queries were the problem 
in the mix I had - but to be fair, a fuzzy search is really just a big 
query, as it expands the query into all "similar" terms.

Also of interest is what's the problem w/ the long running queries - are 
they slowing down the response time for the other users w/ shorter 
queries?

I've never done this, but you could consider a thread pool to execute 
the queries, and once a query takes more than, say, a second, you lower 
its priority.

Also, I'd have a rule like no more than "n" slow queries can run at 
once, so you queue up slow queries if there are lots of them executing.
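
That rule can be as small as a fair semaphore (a sketch assuming Java 5's
java.util.concurrent; how a query gets classified as "slow" - term count,
past timings - is left to the caller):

    // Shared, application-wide: at most maxSlow "slow" searches run at
    // once; extra ones block here in FIFO order until a permit frees up.
    Semaphore slowPermits = new Semaphore(maxSlow, true);

    Hits searchSlow(Searcher searcher, Query query)
            throws IOException, InterruptedException {
        slowPermits.acquire();
        try {
            return searcher.search(query);
        } finally {
            slowPermits.release();
        }
    }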


Thanks
Daniel

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search performance under high load

2005-04-06 Thread Chris Hostetter

: Queries: The query strings are of highly differing complexity, from
: simple x:y to long queries involving conjunctions, disjunctions and
: wildcard queries.
:
: 90% of the queries run brilliantly. Problem is that 10% of the queries
: (simple or not) take a long time, on average more than 10 seconds,
: sometimes several minutes.

without knowing the nature of the queries, these numbers are not outside
the realm of possibility.  there have been examples on the list in the
last few days of how BooleanQueries constructed with deep nesting have
particularly bad performance.

I would suggest you add timing logs to your search code (see the sketch
after the lists below) so that you get one log line per search executed
telling you:

1) the time of day the search was executed
2) the total time taken by the Searcher.search(Query) call
3) the Query.toString() of the search.
4) the Hits.length() of the result.
5) any tracking information to help you identify where the search came
   from (ie: canned search from a category listing page, user entered
   freeform text, your RSS feed generator, etc...)

This will help you determine:

 a) is there a common element to the structure of queries that take more
than a certain amount of time?
 b) are the slow queries clustered by time of day? is anything else
happening on that box during that time?
 c) are the "slow" queries all resulting in a high number of Hits?
 d) are the slow searches all originating from a single source? (ie: are
the queries needed by category listing pages all really slow) can
they be re-implemented differently?
 e) is there anything else the slow queries have in common?
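
A wrapper that captures points 1-5 might look like this (a sketch;
"source" is whatever tag identifies the caller, and where the line goes -
log4j, stdout - is up to you):

    Hits timedSearch(Searcher searcher, Query query, String source)
            throws IOException {
        long start = System.currentTimeMillis();
        Hits hits = searcher.search(query);
        long elapsed = System.currentTimeMillis() - start;
        // one line per search: time of day, elapsed ms, query, hits, source
        System.out.println(new Date() + " " + elapsed + "ms"
            + " query=\"" + query + "\" hits=" + hits.length()
            + " source=" + source);
        return hits;
    }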



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]