Re: search performance

2014-06-03 Thread Toke Eskildsen
On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]

 With NRT enabled, search speed is roughly 5 minutes on average.
 The server resources are: 
 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark






Re: search performance

2014-06-03 Thread Jamie

Toke

Thanks for the comment.

Unfortunately, in this instance, it is a live production system, so we 
cannot conduct experiments. The number is definitely accurate.


We have many different systems with a similar load that observe the same 
performance issue. To my knowledge, the Lucene integration code is 
fairly well optimized.


I've requested access to the indexes so that we can perform further testing.

Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:

On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]


With NRT enabled, search speed is roughly 5 minutes on average.
The server resources are:
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark






Re: search performance

2014-06-03 Thread Christoph Kaser
Can you take thread stacktraces (repeatedly) during those 5-minute
searches? That might give you (or someone on the mailing list) a clue
where all that time is spent.
You could try using jstack for that: 
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html
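If attaching jstack to the production JVM is awkward, a rough in-process
alternative is to sample stacks with ThreadMXBean (a minimal sketch of my
own, not from your code; the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Start this in a background thread inside the server while a slow
// search runs; the frames that keep reappearing are where the time goes.
public final class StackSampler implements Runnable {
    @Override
    public void run() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 10; i++) {
            for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
                System.out.println(info.getThreadName() + " " + info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
            try {
                Thread.sleep(1000); // sample once per second
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}

Usage would be something like: new Thread(new StackSampler()).start();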


Regards
Christoph

On 03.06.2014 08:17, Jamie wrote:

Toke

Thanks for the comment.

Unfortunately, in this instance, it is a live production system, so we 
cannot conduct experiments. The number is definitely accurate.


We have many different systems with a similar load that observe the 
same performance issue. To my knowledge, the Lucene integration code 
is fairly well optimized.


I've requested access to the indexes so that we can perform further 
testing.


Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote:

On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:

[200GB, 150M documents]


With NRT enabled, search speed is roughly 5 minutes on average.
The server resources are:
2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not
see a hardware upgrade changing that with the fine machine you're using.

What is your search speed if you disable continuous updates?

When you restart the searcher, how long does the first search take?


- Toke Eskildsen, State and University Library, Denmark









Re: search performance

2014-06-03 Thread Toke Eskildsen
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
 Unfortunately, in this instance, it is a live production system, so we 
 cannot conduct experiments. The number is definitely accurate.
 
 We have many different systems with a similar load that observe the same 
 performance issue. To my knowledge, the Lucene integration code is 
 fairly well optimized.

It is possible that the extreme slowness is a combination of factors,
but with a bit of luck it will boil down to a single thing. Standard
procedure is to disable features until it performs well, so

- Disable running updates
- Limit page size
- Limit lookup of returned fields
- Disable highlighting
- Simpler queries
- Whatever else you might think of

At some point along the way I would expect a sharp increase in
performance.

 I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark






Re: search performance

2014-06-03 Thread Jamie

Toke

Thanks for the contact. See below:

On 2014/06/03, 9:17 AM, Toke Eskildsen wrote:

On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:

Unfortunately, in this instance, it is a live production system, so we
cannot conduct experiments. The number is definitely accurate.

We have many different systems with a similar load that observe the same
performance issue. To my knowledge, the Lucene integration code is
fairly well optimized.

It is possible that the extreme slowness is a combination of factors,
but with a bit of luck it will boil down to a single thing. Standard
procedure is to disable features until it performs well, so

- Disable running updates

No can do.

- Limit page size

Done this.

- Limit lookup of returned fields

Done this.

- Disable highlighting

No highlighting.

- Simpler queries

They are as simple as possible.

- Whatever else you might think of

Our application has been using Lucene for seven years. It has been 
constantly optimized over that period.


I'll conduct further testing...


At some point along the way I would expect a sharp increase in
performance.


I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark









Re: search performance

2014-06-03 Thread Vitaly Funstein
Something doesn't quite add up.

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max,true,
 false, false, true);

 We use pagination, so only returning 1000 documents or so at a time.


You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in pagination
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning the top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
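For illustration, the searchAfter flavor of paging looks roughly like this
(a sketch only; PAGE_SIZE and MAX_PAGES are made-up names):

ScoreDoc last = null;
for (int page = 0; page < MAX_PAGES; page++) {
    // first page via search(); later pages continue after the previous
    // page's last ScoreDoc instead of re-collecting everything before it
    TopDocs docs = (last == null)
            ? searcher.search(query, PAGE_SIZE)
            : searcher.searchAfter(last, query, PAGE_SIZE);
    if (docs.scoreDocs.length == 0) break;
    // ... process docs.scoreDocs ...
    last = docs.scoreDocs[docs.scoreDocs.length - 1];
}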


Re: search performance

2014-06-03 Thread Jamie

Vitaly

Thanks for the contribution. Unfortunately, we cannot use Lucene's 
pagination function, because in reality the user can skip pages to start 
the search at any point, not just from the end of the previous search. 
Even the
first search (without any pagination), with a max of 1000 hits, takes 5 
minutes to complete.


Regards

Jamie
On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:

Something doesn't quite add up.

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max,true,

false, false, true);

We use pagination, so only returning 1000 documents or so at a time.



You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in pagination
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning the top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)







Re: search performance

2014-06-03 Thread Rob Audenaerde
Hi Jamie,

What is included in the 5 minutes?

Just the call to the searcher?

searcher.search(...) ?

Can you show a bit more of the code you use?



On Tue, Jun 3, 2014 at 11:32 AM, Jamie ja...@mailarchiva.com wrote:

 Vitaly

 Thanks for the contribution. Unfortunately, we cannot use Lucene's
 pagination function, because in reality the user can skip pages to start
 the search at any point, not just from the end of the previous search. Even
 the
 first search (without any pagination), with a max of 1000 hits, takes 5
 minutes to complete.

 Regards

 Jamie

 On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:

 Something doesn't quite add up.

 TopFieldCollector fieldCollector = TopFieldCollector.create(sort,
 max,true,

 false, false, true);

 We use pagination, so only returning 1000 documents or so at a time.


  You say you are using pagination, yet the API you are using to create your
 collector isn't how you would utilize Lucene's built-in pagination
 feature (unless I misunderstand the API). If the max in the snippet above is
 1000, then you're simply returning the top 1000 docs every time you execute
 your search. Otherwise... well, could you actually post a bit more of your
 code that runs the search here, in particular?

 Assuming that the max is much larger than 1000, however, you could call
 fieldCollector.topDocs(int, int) after accumulating hits using this
 collector, but this won't work multiple times per query execution,
 according to the javadoc. So you either have to re-execute the full
 search,
 and then get the next chunk of ScoreDocs, or use the proper API for this,
 one that accepts as a parameter the end of the previous page of results,
 i.e. IndexSearcher.searchAfter(ScoreDoc, ...)







Re: search performance

2014-06-03 Thread Jamie

Sure... see below:

protected void search(Query query, Filter queryFilter, Sort sort)
        throws BlobSearchException {
    try {
        logger.debug("start search {searchquery='" + getSearchQuery() +
                "',query='" + query.toString() + "',filterQuery='" + queryFilter +
                "',sort='" + sort + "'}");

        Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
        results.clear();

        int max;

        if (getPagination()) {
            max = start + length;
        } else {
            max = getMaxResults();
        }

        // release the old volume searchers
        IndexReader indexReader = initIndexReader();
        searcher = new IndexSearcher(indexReader, executor);
        TopFieldCollector fieldCollector =
                TopFieldCollector.create(sort, max, true, false, false, true);

        searcher.search(query, queryFilter, fieldCollector);

        TopDocs topDocs;

        if (getPagination()) {
            topDocs = fieldCollector.topDocs(start, length);
        } else {
            topDocs = fieldCollector.topDocs();
        }

        int count = 0;
        for (int i = 0; i < topDocs.scoreDocs.length; i++) {
            // stop after maxResults overall, or after one page when paginating
            if ((getMaxResults() > 0 && count > getMaxResults()) ||
                    (getPagination() && count++ >= length)) { break; }

            results.add(topDocs.scoreDocs[i]);
        }

        totalHits = fieldCollector.getTotalHits();

        logger.debug("search executed successfully {query='" + getSearchQuery() +
                "',returnedresults='" + results.size() + "'}");

    } catch (Exception io) {
        throw new BlobSearchException("failed to execute search query {searchquery='" +
                getSearchQuery() + "'}", io, logger,
                ChainedException.Level.DEBUG);
    }
}
On 2014/06/03, 11:41 AM, Rob Audenaerde wrote:

Hi Jamie,

What is included in the 5 minutes?

Just the call to the searcher?

searcher.search(...) ?

Can you show a bit more of the code you use?



On Tue, Jun 3, 2014 at 11:32 AM, Jamie ja...@mailarchiva.com wrote:


Vitaly

Thanks for the contribution. Unfortunately, we cannot use Lucene's
pagination function, because in reality the user can skip pages to start
the search at any point, not just from the end of the previous search. Even
the
first search (without any pagination), with a max of 1000 hits, takes 5
minutes to complete.

Regards

Jamie

On 2014/06/03, 10:54 AM, Vitaly Funstein wrote:


Something doesn't quite add up.

TopFieldCollector fieldCollector = TopFieldCollector.create(sort,
max,true,


false, false, true);

We use pagination, so only returning 1000 documents or so at a time.


  You say you are using pagination, yet the API you are using to create your
collector isn't how you would utilize Lucene's built-in pagination
feature (unless I misunderstand the API). If the max in the snippet above is
1000, then you're simply returning the top 1000 docs every time you execute
your search. Otherwise... well, could you actually post a bit more of your
code that runs the search here, in particular?

Assuming that the max is much larger than 1000, however, you could call
fieldCollector.topDocs(int, int) after accumulating hits using this
collector, but this won't work multiple times per query execution,
according to the javadoc. So you either have to re-execute the full
search,
and then get the next chunk of ScoreDocs, or use the proper API for this,
one that accepts as a parameter the end of the previous page of results,
i.e. IndexSearcher.searchAfter(ScoreDoc, ...)






Re: search performance

2014-06-03 Thread Jamie

FYI: We are also using a multireader to search over multiple index readers.

Search under a million documents yields good response times. When you 
get into the 60M territory, search slows to a crawl.


On 2014/06/03, 11:47 AM, Jamie wrote:

Sure... see below:






Re: search performance

2014-06-03 Thread Vitaly Funstein
A couple of questions.

1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and overall performance of the
system as a whole. I would remove that line.

2. This seems suspicious:

if (getPagination()) {
max = start + length;
} else {
max = getMaxResults();
}

If start is at 100M, and length is 1000 - what do you think Lucene will try
and do when you pass this max to the collector?
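For reference (assuming Lucene 4.x semantics): TopFieldCollector sizes its
priority queue from that max up front, so the deep-page cost is paid on
every search:

// With start = 100,000,000 and length = 1000, max becomes ~100M, and the
// collector allocates, then maintains, a priority queue of that size.
int max = start + length;
TopFieldCollector fieldCollector =
        TopFieldCollector.create(sort, max, true, false, false, true);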


On Tue, Jun 3, 2014 at 2:55 AM, Jamie ja...@mailarchiva.com wrote:

 FYI: We are also using a multireader to search over multiple index readers.

 Search under a million documents yields good response times. When you get
 into the 60M territory, search slows to a crawl.

 On 2014/06/03, 11:47 AM, Jamie wrote:

 Sure... see below:







Re: search performance

2014-06-03 Thread Jamie

Vitaly

See below:

On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:

A couple of questions.

1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and overall performance of the
system as a whole. I would remove that line.
Yes,  you are right to be worried about this, especially since thread 
priorities behave differently on different platforms.




2. This seems suspicious:

if (getPagination()) {
 max = start + length;
 } else {
 max = getMaxResults();
 }

If start is at 100M, and length is 1000 - what do you think Lucene will try
and do when you pass this max to the collector?
I don't see the problem here. The collector will collect from zero up to max 
results. I agree that, from a performance perspective, it's not ideal to 
return all results from the beginning of the search, but the Lucene API 
leaves us with no choice. I simply do not know the ScoreDoc to start from. If I 
did keep a record of it, then I would need to store all scoredocs for 
the entire result set. When there are 60M+ results, this can be 
problematic in terms of memory consumption. It would be far nicer if 
there was a searchAfter function that took a position as an integer.


Regards

Jamie









Re: search performance

2014-06-03 Thread Robert Muir
Check and make sure you are not opening an indexreader for every
search. Be sure you don't do that.

On Mon, Jun 2, 2014 at 2:51 AM, Jamie ja...@mailarchiva.com wrote:
 Greetings

 Despite following all the recommended optimizations (as described at
 http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of our
 installations, search performance has reached the point where it is
 unacceptably slow. For instance, in one environment, the total index size is
 200GB, with 150 million documents indexed. With NRT enabled, search speed is
 roughly 5 minutes on average. The server resources are: 2x6 Core Intel CPU,
 128GB, 2 SSD for index and RAID 0, with Linux.

 The only thing we haven't yet done is to upgrade Lucene from 4.7.x to
 4.8.x. Is this likely to make any noticeable difference in performance?

 Clearly, longer term, we need to move to a distributed search model. We
 thought to take advantage of the distributed search features offered in
 Solr, however, our solution is very tightly integrated into Lucene directly
 (since Solr didn't exist when we started out). Moving to Solr now seems like
 a daunting prospect. We've also been following the Katta project with interest,
 but it doesn't appear to support distributed indexing, and development on it
 seems to have stalled. It would be nice if there were a distributed search
 project on the Lucene level that we could use.

 I realize this is a rather vague question, but are there any further
 suggestions on ways to improve search performance? We need cheap and dirty
 ideas, as well as longer term advice on a possible path forward.

 Much appreciated

 Jamie




Re: search performance

2014-06-03 Thread Vitaly Funstein
Jamie,

What if you were to forget for a moment the whole pagination idea, and
always capped your search at 1000 results for testing purposes only? This
is just to try and pinpoint the bottleneck here; if, regardless of the
query parameters, the search latency stays roughly the same and well below
5 min, you now have the answer - the problem is your naive implementation
of pagination which results in snowballing result numbers and search times,
the closer you get to the end of the results range. Otherwise, I would
focus on your query and filter next.


On Tue, Jun 3, 2014 at 3:21 AM, Jamie ja...@mailarchiva.com wrote:

 Vitaly

 See below:


 On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:

 A couple of questions.

 1. What are you trying to achieve by setting the current thread's priority
 to max possible value? Is it grabbing as much CPU time as possible? In my
 experience, mucking with thread priorities like this is at best futile,
 and
 at worst quite detrimental to responsiveness and overall performance of
 the
 system as a whole. I would remove that line.

 Yes,  you are right to be worried about this, especially since thread
 priorities behave differently on different platforms.



 2. This seems suspicious:

 if (getPagination()) {
  max = start + length;
  } else {
  max = getMaxResults();
  }

 If start is at 100M, and length is 1000 - what do you think Lucene will
 try
 and do when you pass this max to the collector?

 I don't see the problem here. The collector will collect from zero up to max
 results. I agree that, from a performance perspective, it's not ideal to
 return all results from the beginning of the search, but the Lucene API
 leaves us with no choice. I simply do not know the ScoreDoc to start from. If I
 did keep a record of it, then I would need to store all scoredocs for the
 entire result set. When there are 60M+ results, this can be problematic in
 terms of memory consumption. It would be far nicer if there was a
 searchAfter function that took a position as an integer.

 Regards

 Jamie









Fwd: Reader reopen

2014-06-03 Thread Gergő Törcsvári
Hello,

If I have an AtomicReader and an IndexSearcher, can I reopen the index to
get the new documents?
Like here:
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html#reopen%28%29
Is there any workaround?

Thanks,
Gergő

P.S.: I accidentally sent this to the general list too. Sorry for that.


Re: Reader reopen

2014-06-03 Thread Michael McCandless
Sure, just use DirectoryReader.openIfChanged.

Mike McCandless

http://blog.mikemccandless.com
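A minimal sketch of the reopen step (assuming reader is the DirectoryReader
you opened earlier):

DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
if (newReader != null) {            // null means the index is unchanged
    reader.close();                 // release the old reader
    reader = newReader;
    searcher = new IndexSearcher(reader);
}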


On Tue, Jun 3, 2014 at 6:36 AM, Gergő Törcsvári
torcsvari.ge...@gmail.com wrote:
 Hello,

 If I have an AtomicReader and an IndexSearcher, can I reopen the index to
 get the new documents?
 Like here:
 http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html#reopen%28%29
 Is there any workaround?

 Thanks,
 Gergő

 P.S.: I accidentally sent this to the general list too. Sorry for that.




Re: search performance

2014-06-03 Thread Jamie

Vitaly / Robert

I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes. 
Unless I am mistaken, the Lucene library's pagination mechanism makes 
the assumption that you will cache the scoredocs for the entire result 
set. This is not practical when you have a result set that exceeds 60M. 
As stated earlier, in any case, it is the first query that is slow.


We do open index readers on each search, since we are using NRT search 
and documents are being added to the indexes on a continuous basis. When 
the user clicks on the Search button, the user will expect to see the 
latest result set. With regard to NRT search, my understanding is that 
we do need to open the index readers on each search operation to see 
the latest changes.


Thus, on each search, we combine the index readers into a multireader, 
and open each reader based on their corresponding writer.


protected IndexReader initIndexReader() {
    List<IndexReader> readers = new LinkedList<IndexReader>();
    for (Writer writer : writers) {
        // open a fresh NRT reader against each writer (applying deletes)
        readers.add(DirectoryReader.open(writer, true));
    }
    return new MultiReader(readers.toArray(new IndexReader[readers.size()]), true);
}

Thank you for your ideas/suggestions.

Regards

Jamie
On 2014/06/03, 12:29 PM, Vitaly Funstein wrote:

Jamie,

What if you were to forget for a moment the whole pagination idea, and
always capped your search at 1000 results for testing purposes only? This
is just to try and pinpoint the bottleneck here; if, regardless of the
query parameters, the search latency stays roughly the same and well below
5 min, you now have the answer - the problem is your naive implementation
of pagination which results in snowballing result numbers and search times,
the closer you get to the end of the results range. Otherwise, I would
focus on your query and filter next.







Re: search performance

2014-06-03 Thread Robert Muir
No, you are incorrect. The point of a search engine is to return top-N
most relevant.

If you insist you need to open an indexreader on every single search,
and then return huge amounts of docs, maybe you should use a database
instead.

On Tue, Jun 3, 2014 at 6:42 AM, Jamie ja...@mailarchiva.com wrote:
 Vitaly / Robert

 I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
 Unless I am mistaken, the Lucene library's pagination mechanism makes the
 assumption that you will cache the scoredocs for the entire result set. This
 is not practical when you have a result set that exceeds 60M. As stated
 earlier, in any case, it is the first query that is slow.

 We do open index readers on each search, since we are using NRT search and
 documents are being added to the indexes on a continuous basis. When the user
 clicks on the Search button, the user will expect to see the latest result set.
 With regard to NRT search, my understanding is that we do need to open the
 index readers on each search operation to see the latest changes.

 Thus, on each search, we combine the index readers into a multireader, and
 open each reader based on their corresponding writer.

 protected IndexReader initIndexReader() {
     List<IndexReader> readers = new LinkedList<IndexReader>();
     for (Writer writer : writers) {
         readers.add(DirectoryReader.open(writer, true));
     }
     return new MultiReader(readers.toArray(new IndexReader[readers.size()]), true);
 }

 Thank you for your ideas/suggestions.

 Regards

 Jamie

 On 2014/06/03, 12:29 PM, Vitaly Funstein wrote:

 Jamie,

 What if you were to forget for a moment the whole pagination idea, and
 always capped your search at 1000 results for testing purposes only? This
 is just to try and pinpoint the bottleneck here; if, regardless of the
 query parameters, the search latency stays roughly the same and well below
 5 min, you now have the answer - the problem is your naive implementation
 of pagination which results in snowballing result numbers and search
 times,
 the closer you get to the end of the results range. Otherwise, I would
 focus on your query and filter next.






Re: search performance

2014-06-03 Thread Jamie

Robert

Hmmm... why did Mike go to all the trouble of implementing NRT search 
if we are not supposed to be using it?


The user simply wants the latest result set. To me, this doesn't appear 
out of scope for the Lucene project.


Jamie

On 2014/06/03, 1:17 PM, Robert Muir wrote:

No, you are incorrect. The point of a search engine is to return top-N
most relevant.

If you insist you need to open an indexreader on every single search,
and then return huge amounts of docs, maybe you should use a database
instead.

On Tue, Jun 3, 2014 at 6:42 AM, Jamie ja...@mailarchiva.com wrote:

Vitaly / Robert

I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
Unless I am mistaken, the Lucene library's pagination mechanism makes the
assumption that you will cache the scoredocs for the entire result set. This
is not practical when you have a result set that exceeds 60M. As stated
earlier, in any case, it is the first query that is slow.

We do open index readers on each search, since we are using NRT search and
documents are being added to the indexes on a continuous basis. When the user
clicks on the Search button, the user will expect to see the latest result set.
With regard to NRT search, my understanding is that we do need to open the
index readers on each search operation to see the latest changes.

Thus, on each search, we combine the index readers into a multireader, and
open each reader based on their corresponding writer.

protected IndexReader initIndexReader() {
    List<IndexReader> readers = new LinkedList<IndexReader>();
    for (Writer writer : writers) {
        readers.add(DirectoryReader.open(writer, true));
    }
    return new MultiReader(readers.toArray(new IndexReader[readers.size()]), true);
}

Thank you for your ideas/suggestions.

Regards

Jamie






Re: search performance

2014-06-03 Thread Jamie

Robert

FYI: I've modified the code to utilize the experimental function...

DirectoryReader dirReader =
    DirectoryReader.openIfChanged(cachedDirectoryReader, writer, true);

In this case, a new IndexReader won't be opened on each search, unless 
absolutely necessary: openIfChanged returns null when the index is 
unchanged, so the cached reader can be reused.


Regards

Jamie

On 2014/06/03, 1:25 PM, Jamie wrote:






Re: search performance

2014-06-03 Thread Robert Muir
Reopening for every search is not a good idea. This will have an
extremely high cost (not as high as what you are doing with paging,
but still not good).

Instead, consider making it near-real-time by doing the reopen every
second or so. Look at SearcherManager for code that helps you do
this.
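A minimal sketch of that pattern (assuming a single IndexWriter; the class
and method names here are illustrative):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

final class NrtSearch {
    private final SearcherManager manager;

    NrtSearch(IndexWriter writer) throws IOException {
        // true = apply deletes when refreshing
        manager = new SearcherManager(writer, true, new SearcherFactory());
    }

    // call this from a background thread, e.g. once per second
    void refresh() throws IOException {
        manager.maybeRefresh();
    }

    TopDocs search(Query query, int n) throws IOException {
        IndexSearcher s = manager.acquire();
        try {
            return s.search(query, n);
        } finally {
            manager.release(s); // never close the searcher yourself
        }
    }
}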

On Tue, Jun 3, 2014 at 7:25 AM, Jamie ja...@mailarchiva.com wrote:
 Robert

 Hmmm. why did Mike go to all the trouble of implementing NRT search, if
 we are not supposed to be using it?

 The user simply wants the latest result set. To me, this doesn't appear out
 of scope for the Lucene project.

 Jamie


 On 2014/06/03, 1:17 PM, Robert Muir wrote:

 No, you are incorrect. The point of a search engine is to return top-N
 most relevant.

 If you insist you need to open an indexreader on every single search,
 and then return huge amounts of docs, maybe you should use a database
 instead.

 On Tue, Jun 3, 2014 at 6:42 AM, Jamie ja...@mailarchiva.com wrote:

 Vitaly / Robert

 I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
 Unless I am mistaken, the Lucene library's pagination mechanism makes the
 assumption that you will cache the scoredocs for the entire result set.
 This is not practical when you have a result set that exceeds 60M. As stated
 earlier, in any case, it is the first query that is slow.

 We do open index readers on each search, since we are using NRT search and
 documents are being added to the indexes on a continuous basis. When the user
 clicks on the Search button, the user will expect to see the latest result set.
 With regard to NRT search, my understanding is that we do need to open the
 index readers on each search operation to see the latest changes.

 Thus, on each search, we combine the index readers into a multireader, and
 open each reader based on their corresponding writer.

 protected IndexReader initIndexReader() {
     List<IndexReader> readers = new LinkedList<IndexReader>();
     for (Writer writer : writers) {
         readers.add(DirectoryReader.open(writer, true));
     }
     return new MultiReader(readers.toArray(new IndexReader[readers.size()]), true);
 }

 Thank you for your ideas/suggestions.

 Regards

 Jamie






Re: search performance

2014-06-03 Thread Jamie
Robert: Thanks, I've already done a similar thing. Results on my test 
platform are encouraging...


On 2014/06/03, 2:41 PM, Robert Muir wrote:

Reopening for every search is not a good idea. this will have an
extremely high cost (not as high as what you are doing with paging
but still not good).

Instead consider making it near-realtime, by doing this every second
or so instead. Look at SearcherManager for code that helps you do
this.







Re: search performance

2014-06-03 Thread Jon Stewart
With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:

  void ensureTopDocs(final int rank) throws IOException {
    if (StartDocIndex > rank) {
      // requested rank precedes the current window: restart from the top
      Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
      StartDocIndex = 0;
    }
    int len = Docs.scoreDocs.length;
    while (StartDocIndex + len <= rank) {
      StartDocIndex += len;
      Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1],
          SearchQuery, TOP_DOCS_WINDOW);
      len = Docs.scoreDocs.length;
    }
  }

StartDocIndex is a member variable denoting the current rank of the
first item in the TopDocs (Docs) window. I call this function before
each Document retrieval. The common case--of the user looking at the
first page of results or the user advancing to the next page--is quite
fast. But it still supports random access, albeit not in constant
time. OTOH, if your app is concurrent, most search queries will
probably be returned very quickly so the odd query that wants to jump
deep into the result set will have more of the server's resources
available to it.

Also, given the size of your result sets, you have to allocate a lot
of memory upfront which will then get gc'd after some time. From query
to query, you will have a decent amount of memory churn. This isn't
free. My guess is using Lucene's linear (search() and searchAfter())
pagination will perform faster than your current approach just based
upon not having to create such large arrays.

I'm not the Lucene expert that Robert is, but this has worked alright for me.

cheers,

Jon


On Tue, Jun 3, 2014 at 8:47 AM, Jamie ja...@mailarchiva.com wrote:
 Robert. Thanks, I've already done a similar thing. Results on my test
 platform are encouraging..


 On 2014/06/03, 2:41 PM, Robert Muir wrote:

 Reopening for every search is not a good idea. this will have an
 extremely high cost (not as high as what you are doing with paging
 but still not good).

 Instead consider making it near-realtime, by doing this every second
 or so instead. Look at SearcherManager for code that helps you do
 this.







-- 
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA




Re: search performance

2014-06-03 Thread Jamie

Thanks Jon

I'll investigate your idea further.

It would be nice if, in future, the Lucene API could provide a 
searchAfter that takes a position (int).


Regards

Jamie

On 2014/06/03, 3:24 PM, Jon Stewart wrote:

With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:

   void ensureTopDocs(final int rank) throws IOException {
     if (StartDocIndex > rank) {
       Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
       StartDocIndex = 0;
     }
     int len = Docs.scoreDocs.length;
     while (StartDocIndex + len <= rank) {
       StartDocIndex += len;
       Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1],
           SearchQuery, TOP_DOCS_WINDOW);
       len = Docs.scoreDocs.length;
     }
   }

StartDocIndex is a member variable denoting the current rank of the
first item in TopDocs (Docs) window. I call this function before
each Document retrieval. The common case--of the user looking at the
first page of results or the user advancing to the next page--is quite
fast. But it still supports random access, albeit not in constant
time. OTOH, if your app is concurrent, most search queries will
probably be returned very quickly so the odd query that wants to jump
deep into the result set will have more of the server's resources
available to it.

Also, given the size of your result sets, you have to allocate a lot
of memory upfront which will then get gc'd after some time. From query
to query, you will have a decent amount of memory churn. This isn't
free. My guess is using Lucene's linear (search() and searchAfter())
pagination will perform faster than your current approach just based
upon not having to create such large arrays.

I'm not the Lucene expert that Robert is, but this has worked alright for me.







RE: search performance

2014-06-03 Thread Toke Eskildsen
Jamie [ja...@mailarchiva.com] wrote:
 It would be nice if, in future, the Lucene API could provide a
 searchAfter that takes a position (int).

It would not really help with large result sets. At least not with the current 
underlying implementations. This is tied into your current performance problem, 
if I understand it correctly.

We seem to have isolated your performance problems to large (10M+) result sets, 
right?

Requesting the top X results in Lucene works internally by adding to a Priority 
Queue. The problem with PQs is that they work really well for small result sets 
and really bad for large result sets (note that "result set" refers to the 
collected documents, not to the amount of matching documents). PQs rearrange 
the internal structure each time a hit is entered that has a score >= the 
lowest known score. With millions of documents in the result set, this happens 
all the time. Abstractly there is little difference between small result sets 
and large: O(n * log n) is fine scaling. In reality the rearrangements of the 
internal heap structure only work well when the heap is in the CPU cache.
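In java.util terms the access pattern looks roughly like this (a toy of my 
own, not the Lucene or luso code; needs a generous heap for large n):

import java.util.PriorityQueue;
import java.util.Random;

// Keeps the top n of m random scores in a bounded min-heap; every
// competitive hit triggers a sift, which is what hurts at large n.
public final class TopNToy {
    public static void main(String[] args) {
        int m = 10_000_000, n = 1_000_000;
        PriorityQueue<Float> pq = new PriorityQueue<>(n); // head = lowest score
        Random rnd = new Random(0);
        for (int i = 0; i < m; i++) {
            float score = rnd.nextFloat();
            if (pq.size() < n) {
                pq.add(score);
            } else if (score >= pq.peek()) { // the "score >= lowest" case above
                pq.poll();
                pq.add(score);
            }
        }
        System.out.println("lowest retained score: " + pq.peek());
    }
}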


To test this, I created the tiny project https://github.com/tokee/luso 

It simulates the workflow (for an extremely loose value of 'simulates') you 
described with extraction of a large result set by filling a PQ of a given size 
with docIDs (ints) and scores (floats) and then extracting the ordered docIDs. 
Running it with different sizes shows how the PQ deteriorates on a 4 core i7 
with 8MB level 2 cache:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 1 1000 1 100 100 500 1000 2000 3000 4000"

Starting 1 threads with extraction method pq
   1,000 docs in mean  15 ms,66 docs/ms.
  10,000 docs in mean  47 ms,   212 docs/ms.
 100,000 docs in mean  65 ms, 1,538 docs/ms.
 500,000 docs in mean 385 ms, 1,298 docs/ms.
   1,000,000 docs in mean 832 ms, 1,201 docs/ms.
   5,000,000 docs in mean   7,566 ms,   660 docs/ms.
  10,000,000 docs in mean  16,482 ms,   606 docs/ms.
  20,000,000 docs in mean  39,481 ms,   506 docs/ms.
  30,000,000 docs in mean  80,293 ms,   373 docs/ms.
  40,000,000 docs in mean 109,537 ms,   365 docs/ms.

As can be seen, relative performance (docs/ms) drops significantly when the 
document count increases. To add insult to injury, this deterioration pattern 
is optimistic, as the test was the only heavy job on my computer. Running 4 of 
these tests in parallel (1 per core) we would ideally expect about the same 
speed, but instead we get

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 4 1000 1 10 50 100 500 1000 2000 3000 4000"

Starting 4 threads with extraction method pq
   1,000 docs in mean  34 ms,29 docs/ms.
  10,000 docs in mean  70 ms,   142 docs/ms.
 100,000 docs in mean 102 ms,   980 docs/ms.
 500,000 docs in mean   1,340 ms,   373 docs/ms.
   1,000,000 docs in mean   2,564 ms,   390 docs/ms.
   5,000,000 docs in mean  19,464 ms,   256 docs/ms.
  10,000,000 docs in mean  49,985 ms,   200 docs/ms.
  20,000,000 docs in mean 112,321 ms,   178 docs/ms.
(I got tired of waiting and stopped after 20M docs)

The conclusion seems clear enough: Using PQ for millions of results will take a 
long time.


So what can be done? I added an alternative implementation where all the docIDs 
and scores are collected in two parallel arrays, then merge-sorted after 
collection. That gave the results

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 1 1000 1 10 50 100 500 1000 2000 3000 4000"
Starting 1 threads with extraction method ip
   1,000 docs in mean  15 ms,66 docs/ms.
  10,000 docs in mean  52 ms,   192 docs/ms.
 100,000 docs in mean  73 ms, 1,369 docs/ms.
 500,000 docs in mean 363 ms, 1,377 docs/ms.
   1,000,000 docs in mean 780 ms, 1,282 docs/ms.
   5,000,000 docs in mean   4,634 ms, 1,078 docs/ms.
  10,000,000 docs in mean   9,708 ms, 1,030 docs/ms.
  20,000,000 docs in mean  20,818 ms,   960 docs/ms.
  30,000,000 docs in mean  32,413 ms,   925 docs/ms.
  40,000,000 docs in mean  44,235 ms,   904 docs/ms.

Notice how the deterioration of relative speed is a lot less than for PQ. 
Running this with 4 threads gets us

 MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 4 1000 1 10 50 100 500 1000 2000 3000 4000"
Starting 4 threads with extraction method ip
   1,000 docs in mean  35 ms,28 docs/ms.
  10,000 docs in mean 221 ms,45 docs/ms.
 100,000 docs in mean 162 ms,   617 docs/ms.
 500,000 docs in mean 639 ms,   782 docs/ms.
   1,000,000 docs in mean   1,388 ms,   720 docs/ms.
   5,000,000 docs in mean   8,372 ms,   597 docs/ms.
  10,000,000 docs in mean  17,933 ms,   557 docs/ms.
  20,000,000 docs in mean  36,031 ms,   555 docs/ms.
  30,000,000 docs in mean  58,257 ms,   514 docs/ms.
  40,000,000 
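In code, the parallel-arrays alternative is roughly this (my own toy 
variant, not the actual luso implementation): pack each (score, docID) 
pair into one long, sort once, then read out the top hits.

import java.util.Arrays;

final class PackedTopDocs {
    // Returns the docIDs of the n highest-scoring hits.
    static int[] top(int[] docIDs, float[] scores, int n) {
        long[] packed = new long[docIDs.length];
        for (int i = 0; i < docIDs.length; i++) {
            int bits = Float.floatToIntBits(scores[i]);
            bits ^= (bits >> 31) | 0x80000000;      // make float bits sort numerically
            packed[i] = ((long) bits << 32) | (docIDs[i] & 0xFFFFFFFFL);
        }
        Arrays.sort(packed);                         // one sort instead of per-hit sifting
        int[] top = new int[Math.min(n, packed.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = (int) packed[packed.length - 1 - i]; // highest scores at the end
        }
        return top;
    }
}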

How to approach indexing source code?

2014-06-03 Thread Johan Tibell
Hi,

I'd like to index (Haskell) source code. I've run the source code through a
compiler (GHC) to get rich information about each token (its type, fully
qualified name, etc) that I want to index (and later use when ranking).

I'm wondering how to approach indexing source code. I can see two possible
approaches:

 * Create a file containing all the metadata and write a custom
tokenizer/analyzer that processes the file. The file could use a simple
line-based format:

myFunction,1:12-1:22,my-package,defined-here,more-metadata
myFunction,5:11-5:21,my-package,used-here,more-metadata
...

The tokenizer would use CharTermAttribute to write the function name,
OffsetAttribute to write the source span, etc.

 * Use an IndexWriter to create a Document directly, as done here:
http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene so I can't quite tell which approach is more likely to
work well. Which way would you recommend?

Other things I'd like to do that might influence the answer:

 - Index several tokens at the same position, so I can index both the fully
qualified name (e.g. module.myFunction) and unqualified name (e.g.
myFunction) for a term.
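For that last point, roughly what I have in mind (a sketch against the
Lucene 4.x attribute API; all names are made up):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Emits "module.myFunction" then "myFunction" at the same position
// (synonym-style), with the 1:12-1:22 span mapped to character offsets.
final class QualifiedNameStream extends TokenStream {
    private final CharTermAttribute term = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offset = addAttribute(OffsetAttribute.class);
    private final PositionIncrementAttribute posIncr =
            addAttribute(PositionIncrementAttribute.class);
    private final String[] names = { "module.myFunction", "myFunction" };
    private int upto = 0;

    @Override
    public boolean incrementToken() {
        if (upto == names.length) return false;
        clearAttributes();
        term.setEmpty().append(names[upto]);
        offset.setOffset(11, 21);                        // character offsets of the span
        posIncr.setPositionIncrement(upto == 0 ? 1 : 0); // 0 = stack on previous token
        upto++;
        return true;
    }
}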

-- Johan


Re: How to approach indexing source code?

2014-06-03 Thread Jack Krupansky
The first question for any search app should always be: How do you intend to 
query the data? That will in large part determine how you should index the 
data.


IOW, how do you intend to use the data? Be specific.

Provide some sample queries and then work backwards to how the data needs to 
be indexed.


-- Jack Krupansky

-----Original Message----- 
From: Johan Tibell

Sent: Tuesday, June 3, 2014 9:32 PM
To: java-user@lucene.apache.org
Subject: How to approach indexing source code?

Hi,

I'd like to index (Haskell) source code. I've run the source code through a
compiler (GHC) to get rich information about each token (its type, fully
qualified name, etc) that I want to index (and later use when ranking).

I'm wondering how to approach indexing source code. I can see two possible
approaches:

* Create a file containing all the metadata and write a custom
tokenizer/analyzer that processes the file. The file could use a simple
line-based format:

myFunction,1:12-1:22,my-package,defined-here,more-metadata
myFunction,5:11-5:21,my-package,used-here,more-metadata
...

The tokenizer would use CharTermAttribute to write the function name,
OffsetAttribute to write the source span, etc.

* Use an IndexWriter to create a Document directly, as done here:
http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene so I can't quite tell which approach is more likely to
work well. Which way would you recommend?

Other things I'd like to do that might influence the answer:

- Index several tokens at the same position, so I can index both the fully
qualified name (e.g. module.myFunction) and unqualified name (e.g.
myFunction) for a term.

-- Johan 


