On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:
[200GB, 150M documents]
With NRT enabled, searches take roughly 5 minutes on average.
The server resources are:
2x 6-core Intel CPU, 128GB RAM, 2 SSDs for the index and RAID 0, running Linux.
5 minutes is extremely long. Is that really the right number?
Toke
Thanks for the comment.
Unfortunately, in this instance, it is a live production system, so we
cannot conduct experiments. The number is definitely accurate.
We have many different systems with a similar load that observe the same
performance issue. To my knowledge, the Lucene
Can you take thread stacktraces (repeatedly) during those 5 minute
searches? That might give you (or someone on the mailing list) a clue
where all that time is spent.
You could try using jstack for that:
http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html
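Where attaching jstack to the live JVM is not an option, a comparable sample can be taken from inside the process. The following is only a minimal sketch using the JDK's Thread.getAllStackTraces(); the class and method names (StackSampler, dumpOnce, sample) are hypothetical and not part of any code shown in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class StackSampler {
    /** Formats one snapshot of every live thread's name, state and stack frames. */
    static String dumpOnce() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    /** Takes up to `count` snapshots, sleeping `intervalMillis` between them. */
    static List<String> sample(int count, long intervalMillis) {
        List<String> dumps = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            dumps.add(dumpOnce());
            if (i + 1 < count) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException ex) {
                    // Preserve the interrupt and stop sampling early.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return dumps;
    }
}
```

Frames that show up in most of the snapshots taken during a slow search are the likeliest place where the time is going.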
Regards
Christoph
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
Unfortunately, in this instance, it is a live production system, so we
cannot conduct experiments. The number is definitely accurate.
We have many different systems with a similar load that observe the same
performance issue. To my knowledge,
Toke
Thanks for the contact. See below:
On 2014/06/03, 9:17 AM, Toke Eskildsen wrote:
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
Unfortunately, in this instance, it is a live production system, so we
cannot conduct experiments. The number is definitely accurate.
We have many different
Something doesn't quite add up.
TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true,
false, false, true);
We use pagination, so only returning 1000 documents or so at a time.
You say you are using pagination, yet the API you are using to create your
collector isn't how
Vitaly
Thanks for the contribution. Unfortunately, we cannot use Lucene's
pagination function, because in reality the user can skip pages to start
the search at any point, not just from the end of the previous search.
Even the
first search (without any pagination), with a max of 1000 hits,
Hi Jamie,
What is included in the 5 minutes?
Just the call to the searcher?
searcher.search(...)?
Can you show a bit more of the code you use?
On Tue, Jun 3, 2014 at 11:32 AM, Jamie ja...@mailarchiva.com wrote:
Vitaly
Thanks for the contribution. Unfortunately, we cannot use Lucene's
Sure... see below:
protected void search(Query query, Filter queryFilter, Sort sort)
throws BlobSearchException {
try {
logger.debug("start search {searchquery='" +
getSearchQuery() +
"',query='" + query.toString() + "',filterQuery='" + queryFilter + "',sort='" + sort + "'}");
FYI: We are also using a multireader to search over multiple index readers.
Search under a million documents yields good response times. When you
get into the 60M territory, search slows to a crawl.
On 2014/06/03, 11:47 AM, Jamie wrote:
Sure... see below:
A couple of questions.
1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and
Vitaly
See below:
On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:
A couple of questions.
1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is
Check and make sure you are not opening an indexreader for every
search. Be sure you don't do that.
On Mon, Jun 2, 2014 at 2:51 AM, Jamie ja...@mailarchiva.com wrote:
Greetings
Despite following all the recommended optimizations (as described at
Jamie,
What if you were to forget for a moment the whole pagination idea, and
always capped your search at 1000 results for testing purposes only? This
is just to try and pinpoint the bottleneck here; if, regardless of the
query parameters, the search latency stays roughly the same and well below
Hello,
If I have an AtomicReader and an IndexSearcher, can I reopen the index to
get the new documents?
Like here:
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html#reopen%28%29
Is there any workaround?
Thanks,
Gergő
P.S.: I accidentally sent this to general
Sure, just use DirectoryReader.openIfChanged.
Mike McCandless
http://blog.mikemccandless.com
On Tue, Jun 3, 2014 at 6:36 AM, Gergő Törcsvári
torcsvari.ge...@gmail.com wrote:
Hello,
If I have an AtomicReader and an IndexSearcher, can I reopen the index to
get the new documents?
Like here:
Vitaly / Robert
I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes.
Unless I am mistaken, the Lucene library's pagination mechanism makes
the assumption that you will cache the scoredocs for the entire result
set. This is not practical when you have a result set that
No, you are incorrect. The point of a search engine is to return top-N
most relevant.
If you insist you need to open an indexreader on every single search,
and then return huge amounts of docs, maybe you should use a database
instead.
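The top-N point can be made concrete with a small sketch in plain Java. This is not Lucene's collector code: TopNSketch, topN and page are hypothetical names, and plain integers stand in for scored hits. The assumption being illustrated is that a top-N collector keeps only a bounded priority queue, so serving a page that starts at a given offset means keeping offset + pageSize entries in that queue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class TopNSketch {
    /** Returns the n highest scores, best first, using a min-heap capped at n entries. */
    static List<Integer> topN(int[] scores, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest kept score sits on top
        for (int s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (!heap.isEmpty() && s > heap.peek()) {
                heap.poll(); // evict the weakest kept hit
                heap.add(s);
            }
        }
        List<Integer> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }

    /** Returns one page; note that the heap must hold offset + pageSize entries. */
    static List<Integer> page(int[] scores, int offset, int pageSize) {
        List<Integer> top = topN(scores, offset + pageSize);
        if (offset >= top.size()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(top.subList(offset, top.size()));
    }
}
```

In this sketch the first page of 1000 needs a queue of 1000 entries, while a page starting at position 10,000,000 needs a queue of more than ten million, which is why jumping to an arbitrary deep page gets expensive.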
On Tue, Jun 3, 2014 at 6:42 AM, Jamie ja...@mailarchiva.com
Robert
Hmmm. Why did Mike go to all the trouble of implementing NRT search,
if we are not supposed to be using it?
The user simply wants the latest result set. To me, this doesn't appear
out of scope for the Lucene project.
Jamie
On 2014/06/03, 1:17 PM, Robert Muir wrote:
No, you are
Robert
FYI: I've modified the code to utilize the experimental function:
DirectoryReader dirReader =
DirectoryReader.openIfChanged(cachedDirectoryReader,writer, true);
In this case, the IndexReader won't be opened on each search, unless
absolutely necessary.
Regards
Jamie
On
Reopening for every search is not a good idea. This will have an
extremely high cost (not as high as what you are doing with paging
but still not good).
Instead consider making it near-realtime, by doing this every second
or so instead. Look at SearcherManager for code that helps you do
this.
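The idea of refreshing on a timer rather than on every search can be sketched in plain Java. This is not SearcherManager itself: PeriodicRefresher, get, refreshNow and startRefreshing are hypothetical names, and the supplier stands in for whatever expensive step opens a fresh view of the index. The sketch also leaves out the reference counting that is needed to close an old reader safely once in-flight searches have finished with it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class PeriodicRefresher<T> {
    private final Supplier<T> opener;          // hypothetical: the expensive "open a fresh view" step
    private final AtomicReference<T> current;  // the view that all searches share

    PeriodicRefresher(Supplier<T> opener) {
        this.opener = opener;
        this.current = new AtomicReference<>(opener.get());
    }

    /** Cheap: called on every search, never opens anything. */
    T get() {
        return current.get();
    }

    /** Expensive: swaps in a freshly opened view. */
    void refreshNow() {
        current.set(opener.get());
    }

    /** Refreshes in the background at a fixed delay, e.g. 1000 ms. */
    ScheduledExecutorService startRefreshing(long periodMillis) {
        ScheduledExecutorService ex = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "refresher");
            t.setDaemon(true); // do not keep the JVM alive just for refreshing
            return t;
        });
        ex.scheduleWithFixedDelay(this::refreshNow, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return ex;
    }
}
```

With this shape, the cost of opening is paid about once per period no matter how many searches arrive in between.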
On
Robert. Thanks, I've already done a similar thing. Results on my test
platform are encouraging.
On 2014/06/03, 2:41 PM, Robert Muir wrote:
Reopening for every search is not a good idea. This will have an
extremely high cost (not as high as what you are doing with paging
but still not good).
With regards to pagination, is there a way for you to cache the
IndexSearcher, Query, and TopDocs between user pagination requests (a
lot of webapp frameworks have object caching mechanisms)? If so, you
may have luck with code like this:
void ensureTopDocs(final int rank) throws IOException {
Thanks Jon
I'll investigate your idea further.
It would be nice if, in future, the Lucene API could provide a
searchAfter that takes a position (int).
Regards
Jamie
On 2014/06/03, 3:24 PM, Jon Stewart wrote:
With regards to pagination, is there a way for you to cache the
IndexSearcher,
Jamie [ja...@mailarchiva.com] wrote:
It would be nice if, in future, the Lucene API could provide a
searchAfter that takes a position (int).
It would not really help with large result sets. At least not with the current
underlying implementations. This is tied into your current performance
Hi,
I'd like to index (Haskell) source code. I've run the source code through a
compiler (GHC) to get rich information about each token (its type, fully
qualified name, etc) that I want to index (and later use when ranking).
I'm wondering how to approach indexing source code. I can see two
The first question for any search app should always be: How do you intend to
query the data? That will in large part determine how you should index the
data.
IOW, how do you intend to use the data? Be specific.
Provide some sample queries and then work backwards to how the data needs to
be