search performance

2014-06-02 Thread Jamie

Greetings

Despite following all the recommended optimizations (as described at 
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of 
our installations, search performance has reached the point where it is 
unacceptably slow. For instance, in one environment, the total index 
size is 200GB, with 150 million documents indexed. With NRT enabled, 
search speed is roughly 5 minutes on average. The server resources are: 
2x6 Core Intel CPU, 128GB RAM, 2 SSDs in RAID 0 for the index, with Linux.


The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to 
4.8.x. Is this likely to make any noticeable difference in performance?


Clearly, longer term, we need to move to a distributed search model. We 
thought of taking advantage of the distributed search features offered in 
Solr; however, our solution is very tightly integrated into Lucene 
directly (since Solr didn't exist when we started out). Moving to Solr 
now seems like a daunting prospect. We've also been following the Katta 
project with interest, but it doesn't appear to support distributed 
indexing, and development on it seems to have stalled. It would be nice 
if there were a distributed search project at the Lucene level that we 
could use.


I realize this is a rather vague question, but are there any further 
suggestions on ways to improve search performance? We need cheap and 
dirty ideas, as well as longer term advice on a possible path forward.


Much appreciated

Jamie




Re: remapping docIds in a read only offline built index

2014-06-02 Thread Olivier Binda

Hello, I'm still interested in an answer to the following question:

In a 1-segment read-only index (that is built offline once and then 
frozen), is it possible to remap the docIds?




I may have a (working but not optimal) answer to my original problem: I 
may use a MultiReader and 3 indexes to get the following composite index


docId : document
------------------
1     : bookA
2     : bookB
...
M     : linkA
M+1   : linkB
...
N+1   : sentenceA
N+2   : sentenceB
...
30    : sentenceZZZ


This solution should be slower than if I only built 1 index while having 
the docId equal to the order in which I added the documents.
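
Roughly, what I have in mind is something like this (only a sketch; the index 
paths are made up, and the docIds simply follow the sub-readers' docBase in 
the order they are passed to the MultiReader constructor):

import java.io.File;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.FSDirectory;

// three separately built indexes (books, links, sentences) exposed as one
DirectoryReader books     = DirectoryReader.open(FSDirectory.open(new File("idx/books")));
DirectoryReader links     = DirectoryReader.open(FSDirectory.open(new File("idx/links")));
DirectoryReader sentences = DirectoryReader.open(FSDirectory.open(new File("idx/sentences")));

// docIds are laid out in constructor order: [0, books.maxDoc()) are books,
// then links, then sentences -- which gives the M/N boundaries above
MultiReader composite = new MultiReader(books, links, sentences);
IndexSearcher searcher = new IndexSearcher(composite);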











On 05/12/2014 06:01 PM, Olivier Binda wrote:
In a 1-segment (parallel) read-only index, that is built offline once 
(and then frozen),
is it possible to remap the docIds as the last step (i.e. to have 
the exact same index, except that the docIds are all equal to the ord 
the docs were added to the index)?


Say I have the read only index

docId : document
1     : bookB
2     : sentenceB
3     : linkA
4     : linkC
5     : sentenceC
6     : sentenceA
7     : bookA
...
30    : linkD

I would like to have instead the read-only index

docId : document
1     : bookA
2     : bookB
...
M     : linkA
M+1   : linkB
...
N+1   : sentenceA
N+2   : sentenceB
...
30    : sentenceZZZ

This would allow me to reduce the amount of RAM needed to cache the type of 
each document:

- without remapping, I need at least log2(types) * documents bits
  (here 2 * 30 bits)

- with remapping, I only need to remember the ints M and N

Also, if I need to cache 1 byte of metadata for each book:

- without remapping, I would need 1 byte * documents
  (here 30 bytes)

- with remapping, I would only need 1 byte * books
  (here M - 1 bytes)


I tried building such an index with 
LogMergePolicy/NoMergePolicy/extending the RAM buffer, but (maybe I 
did something wrong) the docIds were always reshuffled (maybe because my 
index was big and I was over a threshold).




Best regards,
Olivier







Re: remapping docIds in a read only offline built index

2014-06-02 Thread Michael McCandless
The index sorting APIs (in lucene/misc) can do this.  E.g. you could
make a SortingAtomicReader, with your sort criteria, then use
addIndexes(IR[]) to add it to a new index.  That resulting index would
have 1 segment and the docIDs would be in your order.
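
Roughly, something along these lines (a minimal sketch assuming Lucene 4.x with
lucene-misc on the classpath; the "type" sort field, paths and analyzer are
placeholders, and exact signatures may differ slightly between 4.x releases):

import java.io.File;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.index.sorter.SortingAtomicReader;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

Directory src = FSDirectory.open(new File("idx/original"));
Directory dst = FSDirectory.open(new File("idx/sorted"));

DirectoryReader reader = DirectoryReader.open(src);
AtomicReader flat = SlowCompositeReaderWrapper.wrap(reader);

// the sort criteria decide the new docID order, e.g. by a "type" field
Sort sort = new Sort(new SortField("type", SortField.Type.STRING));
AtomicReader sorted = SortingAtomicReader.wrap(flat, sort);

IndexWriter writer = new IndexWriter(dst,
    new IndexWriterConfig(Version.LUCENE_47, new KeywordAnalyzer()));
writer.addIndexes(sorted);   // writes one segment, docIDs in sorted order
writer.close();
reader.close();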

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 12, 2014 at 12:01 PM, Olivier Binda
olivier.bi...@wanadoo.fr wrote:
 In a 1-segment (parallel) read-only index, that is built offline once (and
 then frozen),
 is it possible to remap the docIds as the last step (i.e... to have the
 exact same index, except that the docIds are all equal to the ord the docs
 were added to the index)?

 Say I have the read only index

 docId   : document
 1 : bookB
 2 : sentenceB
 3 : linkA
 4 : linkC
 5 : sentenceC
 6 : sentenceA
 7 : bookA
 ...
 30 : linkD

 I would like to have instead the read-only index

 docId   : document
 1 : bookA
 2 : bookB
 

 M : linkA
 M+1: linkB
 ...
 N+1 : sentenceA
 N+2 : sentenceB
 ...
 30:sentenceZZZ

 This would allow me to reduce the amount of ram to cache the type of each
 document

 - without remapping, I need at least log2(types)* documents bits
 here 2 * 30 bits

 - with remapping, I need only to remember ints M and N

 Also, if I need to cache 1 byte of metadata for each book

 - without remapping, I would need 1 byte * documents
 here 30 bytes

 - with remapping, I would only need 1 byte * books
 here M - 1 bytes


 I tried building such an index with LogMergePolicy/NoMergePolicy/extending
 the RAM buffer, but (maybe I did something wrong),
 the docIds were always reshuffled (maybe because my index was big and I was
 over a threshold)



 Best regards,
 Olivier






Re: remapping docIds in a read only offline built index

2014-06-02 Thread Olivier Binda

Very nice! That is exactly what I needed. Thank you very much!


On 06/02/2014 09:26 AM, Michael McCandless wrote:

The index sorting APIs (in lucene/misc) can do this.  E.g. you could
make a SortingAtomicReader, with your sort criteria, then use
addIndexes(IR[]) to add it to a new index.  That resulting index would
have 1 segment and the docIDs would be in your order.

Mike McCandless

http://blog.mikemccandless.com


On Mon, May 12, 2014 at 12:01 PM, Olivier Binda
olivier.bi...@wanadoo.fr wrote:

In a 1-segment (parallel) read-only index, that is built offline once (and
then frozen),
is it possible to remap the docIds as the last step (i.e... to have the
exact same index, except that the docIds are all equal to the ord the docs
were added to the index)?

Say I have the read only index

docId   : document
1 : bookB
2 : sentenceB
3 : linkA
4 : linkC
5 : sentenceC
6 : sentenceA
7 : bookA
...
30 : linkD

I would like to have instead the read-only index

docId   : document
1 : bookA
2 : bookB


M : linkA
M+1: linkB
...
N+1 : sentenceA
N+2 : sentenceB
...
30:sentenceZZZ

This would allow me to reduce the amount of ram to cache the type of each
document

- without remapping, I need at least log2(types)* documents bits
here 2 * 30 bits

- with remapping, I need only to remember ints M and N

Also, if I need to cache 1 byte of metadata for each book

- without remapping, I would need 1 byte * documents
here 30 bytes

- with remapping, I would only need 1 byte * books
here M - 1 bytes


I tried building such an index with LogMergePolicy/NoMergePolicy/extending
the RAM buffer, but (maybe I did something wrong),
the docIds were always reshuffled (maybe because my index was big and I was
over a threshold)



Best regards,
Olivier











Re: search performance

2014-06-02 Thread Tincu Gabriel
What kind of queries are you pushing into the index? Do they match a lot of
documents? Do you do any sorting on the result set? What is the average
document size? Do you have a lot of update traffic? What kind of schema
does your index use?


On Mon, Jun 2, 2014 at 6:51 AM, Jamie ja...@mailarchiva.com wrote:

 Greetings

 Despite following all the recommended optimizations (as described at
 http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of
 our installations, search performance has reached the point where it is
 unacceptably slow. For instance, in one environment, the total index size
 is 200GB, with 150 million documents indexed. With NRT enabled, search
 speed is roughly 5 minutes on average. The server resources are: 2x6 Core
 Intel CPU, 128GB RAM, 2 SSDs in RAID 0 for the index, with Linux.

 The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to
 4.8.x. Is this likely to make any noticeable difference in performance?

 Clearly, longer term, we need to move to a distributed search model. We
 thought to take advantage of the distributed search features offered in
 Solr, however, our solution is very tightly integrated into Lucene directly
 (since Solr didn't exist when we started out). Moving to Solr now seems
 like a daunting prospect. We've also been following the Katta project with
 interest, but it doesn't appear to support distributed indexing, and
 development on it seems to have stalled. It would be nice if there were a
 distributed search project on the Lucene level that we could use.

 I realize this is a rather vague question, but are there any further
 suggestions on ways to improve search performance? We need cheap and dirty
 ideas, as well as longer term advice on a possible path forward.

 Much appreciated

 Jamie





Re: search performance

2014-06-02 Thread Jamie

Tom

Thanks for the offer of assistance.

On 2014/06/02, 12:02 PM, Tincu Gabriel wrote:

What kind of queries are you pushing into the index?

We are indexing regular emails + attachments.

Typical query is something like:
filter: to:mbox08 from:mbox08 cc:mbox08 bcc:mbox08 
deliveredto:mbox08 sender:mbox08 recipient:mbox08

combined with filter query cat:email

We also use range queries based on date.
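
In code, it is assembled roughly like this (a simplified sketch: the field 
names are ours, but the mailbox value, the numeric date field and the 
range/sort values shown here are only illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.*;

long fromMillis = 1388534400000L;   // e.g. 2014-01-01
long toMillis   = 1401667200000L;   // e.g. 2014-06-02

// one SHOULD clause per address field
BooleanQuery mailboxClauses = new BooleanQuery();
for (String field : new String[] {"to", "from", "cc", "bcc",
        "deliveredto", "sender", "recipient"}) {
    mailboxClauses.add(new TermQuery(new Term(field, "mbox08")), BooleanClause.Occur.SHOULD);
}

BooleanQuery query = new BooleanQuery();
query.add(mailboxClauses, BooleanClause.Occur.MUST);
// date range (assuming the date is indexed as a numeric long field)
query.add(NumericRangeQuery.newLongRange("date", fromMillis, toMillis, true, true),
        BooleanClause.Occur.MUST);

// cat:email applied as a filter
Filter catFilter = new QueryWrapperFilter(new TermQuery(new Term("cat", "email")));

Sort sort = new Sort(new SortField("date", SortField.Type.LONG, true));
TopDocs results = searcher.search(query, catFilter, 1000, sort);  // searcher is our IndexSearcher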

Do they match a lot of documents ?

Yes, although we are using a collector...

TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, 
    true, false, false, true);


We use pagination, so only returning 1000 documents or so at a time.


  Do you do any sorting on the result set?

Yes

  What is the average
document size ?

Approx. 100KB. We are indexing email body + attachment content.

Do you have a lot of update traffic?
Yes, we have a lot of update traffic, particularly in the environment I 
referred to. Is there a way to prioritize searching as opposed to updating?


I suppose we could block all indexing while a search is in progress? Is 
there such an option in Lucene, or should we implement this ourselves?
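
Something like the following is what I have in mind (purely an 
application-level sketch of ours, not a Lucene facility; the method names 
are made up):

import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.*;

private final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();

// searches share the read lock, so they can run concurrently
public TopDocs doSearch(IndexSearcher searcher, Query query, int n) throws IOException {
    gate.readLock().lock();
    try {
        return searcher.search(query, n);
    } finally {
        gate.readLock().unlock();
    }
}

// indexing takes the write lock, so it waits for in-flight searches to finish
public void doIndex(IndexWriter writer, Document doc) throws IOException {
    gate.writeLock().lock();
    try {
        writer.addDocument(doc);
    } finally {
        gate.writeLock().unlock();
    }
}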

What kind of schema
does your index use?
Not sure exactly what you are referring to here. We do have a lot of 
stored fields (to, from, bcc, cc, etc.). The body and attachments are 
analyzed.


Regards

Jamie









Re: search performance

2014-06-02 Thread Jack Krupansky
Do you have enough system memory to fit the entire index in OS system memory 
so that the OS can fully cache it instead of thrashing with I/O? Do you see 
a lot of I/O or are the queries compute-bound?


You said you have a 128GB machine, so that sounds small for your index. Have 
you tried a 256GB machine?


How frequent are your commits for updates while doing queries?

-- Jack Krupansky

-----Original Message-----
From: Jamie

Sent: Monday, June 2, 2014 2:51 AM
To: java-user@lucene.apache.org
Subject: search performance

Greetings

Despite following all the recommended optimizations (as described at
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of
our installations, search performance has reached the point where it is
unacceptably slow. For instance, in one environment, the total index
size is 200GB, with 150 million documents indexed. With NRT enabled,
search speed is roughly 5 minutes on average. The server resources are:
2x6 Core Intel CPU, 128GB RAM, 2 SSDs in RAID 0 for the index, with Linux.

The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to
4.8.x. Is this likely to make any noticeable difference in performance?

Clearly, longer term, we need to move to a distributed search model. We
thought to take advantage of the distributed search features offered in
Solr, however, our solution is very tightly integrated into Lucene
directly (since Solr didn't exist when we started out). Moving to Solr
now seems like a daunting prospect. We've also been following the Katta
project with interest, but it doesn't appear to support distributed
indexing, and development on it seems to have stalled. It would be nice
if there were a distributed search project on the Lucene level that we
could use.

I realize this is a rather vague question, but are there any further
suggestions on ways to improve search performance? We need cheap and
dirty ideas, as well as longer term advice on a possible path forward.

Much appreciated

Jamie







Re: search performance

2014-06-02 Thread Jamie

Jack

First off, thanks for applying your mind to our performance problem.

On 2014/06/02, 1:34 PM, Jack Krupansky wrote:
Do you have enough system memory to fit the entire index in OS system 
memory so that the OS can fully cache it instead of thrashing with 
I/O? Do you see a lot of I/O or are the queries compute-bound?
Nice idea. The index is 200GB, and the machine currently has 128GB RAM. We 
are using SSDs, but disappointingly, installing them didn't reduce 
search times to acceptable levels. I'll have to check your last question 
regarding I/O... I assume it is I/O bound, though I will double-check.


Currently, we are using

fsDirectory = new NRTCachingDirectory(fsDir, 5.0, 60.0);
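// (the 5.0 and 60.0 are NRTCachingDirectory's maxMergeSizeMB and maxCachedMB)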

Are you proposing we increase maxCachedMB or use the RAMDirectory? With 
the latter, we will still need to persist the index data to disk, as 
it is undergoing constant updates.


You said you have a 128GB machine, so that sounds small for your 
index. Have you tried a 256GB machine?
Nope... didn't think it would make much of a difference. I suppose, 
assuming we could store the entire index in RAM, it would be helpful. How 
does one do this with Lucene, while still persisting the data?


How frequent are your commits for updates while doing queries?

Around ten to fifteen documents are constantly being added per second.

Thanks again

Jamie





Re: MultiReader docid reliability

2014-06-02 Thread Nicola Buso
Hi Erick,

the good reason for now is caching: we use the docids to store results in a
cache, and I wanted a better explanation of "ephemeral" to understand
the possible lifetime of the cache.
From the answers, "ephemeral" can relate to the (re)opening of the
IndexReader (in general, as a precaution), and any kind of modification to
the index is another interpretation.

So it's not strictly necessary; it was just a matter of better understanding
the javadoc. I see the javadoc is the same for all the IndexReader
implementations, so I presume there are no differences between them.



nicola.


On Fri, 2014-05-30 at 12:50 -0700, Erick Erickson wrote:
 If you do an optimize, btw, the internal doc IDs may change.
 
 
 But _why_ do you want to keep them? You may have very good reasons,
 but it's not clear that this is necessary/desirable from what you've
 said so far...
 
 
 Best,
 Erick
 
 
 On Fri, May 30, 2014 at 7:49 AM, Nicola Buso nb...@ebi.ac.uk wrote:
 Hi,
 
 thanks Michael and Alan. It is enough to know that when re-opening the
 index there is no guarantee that the docids are maintained, even if
 the index does not change.
 
 And I will try the question also on the Solr mailinglist.
 
 
 nicola.
 
 
 On Fri, 2014-05-30 at 10:41 -0400, Michael Sokolov wrote:
  There is a Solr document cache that holds field values too,
 see:
  http://wiki.apache.org/solr/SolrCaching
 
  Maybe take this question over to the solr mailing list?
 
  -Mike
 
  On 5/30/2014 10:32 AM, Alan Woodward wrote:
   Solr caches hold lucene docids, which are invalidated
 every time a new searcher is opened.  The various fields for a
 response aren't cached as far as I know, they're reloaded on
 each request.  But loading the fields for 10 documents is
 typically very fast, compared to searching over a very large
 collection.
  
   Alan Woodward
   www.flax.co.uk
  
  
   On 30 May 2014, at 11:20, Nicola Buso wrote:
  
   Hi Alan,
  
    just to make it more typical (yes, there are no IndexWriters open on
    those indexes): how is Solr caching results? The first thing I would like
    to do is to store the doc ids and return to the reader for the real
    content. Is Solr storing the whole results with all values?
  
  
   nicola.
  
  
   On Fri, 2014-05-30 at 11:05 +0100, Alan Woodward wrote:
    If the index is truly unchanging (i.e. there's no IndexWriter open on
    it) then I guess the document numbers will be stable across reopens.
    But this is a pretty specialized situation, and the docs are really
    there to warn you off trying to rely on this for more typical uses.
  
   Alan Woodward
   www.flax.co.uk
  
  
  
   On 30 May 2014, at 10:39, Nicola Buso wrote:
  
   Hi Alan,
  
   thanks a lot for the reply.
  
    From what I understood of your reply, if the index is not changing
    (no adds, deletes or even updates), the doc ids viewed by the MultiReader
    will not change if you open that unchanged index multiple times, even in
    different environments.

    If this is true (my understanding), the word "ephemeral" in the API docs
    could be elaborated a bit more.
  
  
   nicola
  
   On Fri, 2014-05-30 at 09:26 +0100, Alan Woodward wrote:
   Hi Nicola,
  
  
    1) A session here means as long as you have that MultiReader open.
    IndexReaders see a snapshot of the index and so document ids
    shouldn't change over the lifetime of an IndexReader, even if the
    index is being updated.


    2) MultiReader just takes an array of subindexes, so as long as the
    subindexes are passed to the MultiReader constructor in the same order
    on both machines, the docBase assigned to each reader context should
    be the same.
  
   Alan Woodward
   www.flax.co.uk
  
  
  
   On 29 May 2014, at 14:29, Nicola Buso wrote:
  
   Hi,
  
   from the javadocs:
  
   
   For efficiency, in this API documents are often
 referred to via
   document
   numbers, non-negative integers which each name a
   

Re: search performance

2014-06-02 Thread Tincu Gabriel
MMapDirectory will do the job for you. RAMDirectory has a big warning in
the class description stating that performance will get killed by an
index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
for RAMDirectory and suitable for low update rates. MMap will use the
system RAM to cache as much of the index as it can and only hit disk when the
portion of the index you're trying to access isn't cached. I'd put my money
on switching directory implementations and seeing what kind of performance
gains that brings to the table.
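
Something along these lines, for example (just a sketch: the path, sizes,
analyzer and Lucene version constant are illustrative, and the
NRTCachingDirectory wrapping only mirrors what you already use):

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.*;
import org.apache.lucene.util.Version;

// memory-map the on-disk index, then keep the NRT cache on top of it
Directory mmap = new MMapDirectory(new File("/path/to/index"));
Directory dir  = new NRTCachingDirectory(mmap, 5.0, 60.0);

IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47)));

// near-real-time reader taken from the writer
DirectoryReader reader = DirectoryReader.open(writer, true);
IndexSearcher searcher = new IndexSearcher(reader);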


On Mon, Jun 2, 2014 at 11:50 AM, Jamie ja...@mailarchiva.com wrote:

 Jack

 First off, thanks for applying your mind to our performance problem.


 On 2014/06/02, 1:34 PM, Jack Krupansky wrote:

 Do you have enough system memory to fit the entire index in OS system
 memory so that the OS can fully cache it instead of thrashing with I/O? Do
 you see a lot of I/O or are the queries compute-bound?

 Nice idea. The index is 200GB, the machine currently has 128GB RAM. We are
 using SSDs, but disappointingly, installing them didn't reduce search times
 to acceptable levels. I'll have to check your last question regarding
 I/O... I assume it is I/O bound, though will double check...

 Currently, we are using

 fsDirectory = new NRTCachingDirectory(fsDir, 5.0, 60.0);

 Are you proposing we increase maxCachedMB or use the RAMDirectory? With
 the latter, we will still need to persist the index data to disk, as it
 is undergoing constant updates.


 You said you have a 128GB machine, so that sounds small for your index.
 Have you tried a 256GB machine?

 Nope... didn't think it would make much of a difference. I suppose, assuming
 we could store the entire index in RAM, it would be helpful. How does one do
 this with Lucene, while still persisting the data?


 How frequent are your commits for updates while doing queries?

 Around ten to fifteen documents are being constantly added per second.

 Thanks again


 Jamie






Re: search performance

2014-06-02 Thread Jamie
I was under the impression that NRTCachingDirectory will instantiate an 
MMapDirectory if a 64-bit platform is detected? Is this not the case?


On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:

MMapDirectory will do the job for you. RamDirectory has a big warning in
the class description stating that the performance will get killed by an
index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
for RamDirectory and suitable for low update rates. MMap will use the
system RAM to cache as much of the index it can and only hit disk when the
portion of the index you're trying to access isn't cached. I'd put my money
on switching directory implementations and see what kind of performance
gains that brings to the table.





Re: search performance

2014-06-02 Thread Tincu Gabriel
My bad, it's using the RAMDirectory as a cache and a delegate directory
that you pass in the constructor to do the disk operations, limiting the
use of the RAMDirectory to files that fit a certain size. So I guess the
underlying Directory implementation will be whatever you choose it to be.
I'd still try using an MMapDirectory and see if that improves performance.
Also, regarding the pagination, you said you're retrieving 1000 documents
at a time. Does that mean that if a query matches 1 documents you want
all of them retrieved?


On Mon, Jun 2, 2014 at 12:51 PM, Jamie ja...@mailarchiva.com wrote:

 I was under the impression that NRTCachingDirectory will instantiate an
 MMapDirectory if a 64 bit platform is detected? Is this not the case?


 On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:

 MMapDirectory will do the job for you. RamDirectory has a big warning in
 the class description stating that the performance will get killed by an
 index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
 for RamDirectory and suitable for low update rates. MMap will use the
 system RAM to cache as much of the index it can and only hit disk when the
 portion of the index you're trying to access isn't cached. I'd put my
 money
 on switching directory implementations and see what kind of performance
 gains that brings to the table.





Re: search performance

2014-06-02 Thread Jamie
I assume you meant 1000 documents. Yes, the page size is in fact 
configurable. However, it only obtains page size * 3 documents: it preloads 
the following and previous page too. The point is, it only obtains the 
documents that are needed.
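
Roughly, the collection side looks like this (a simplified sketch of what I 
described; the query, sort and page index are illustrative and come from 
elsewhere in our code):

import org.apache.lucene.document.Document;
import org.apache.lucene.search.*;

int pageSize = 1000;              // configurable
int window   = pageSize * 3;      // previous + current + next page

TopFieldCollector collector = TopFieldCollector.create(sort, window,
    true, false, false, true);
searcher.search(query, collector);          // searcher, query and sort as above

// slice out just the requested page (e.g. the second one)
TopDocs page = collector.topDocs(pageSize, pageSize);
for (ScoreDoc sd : page.scoreDocs) {
    Document doc = searcher.doc(sd.doc);    // only load documents on this page
}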



On 2014/06/02, 3:03 PM, Tincu Gabriel wrote:

My bad, It's using the RamDirectory as a cache and a delegate directory
that you pass in the constructor to do the disk operations, limiting the
use of the RamDirectory to files that fit a certain size. So i guess the
underlying Directory implementation will be whatever you choose it to be.
I'd still try using a MMapDirectory and see if that improves performance.
Also, regarding the pagination, you said you're retrieving 1000 documents
at a time. Does that mean that if a query matches 1 documents you want
all of them retrieved ?


On Mon, Jun 2, 2014 at 12:51 PM, Jamie ja...@mailarchiva.com wrote:


I was under the impression that NRTCachingDirectory will instantiate an
MMapDirectory if a 64 bit platform is detected? Is this not the case?


On 2014/06/02, 2:09 PM, Tincu Gabriel wrote:


MMapDirectory will do the job for you. RamDirectory has a big warning in
the class description stating that the performance will get killed by an
index larger than a few hundred MB, and NRTCachingDirectory is a wrapper
for RamDirectory and suitable for low update rates. MMap will use the
system RAM to cache as much of the index it can and only hit disk when the
portion of the index you're trying to access isn't cached. I'd put my
money
on switching directory implementations and see what kind of performance
gains that brings to the table.








Re: search performance

2014-06-02 Thread Tri Cao

This is an interesting performance problem and I think there is probably not
a single answer here, so I'll just lay out the steps I would take to tackle this:

1. What is the variance of the query latency? You said the average is 5 minutes,
but is that due to some really bad queries, or do most queries have the same perf?

2. We kind of assume that index size and number of docs is the issue here.
Can you validate that assumption by trying to index with 10M, 50M, … docs
and seeing how much worse the performance gets as a function of size?

3. What is the average hit count for the bad queries? If your queries match
a lot of hits, scoring will be very expensive. While you only ask for the 1000 top
scored docs, Lucene still needs to score all the hits to get those 1000 docs
(the sketch below is one quick way to measure this). If this is the case, there
could be some workarounds, but let's make sure that it's indeed the situation
we are dealing with here.
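
For steps 1 and 3, a quick sketch like this can report per-query latency and
total hit count (standard Lucene classes; "searcher" and "query" are whatever
you already run):

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TotalHitCountCollector;

// counts matching docs without keeping or ranking top hits
TotalHitCountCollector counter = new TotalHitCountCollector();
long start = System.nanoTime();
searcher.search(query, counter);
long elapsedMs = (System.nanoTime() - start) / 1000000;
System.out.println("hits=" + counter.getTotalHits() + " latencyMs=" + elapsedMs);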

Hope this helps,
Tri

On Jun 01, 2014, at 11:50 PM, Jamie ja...@mailarchiva.com wrote:

Greetings

Despite following all the recommended optimizations (as described at 
http://wiki.apache.org/lucene-java/ImproveSearchingSpeed) , in some of 
our installations, search performance has reached the point where it is 
unacceptably slow. For instance, in one environment, the total index 
size is 200GB, with 150 million documents indexed. With NRT enabled, 
search speed is roughly 5 minutes on average. The server resources are: 
2x6 Core Intel CPU, 128GB RAM, 2 SSDs in RAID 0 for the index, with Linux.


The only thing we haven't yet done, is to upgrade Lucene from 4.7.x to 
4.8.x. Is this likely to make any noticeable difference in performance?


Clearly, longer term, we need to move to a distributed search model. We 
thought to take advantage of the distributed search features offered in 
Solr, however, our solution is very tightly integrated into Lucene 
directly (since Solr didn't exist when we started out). Moving to Solr 
now seems like a daunting prospect. We've also been following the Katta 
project with interest, but it doesn't appear to support distributed 
indexing, and development on it seems to have stalled. It would be nice 
if there were a distributed search project on the Lucene level that we 
could use.


I realize this is a rather vague question, but are there any further 
suggestions on ways to improve search performance? We need cheap and 
dirty ideas, as well as longer term advice on a possible path forward.


Much appreciated

Jamie




Possible order violation in lucene library version 2.4.1

2014-06-02 Thread Swarnendu Biswas
Hi,





I am working on a research project on data race detection, and am using the 
DaCapo benchmarks for evaluation. I am using the benchmark lusearch from the 
2009 suite, which uses the Lucene 2.4.1 library.

For one test case, I am monitoring a pair of accesses, say, 
Lorg/apache/lucene/store/Directory;.init ()V:40(6) and 
Lorg/apache/lucene/store/FSDirectory;.close ()V:524(1). The format is class 
name.method name method desc:line(bytecode index).

During my work, I am getting AlreadyClosedExceptions on the FSDirectory from 
the ensureOpen() method for some threads, which I am guessing is probably due 
to an order violation. I have actually introduced delays in my instrumentation 
which delay threads that execute the code in 
Lorg/apache/lucene/store/FSDirectory;.close ()V. This is causing the other 
query threads to throw an exception.

Here is the exception trace:

org.apache.lucene.store.AlreadyClosedException: this Directory is closed
   at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
   at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
   at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
   at 
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
   at org.dacapo.lusearch.Search$QueryProcessor.init(Search.java:207)
   at org.dacapo.lusearch.Search$QueryThread.run(Search.java:179)
org.apache.lucene.store.AlreadyClosedException: this Directory is closed
   at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
   at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
   at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
   at 
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
   at org.dacapo.lusearch.Search$QueryProcessor.init(Search.java:207)
   at org.dacapo.lusearch.Search$QueryThread.run(Search.java:179)
java.lang.NullPointerException
   at org.dacapo.lusearch.Search$QueryProcessor.run(Search.java:226)
   at org.dacapo.lusearch.Search$QueryThread.run(Search.java:179)
java.lang.NullPointerException
   at org.dacapo.lusearch.Search$QueryProcessor.run(Search.java:226)
   at org.dacapo.lusearch.Search$QueryThread.run(Search.java:179)
org.apache.lucene.store.AlreadyClosedException: this Directory is closed
   at org.apache.lucene.store.Directory.ensureOpen(Directory.java:220)
   at org.apache.lucene.store.FSDirectory.list(FSDirectory.java:320)
   at 
org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:533)
   at 
org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:115)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:316)
   at org.apache.lucene.index.IndexReader.open(IndexReader.java:206)
   at org.dacapo.lusearch.Search$QueryProcessor.init(Search.java:207)
   at org.dacapo.lusearch.Search$QueryThread.run(Search.java:179)
java.lang.NullPointerException
   at org.dacapo.lusearch.Search$QueryProcessor.run(Search.java:226)
   at org.dacapo.lusearch.Search$QueryThread.run(Search.java:179)

It would help if someone could give suggestions as to what might be going wrong 
here. I have verified that the issue probably isn't in my instrumentation: 
simply instrumenting the Lucene source locations with sleep() calls also 
reproduces the error.

--Regards,
Swarnendu Biswas.