Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Adrien Grand
Hi Benson,

On Mon, Sep 30, 2013 at 5:21 PM, Benson Margulies ben...@basistech.com wrote:
 The multithreaded index searcher fans out across segments. How aggressively
 does 'optimize' reduce the number of segments? If the segment count goes
 way down, is there some other way to exploit multiple cores?

forceMerge[1], formerly known as optimize, takes a parameter to
configure how many segments should remain in the index.

Regarding multi-core usage, if your query load is high enough to use
all your CPUs (there are always #cores queries running in parallel),
there is generally no need to use the multi-threaded IndexSearcher.
The multi-threaded IndexSearcher can, however, help when not all CPU
power is in use or if you care more about latency than throughput.
It leverages the fact that the index is split into segments to
parallelize query execution, so a fully merged index will run the
query in a single thread in any case.

There is no way to make query execution efficiently use several cores
on a single-segment index, so if you really want to parallelize query
execution, you will have to shard the index to do at the index level
what the multi-threaded IndexSearcher does at the segment level.

Side notes:
 - A single-segment index only runs terms-dictionary-intensive queries
more efficiently; it is generally discouraged to run forceMerge on an
index unless this index is read-only.
 - The multi-threaded IndexSearcher only parallelizes query execution
in certain cases. In particular, it never parallelizes execution when
the method takes a collector. This means that if you want to use
TotalHitCountCollector to count matches, you will have to do the
parallelization yourself.
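
As a rough illustration of doing the counting parallelization yourself,
here is a minimal sketch that uses only the JDK. The Segment class and
its IntPredicate "query" are hypothetical stand-ins, not Lucene classes;
in a real application each task would presumably run the query against
one segment with its own collector. The point is only the shape: one
task per segment, then a sum of the partial counts.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

/** Sketch only: Segment is a hypothetical stand-in, not a Lucene class. */
final class ParallelHitCount {

    /** Hypothetical segment holding docs 0..maxDoc-1; 'query' stands in for a real query. */
    static final class Segment {
        final int maxDoc;
        final IntPredicate query;

        Segment(int maxDoc, IntPredicate query) {
            this.maxDoc = maxDoc;
            this.query = query;
        }

        /** Counts the matching docs of this segment on the calling thread. */
        int count() {
            int hits = 0;
            for (int doc = 0; doc < maxDoc; doc++) {
                if (query.test(doc)) {
                    hits++;
                }
            }
            return hits;
        }
    }

    /** Submits one counting task per segment and sums the partial counts. */
    static long countHits(List<Segment> segments, ExecutorService executor) {
        List<Future<Integer>> partials = new ArrayList<>();
        for (Segment segment : segments) {
            Callable<Integer> task = segment::count;
            partials.add(executor.submit(task));
        }
        long total = 0;
        for (Future<Integer> partial : partials) {
            try {
                total += partial.get(); // blocks until that segment has been counted
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while counting", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("segment task failed", e.getCause());
            }
        }
        return total;
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        try {
            IntPredicate evenDocs = doc -> doc % 2 == 0;
            List<Segment> segments = new ArrayList<>();
            segments.add(new Segment(10, evenDocs));
            segments.add(new Segment(7, evenDocs));
            segments.add(new Segment(1, evenDocs));
            System.out.println("total hits: " + countHits(segments, executor));
        } finally {
            executor.shutdown();
        }
    }
}
```

With a single segment this degenerates to one task, which is the same
limitation described above for a fully merged index.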

[1] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#forceMerge%28int%29

-- 
Adrien

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Michael McCandless
You might want to set a smallish maxMergedSegmentMB in
TieredMergePolicy to force enough segments in the index ... sort of
the opposite of optimizing.

Really, IndexSearcher's approach of using one thread per segment is
rather silly, and it's annoying/bad to expose a change in behavior due
to segment structure.

I think it'd be better to carve up the overall docID space into N
virtual shards.  I.e., if you have 100M docs, then one thread searches
docs 0-10M, another 10M-20M, etc.  Nobody has created such a searcher
impl, but it should not be hard and it would be agnostic to the segment
structure.
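
To make the arithmetic concrete, here is a small sketch of how the docID
space could be carved into N contiguous ranges. This is not a searcher
implementation, only the partitioning step; the class name VirtualShards
and the half-open [start, end) convention are assumptions made for the
illustration.

```java
import java.util.Arrays;

/** Sketch only: splits the docID space [0, maxDoc) into contiguous ranges. */
final class VirtualShards {

    /**
     * Returns 'shards' half-open ranges {start, end} that together cover [0, maxDoc).
     * Range sizes differ by at most one doc; the first (maxDoc % shards) ranges
     * receive the extra doc.
     */
    static int[][] split(int maxDoc, int shards) {
        if (maxDoc < 0 || shards <= 0) {
            throw new IllegalArgumentException("maxDoc must be >= 0 and shards must be > 0");
        }
        int[][] ranges = new int[shards][2];
        int base = maxDoc / shards;
        int extra = maxDoc % shards;
        int start = 0;
        for (int i = 0; i < shards; i++) {
            int size = base + (i < extra ? 1 : 0);
            ranges[i][0] = start;        // inclusive
            ranges[i][1] = start + size; // exclusive
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 100M docs over 10 threads: 0-10M, 10M-20M, and so on.
        for (int[] range : split(100_000_000, 10)) {
            System.out.println(Arrays.toString(range));
        }
    }
}
```

Each range would then be handed to its own thread, independent of where
the segment boundaries happen to fall.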

But then again, this need (using concurrent hardware to reduce latency
of a single query) is somewhat rare; most apps are fine using the
concurrency across queries rather than within one query.

Mike McCandless

http://blog.mikemccandless.com





Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Desidero
Benson,

Rather than forcing a random number of small segments into the index using
maxMergedSegmentMB, it might be better to split your index into multiple
shards. You can create a specific number of balanced shards to control the
parallelism and then forceMerge each shard down to 1 segment to avoid
spawning extra threads per shard. Once that's done, you just open all of
the shards with a MultiReader and use that with the IndexSearcher and an
ExecutorService.
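
One way to get a fixed number of roughly balanced shards at indexing
time is to route each document by a hash of some stable key. The sketch
below is JDK-only and purely illustrative: ShardRouter, shardFor and the
"doc-N" keys are hypothetical names, and hash routing is an assumption
on my part rather than anything Lucene prescribes. Each shard would then
be written by its own IndexWriter and merged down to one segment.

```java
/** Sketch only: hypothetical helper that assigns documents to a fixed number of shards. */
final class ShardRouter {

    /** Maps a document key to a shard number in [0, numShards). */
    static int shardFor(String docKey, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be > 0");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(docKey.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        int[] docsPerShard = new int[numShards];
        for (int i = 0; i < 1000; i++) {
            docsPerShard[shardFor("doc-" + i, numShards)]++;
        }
        for (int shard = 0; shard < numShards; shard++) {
            System.out.println("shard " + shard + ": " + docsPerShard[shard] + " docs");
        }
    }
}
```

Hashing only gives approximate balance. For a static index that is built
once, plain round-robin over the input documents would give shard sizes
that differ by at most one document.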

The downside to this is that it doesn't play nicely with near real-time
search, but if you have a relatively static index that gets pushed to
slaves periodically it gets the job done.

As Mike said, it'd be nicer if there was a way to split the docID space
into virtual shards, but it's not currently available. I'm not sure if
anyone is even looking into it.

Regards,
Matt






Re: How to make good use of the multithreaded IndexSearcher?

2013-10-01 Thread Benson Margulies
Thanks, folks, for all the help. I'm musing about the top-level issue
here, which is whether the important case is many independent queries
or latency of just one.  In the case where it's just one, we'll follow
the shard-related advice.




