Searching on Large Indexes

2014-06-27 Thread Sandeep Khanzode
Hi,

I have an index that runs into 200-300GB. It is not frequently updated.

What are the best strategies to query on this index?
1.] Should I, at index time, split the content, like a hash based partition, 
into multiple separate smaller indexes and aggregate the results 
programmatically?
2.] Should I replicate this index and provide some sort of document ID, and 
search on each node for a specific range of document IDs?
3.] Is there any way I can split or move individual segments to different nodes 
and aggregate the results?

I am not fully aware of the large scale query strategies. Can you please share 
your findings or experiences? Thanks, 
 
---
Thanks n Regards,
Sandeep Ramesh Khanzode

Re: Searching on Large Indexes

2014-06-27 Thread Jigar Shah
Some points based on my experience.

If you want to distribute your index over multiple servers, consider
SolrCloud.

Use MMapDirectory locally for each Solr instance in the cluster.
Run warm-up queries on server start-up, so that most of the documents are
cached and subsequent requests save on disk I/O.
For example, if you have 4 Solr instances with 64GB of RAM each, most of
the documents of a 200GB index will stay in RAM, and this will give you
better performance.
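
The arithmetic behind that example can be sketched as follows. MMapDirectory
relies on the OS page cache rather than the JVM heap, so what matters is the
RAM left over after the heap and the OS take their share. The class name and
the 8GB heap / 2GB overhead figures are assumptions for illustration only,
not measurements from this thread.

```java
/** Rough check: does each node's share of the index fit in the OS page cache? */
final class PageCacheEstimate {

    /** Index size held by each node if the index is split evenly. */
    static double shardSizeGb(double indexGb, int nodes) {
        return indexGb / nodes;
    }

    /**
     * RAM left for the OS page cache on one node.
     * jvmHeapGb and osOverheadGb are assumed values; measure your own setup.
     */
    static double pageCacheGb(double ramGb, double jvmHeapGb, double osOverheadGb) {
        return ramGb - jvmHeapGb - osOverheadGb;
    }

    /** Fraction of a node's shard that can stay resident, capped at 1.0. */
    static double cachedFraction(double shardGb, double cacheGb) {
        return Math.min(1.0, Math.max(0.0, cacheGb) / shardGb);
    }

    public static void main(String[] args) {
        // 200GB index over 4 nodes with 64GB RAM each (figures from the mail above).
        double shard = shardSizeGb(200, 4);
        // Assumed: 8GB JVM heap and 2GB for the OS and other processes.
        double cache = pageCacheGb(64, 8, 2);
        System.out.printf("shard=%.0fGB cache=%.0fGB cached=%.0f%%%n",
                shard, cache, 100 * cachedFraction(shard, cache));
    }
}
```

Under the same assumptions, a 300GB index would give each node 75GB to hold,
so only part of it could stay cached and some requests would still hit disk.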

To take advantage of a multi-core system, you can increase the number of
searcher threads, ideally up to the number of cores on a single instance.
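
As a rough illustration of sizing a pool to the core count, here is a
plain-Java sketch. The class and method names are hypothetical, and the tasks
are placeholders standing in for per-segment or per-shard searches. Lucene's
IndexSearcher has a constructor that accepts an executor for this purpose;
check the Javadoc of your version for the exact signature.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SearcherPool {

    /** One search thread per available core. */
    static int threadCount() {
        return Math.max(1, Runtime.getRuntime().availableProcessors());
    }

    /** Runs all tasks on a fixed-size pool; results come back in submission order. */
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            for (Future<T> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while searching", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("search task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<Integer>> tasks = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            final int segment = i;
            tasks.add(() -> segment * segment); // placeholder for "search one segment"
        }
        System.out.println(runAll(tasks, threadCount()));
    }
}
```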






Re: Searching on Large Indexes

2014-06-27 Thread Toke Eskildsen
On Fri, 2014-06-27 at 12:33 +0200, Sandeep Khanzode wrote:
> I have an index that runs into 200-300GB. It is not frequently updated.

"Not frequently" means different things to different people. Could you
give an approximate time span? If it is updated monthly, you might
consider a full optimization after each update.

> What are the best strategies to query on this index?
>
> 1.] Should I, at index time, split the content, like a hash based
> partition, into multiple separate smaller indexes and aggregate the
> results programmatically?

Assuming you use multiple machines or independent storage for the
multiple indexes, this will bring down latency. Do this if your searches
are too slow.
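
A minimal sketch of the hash-based routing asked about in 1.], assuming each
document has a string ID. The class and method names are hypothetical. At
index time a document goes to the index numbered shardFor(id, n); at query
time every shard is searched and the results are combined.

```java
final class ShardRouter {

    /**
     * Maps a document ID to a shard number in [0, shardCount).
     * String.hashCode() is specified by the JDK, so the mapping is stable
     * across runs; floorMod keeps the result non-negative.
     */
    static int shardFor(String docId, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        return Math.floorMod(docId.hashCode(), shardCount);
    }

    public static void main(String[] args) {
        String[] ids = {"doc-1", "doc-2", "doc-3", "doc-4"}; // made-up IDs
        for (String id : ids) {
            System.out.println(id + " -> shard " + shardFor(id, 4));
        }
    }
}
```

Note that changing the shard count remaps most IDs, so the number of shards
is best chosen before indexing.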

> 2.] Should I replicate this index and provide some
> sort of document ID, and search on each node for a specific range of
> document IDs?

I don't really follow that idea. Are your searches only ID-based?

Anyway, replication increases throughput. Do this if your servers have
trouble keeping up with the full amount of work.
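
To make the throughput point concrete: with full replicas, each query needs
to go to only one copy, so the client can simply rotate between them. The
sketch below is a hypothetical round-robin picker; the class name and replica
names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class ReplicaPicker {
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    ReplicaPicker(List<String> replicas) {
        if (replicas.isEmpty()) {
            throw new IllegalArgumentException("need at least one replica");
        }
        this.replicas = new ArrayList<>(replicas); // defensive copy
    }

    /** Returns the replicas in turn; safe to call from several threads. */
    String pick() {
        // floorMod keeps the index valid even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i);
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("replica-a"); // made-up names
        names.add("replica-b");
        ReplicaPicker picker = new ReplicaPicker(names);
        for (int q = 0; q < 4; q++) {
            System.out.println("query " + q + " -> " + picker.pick());
        }
    }
}
```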

> 3.] Is there any way I can split or move individual
> segments to different nodes and aggregate the results?

Copy the full index. In copy 1, delete all documents that match one half
of your ID-hash function; in copy 2, delete the other half. As your
corpus is semi-randomly distributed, scores should be comparable between
the indexes, so the result sets can be easily merged.
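
The merge step can be sketched in plain Java as below, assuming each index
returns its own list of (document ID, score) hits and that the scores are
comparable as described. Hit and ResultMerger are hypothetical names, not
Lucene classes. Lucene also ships a TopDocs.merge helper for this; check the
Javadoc of your version for its exact signature.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ResultMerger {

    /** One search hit as returned by a single index. */
    static final class Hit {
        final String docId;
        final float score;

        Hit(String docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    /** Highest score first; ties broken by document ID for a stable order. */
    private static final Comparator<Hit> BY_SCORE_DESC =
            Comparator.comparingDouble((Hit h) -> h.score)
                    .reversed()
                    .thenComparing((Hit h) -> h.docId);

    /** Combines the per-index hit lists and keeps the k best hits overall. */
    static List<Hit> mergeTopK(List<List<Hit>> perIndexHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perIndexHits) {
            all.addAll(hits);
        }
        all.sort(BY_SCORE_DESC);
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

Each index only needs to return its own top k hits, since no hit outside a
per-index top k can appear in the merged top k.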

But as Jigar says, you should consider switching to SolrCloud (or
Elasticsearch), which does all this for you.

> I am not fully aware of the large scale query strategies. Can you
> please share your findings or experiences?

Depends on what you mean by large scale. You have a running system -
what do you want? Scaling up? Lowering latency? Increasing throughput?
More complex queries?

- Toke Eskildsen, State and University Library, Denmark


