Searching on Large Indexes
Hi,

I have an index that runs into 200-300GB. It is not frequently updated. What are the best strategies to query on this index?

1.] Should I, at index time, split the content, e.g. with a hash-based partition, into multiple separate smaller indexes and aggregate the results programmatically?
2.] Should I replicate this index and provide some sort of document ID, and search on each node for a specific range of document IDs?
3.] Is there any way I can split or move individual segments to different nodes and aggregate the results?

I am not fully aware of large-scale query strategies. Can you please share your findings or experiences?

Thanks,
---
Thanks n Regards,
Sandeep Ramesh Khanzode
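The aggregation step of strategy 1 can be sketched in plain Java. This is only an illustration of merging per-shard top-k results into a global top-k; the `Hit` record and the shard lists are made up for the example and are not part of any Lucene or Solr API.

```java
import java.util.*;

// Sketch: merge per-shard top-k results into a global top-k by score.
// "Hit" is a hypothetical stand-in for whatever result type the search layer returns.
public class ShardMerge {
    record Hit(String docId, float score) {}

    // Merge per-shard result lists; assumes scores are comparable across shards.
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        PriorityQueue<Hit> pq =
            new PriorityQueue<>(Comparator.comparingDouble(Hit::score).reversed());
        for (List<Hit> shard : perShard) pq.addAll(shard);
        List<Hit> out = new ArrayList<>();
        for (int i = 0; i < k && !pq.isEmpty(); i++) out.add(pq.poll());
        return out;
    }

    public static void main(String[] args) {
        List<Hit> shard1 = List.of(new Hit("a", 3.2f), new Hit("b", 1.1f));
        List<Hit> shard2 = List.of(new Hit("c", 2.7f), new Hit("d", 0.4f));
        List<Hit> top = mergeTopK(List.of(shard1, shard2), 3);
        System.out.println(top.stream().map(Hit::docId).toList()); // [a, c, b]
    }
}
```

Note that this simple merge is only valid if scores from the different shards are comparable, which is the point Toke raises below about semi-random document distribution.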
Re: Searching on Large Indexes
Some points based on my experience:

You can consider a SolrCloud implementation if you want to distribute your index over multiple servers.

Use MMapDirectory locally for each Solr instance in the cluster. Hit warm-up queries on server start-up so that most of the documents get cached, and you will save on disk IO on subsequent requests. For example, if you have 4 Solr instances with 64GB RAM each, most of your documents will stay in RAM for a 200GB index, and this will give you better performance.

To take advantage of a multi-core system, you can increase the searcher threads, ideally up to the number of cores you have on a single instance.

On Fri, Jun 27, 2014 at 4:03 PM, Sandeep Khanzode sandeep_khanz...@yahoo.com.invalid wrote:
> Hi, I have an index that runs into 200-300GB. It is not frequently
> updated. What are the best strategies to query on this index? [...]
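The warm-up advice above can be sketched with plain Java concurrency: run a batch of warm-up queries in parallel with as many threads as there are cores. The `fakeSearch` method here is a placeholder (an assumption for the sketch); in a real setup you would run representative queries against an IndexSearcher backed by MMapDirectory.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch: fire warm-up "queries" in parallel on start-up, sized to the core count.
public class WarmUp {
    // Placeholder for a real search call; returns a fake hit count.
    static int fakeSearch(String query) {
        return query.length();
    }

    static List<Integer> warm(List<String> queries) throws Exception {
        // One thread per core, as suggested for searcher threads.
        int threads = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (String q : queries) futures.add(pool.submit(() -> fakeSearch(q)));
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> f : futures) results.add(f.get());
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(warm(List.of("lucene", "solr", "index"))); // [6, 4, 5]
    }
}
```

The warm-up pass pulls the hot parts of the memory-mapped index files into the OS page cache, which is what makes subsequent queries cheap on disk IO.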
Re: Searching on Large Indexes
On Fri, 2014-06-27 at 12:33 +0200, Sandeep Khanzode wrote:
> I have an index that runs into 200-300GB. It is not frequently updated.

"Not frequently" means different things to different people. Could you give an approximate time span? If it is updated monthly, you might consider a full optimization after each update.

> What are the best strategies to query on this index? 1.] Should I, at
> index time, split the content, like a hash based partition, into multiple
> separate smaller indexes and aggregate the results programmatically?

Assuming you use multiple machines or independent storage for the multiple indexes, this will bring down latency. Do this if your searches are too slow.

> 2.] Should I replicate this index and provide some sort of document ID,
> and search on each node for a specific range of document IDs?

I don't really follow that idea. Are your searches only ID-based? Anyway, replication increases throughput. Do this if your servers have trouble keeping up with the full amount of work.

> 3.] Is there any way I can split or move individual segments to different
> nodes and aggregate the results?

Copy the full index. Delete all documents in copy 1 that match one half of your ID-hash function, and do the reverse for the other. As your corpus is semi-randomly distributed, scores should be comparable between the indexes, so the result sets can be easily merged.

But as Jigar says, you should consider switching to SolrCloud (or ElasticSearch), which does all this for you.

> I am not fully aware of the large scale query strategies. Can you please
> share your findings or experiences?

That depends on what you mean by large scale. You have a running system - what do you want? Scaling up? Lowering latency? Increasing throughput? More complex queries?

- Toke Eskildsen, State and University Library, Denmark
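The copy-and-delete recipe above hinges on a deterministic predicate that assigns each document ID to exactly one copy. A minimal sketch in plain Java (method names are illustrative, not from the Lucene API; the real deletion would go through IndexWriter against each copy):

```java
// Sketch: decide which copy of the index keeps a given document ID.
public class HashSplit {
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int shardOf(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Copy i keeps the document only if the ID hashes to partition i;
    // delete it from every other copy.
    static boolean keepInCopy(String docId, int copyIndex, int numCopies) {
        return shardOf(docId, numCopies) == copyIndex;
    }

    public static void main(String[] args) {
        // Every document lands in exactly one of the two copies.
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> copy " + shardOf(id, 2));
        }
    }
}
```

Because the assignment depends only on the ID, the two trimmed copies are disjoint and together cover the whole corpus, which is what makes the merged result sets complete.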