Re: Parallel optimize of index on SolrCloud.
More memory or faster disks will make a much bigger improvement than a forced merge.

What are you measuring? If it is average query time, that is not a good measure. Look at the 90th or 95th percentile. Test with queries from logs.

No user can see a 10% or 20% difference. If your managers are watching that, they are watching the wrong thing.

If you are indexing once per week, you don't really need the complexity of Solr Cloud. You can do manual sharding.

wunder

On Jul 8, 2014, at 10:55 PM, Modassar Ather modather1...@gmail.com wrote:

> Our index has almost 100M documents running on a SolrCloud of 3 shards, and each shard has an index size of about 700GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week there are no updates made to the index. Most of the queries that we run are pretty complex, with hundreds of terms using PhraseQuery, BooleanQuery, SpanQuery, wildcards, boosts etc., and take many minutes to execute. A difference of 10-20% is also a big advantage for us.
>
> We have been optimizing the index after indexing for years and it has worked well for us. Every once in a while, we upgrade Solr to the latest version and try without optimizing so that we can save the many hours it takes to optimize such a huge index, but it does not work well.
>
> Kindly provide your suggestion.
>
> Thanks,
> Modassar

--
Walter Underwood
wun...@wunderwood.org
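Walter's advice to look at the 90th or 95th percentile instead of the average can be illustrated with a small sketch. It assumes query times in milliseconds have already been pulled from the logs into an array; the class name, method names, sample values, and the nearest-rank percentile definition are illustrative choices, not something specified in the thread.

```java
import java.util.Arrays;

public class LatencyPercentiles {
    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    public static long percentile(long[] timesMs, double p) {
        if (timesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = timesMs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Arithmetic mean, shown only for comparison with the percentiles. */
    public static double mean(long[] timesMs) {
        return Arrays.stream(timesMs).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical query times in ms: nine similar queries and one very slow one.
        long[] timesMs = {120, 130, 125, 140, 135, 128, 132, 138, 126, 9000};
        System.out.println("mean = " + mean(timesMs));
        System.out.println("p90  = " + percentile(timesMs, 90));
        System.out.println("p95  = " + percentile(timesMs, 95));
    }
}
```

With samples like these, a single slow query drags the mean far away from what most queries experience, while the percentiles show where the bulk of the queries finish and how bad the tail is.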
Re: Parallel optimize of index on SolrCloud.
Hi Walter,

I wonder why you think SolrCloud isn't necessary if you're indexing once per week. Aren't the automatic failover and auto-sharding still useful? One can also do custom sharding with SolrCloud if necessary.

On Wed, Jul 9, 2014 at 11:38 AM, Walter Underwood wun...@wunderwood.org wrote:

> More memory or faster disks will make a much bigger improvement than a forced merge.
>
> What are you measuring? If it is average query time, that is not a good measure. Look at the 90th or 95th percentile. Test with queries from logs.
>
> No user can see a 10% or 20% difference. If your managers are watching that, they are watching the wrong thing.
>
> If you are indexing once per week, you don't really need the complexity of Solr Cloud. You can do manual sharding.
>
> wunder

--
Regards,
Shalin Shekhar Mangar.
Re: Parallel optimize of index on SolrCloud.
Hi All,

Thanks for your kind suggestions and inputs. We have been going the optimize way and it has helped. Testing and benchmarking around memory and performance have already been done. While optimizing, we see scope for improving it by running it in parallel, so kindly suggest how that can be achieved.

Thanks,
Modassar

On Wed, Jul 9, 2014 at 11:48 AM, Shalin Shekhar Mangar shalinman...@gmail.com wrote:

> Hi Walter,
>
> I wonder why you think SolrCloud isn't necessary if you're indexing once per week. Aren't the automatic failover and auto-sharding still useful? One can also do custom sharding with SolrCloud if necessary.
Re: Parallel optimize of index on SolrCloud.
Hi Modassar,

Have you tried hitting the cores for each replica directly (instead of using the collection)? i.e. if you had col_shard1_replica1 on node1, then send the optimize command to that core URL directly:

  curl -i -v "http://host:port/solr/col_shard1_replica1/update" -H 'Content-type:application/xml' \
    --data-binary "<optimize/>"

I haven't tried this myself but might work ;-)

Tim

On Wed, Jul 9, 2014 at 12:59 AM, Modassar Ather modather1...@gmail.com wrote:

> Hi All,
>
> Thanks for your kind suggestions and inputs. We have been going the optimize way and it has helped. Testing and benchmarking around memory and performance have already been done. While optimizing, we see scope for improving it by running it in parallel, so kindly suggest how that can be achieved.
>
> Thanks,
> Modassar
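For readers working from Java rather than the shell, Tim's per-core request could be sketched roughly as below. This is only an illustration using the JDK's HttpURLConnection: the class and method names are hypothetical, and the host, port 8983, and core name are placeholders. Note that replies in this thread report that a request like this still optimizes the whole collection one core at a time, so the sketch shows the request shape, not a confirmed way to get a parallel optimize.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class CoreOptimizeRequest {
    /** Solr XML update message that requests an optimize (forced merge). */
    public static final String OPTIMIZE_XML = "<optimize/>";

    /** Builds the update handler URL for one core; tolerates a trailing slash on the base URL. */
    public static String updateUrl(String solrBaseUrl, String coreName) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/" + coreName + "/update";
    }

    /** POSTs the optimize message to the given update URL and returns the HTTP status code. */
    public static int postOptimize(String updateUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(updateUrl).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/xml");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(OPTIMIZE_XML.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Placeholder host, port, and core name; adjust to the real node before sending anything.
        String url = updateUrl("http://localhost:8983/solr", "col_shard1_replica1");
        System.out.println("POST " + url + " with body " + OPTIMIZE_XML);
        // postOptimize(url) is deliberately not called here: on a large index it can run for hours.
    }
}
```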
Re: Parallel optimize of index on SolrCloud.
On 7/9/2014 8:49 AM, Timothy Potter wrote:

> Hi Modassar,
>
> Have you tried hitting the cores for each replica directly (instead of using the collection)? i.e. if you had col_shard1_replica1 on node1, then send the optimize command to that core URL directly:
>
>   curl -i -v "http://host:port/solr/col_shard1_replica1/update" -H 'Content-type:application/xml' \
>     --data-binary "<optimize/>"
>
> I haven't tried this myself but might work ;-)

That doesn't work. It will optimize the whole collection, one core at a time.

I thought that sending the optimize with distrib=false would limit the optimize to just the called core, but that also doesn't work. I thought a bug had been filed on the distrib=false problem, but it's been long enough that I'm no longer sure about that.

Thanks,
Shawn
Re: Parallel optimize of index on SolrCloud.
I think that’s pretty much a search time param, though it might end up being used on the update side as well. In any case, I know it doesn’t affect commit or optimize.

Also, to my knowledge, SolrCloud optimize support was never explicitly added or tested.

--
Mark Miller
about.me/markrmiller

On July 9, 2014 at 12:00:27 PM, Shawn Heisey (s...@elyograg.org) wrote:

> I thought a bug had been filed on the distrib=false problem,
Re: Parallel optimize of index on SolrCloud.
You probably do not need to force merge (mistakenly called optimize) your index. Solr does automatic merges, which work just fine.

There are only a few situations where a forced merge is even a good idea. The most common one is a replicated (non-cloud) setup with a full reindex every night.

If you need Solr Cloud, I cannot think of a situation where you would want a forced merge.

wunder

On Jul 8, 2014, at 2:01 AM, Modassar Ather modather1...@gmail.com wrote:

> Hi,
>
> I need to optimize an index created using the CloudSolrServer APIs under a SolrCloud setup of 3 instances on separate machines. Currently it optimizes sequentially if I invoke cloudSolrServer.optimize().
>
> To make it parallel I tried making three separate HttpSolrServer instances and invoked httpSolrServer.optimize() on them in parallel, but it still seems to be doing the optimization sequentially.
>
> I tried invoking optimize directly using HttpPost with the following URL and parameters, but it still seems to be sequential.
>
> *URL*: http://host:port/solr/collection/update
>
> *Parameters*:
> params.add(new BasicNameValuePair("optimize", "true"));
> params.add(new BasicNameValuePair("maxSegments", "1"));
> params.add(new BasicNameValuePair("waitFlush", "true"));
> params.add(new BasicNameValuePair("distrib", "false"));
>
> Kindly provide your suggestion and help.
>
> Regards,
> Modassar
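The client-side half of what Modassar describes, firing one optimize request per core at the same time, might look like the sketch below. It uses only the JDK; the class and method names are hypothetical, the HTTP call is abstracted behind a `sender` function so the example runs without a Solr node, and the query parameters are the ones listed in the original post. Concurrency on the client does not guarantee parallel work on the server: other replies in this thread report that Solr still optimizes one core at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelOptimizeSketch {
    /** Appends the update path and the parameters from the original post to a core base URL. */
    public static String optimizeUrl(String coreBaseUrl) {
        return coreBaseUrl + "/update?optimize=true&maxSegments=1&waitFlush=true&distrib=false";
    }

    /** Submits one request per core concurrently and returns the results in input order. */
    public static List<String> runAll(List<String> coreBaseUrls, Function<String, String> sender) {
        // One thread per core so no request waits for another on the client side.
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, coreBaseUrls.size()));
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String base : coreBaseUrls) {
                tasks.add(() -> sender.apply(optimizeUrl(base)));
            }
            List<String> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and keeps the task order.
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            if (e instanceof InterruptedException) {
                Thread.currentThread().interrupt();
            }
            throw new IllegalStateException("optimize request failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder hosts and core names, one per machine.
        List<String> cores = Arrays.asList(
                "http://host1:8983/solr/col_shard1_replica1",
                "http://host2:8983/solr/col_shard2_replica1",
                "http://host3:8983/solr/col_shard3_replica1");
        // Stand-in sender that only echoes the URL; swap in a real HTTP client call to use it.
        for (String line : runAll(cores, url -> "would request " + url)) {
            System.out.println(line);
        }
    }
}
```

Keeping the sender pluggable makes it easy to time each request separately, which is one way to check whether the server is really running the merges side by side or one after another.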
Re: Parallel optimize of index on SolrCloud.
Thanks Walter for your inputs.

Our use case and performance benchmark require us to invoke optimize. Here we see a chance to improve the performance of optimize() if it is invoked in parallel. I found that if *distrib=false* is used, the optimization will happen in parallel, but I could not find a way to set it using HttpSolrServer/CloudSolrServer. The parameter setting given in my mail above also does not seem to work.

Please let me know in what ways I can achieve a parallel optimize on SolrCloud.

Thanks,
Modassar

On Tue, Jul 8, 2014 at 7:53 PM, Walter Underwood wun...@wunderwood.org wrote:

> You probably do not need to force merge (mistakenly called optimize) your index. Solr does automatic merges, which work just fine.
>
> There are only a few situations where a forced merge is even a good idea. The most common one is a replicated (non-cloud) setup with a full reindex every night.
>
> If you need Solr Cloud, I cannot think of a situation where you would want a forced merge.
>
> wunder
Re: Parallel optimize of index on SolrCloud.
I seriously doubt that you are required to force merge. How much improvement? And is the big performance cost also OK?

I have worked on search engines that do automatic merges and offer forced merges for over fifteen years. For all that time, forced merges have usually caused problems.

Stop doing forced merges.

wunder

On Jul 8, 2014, at 10:09 PM, Modassar Ather modather1...@gmail.com wrote:

> Thanks Walter for your inputs.
>
> Our use case and performance benchmark require us to invoke optimize. Here we see a chance to improve the performance of optimize() if it is invoked in parallel. I found that if *distrib=false* is used, the optimization will happen in parallel, but I could not find a way to set it using HttpSolrServer/CloudSolrServer. The parameter setting given in my mail above also does not seem to work.
>
> Please let me know in what ways I can achieve a parallel optimize on SolrCloud.
>
> Thanks,
> Modassar

--
Walter Underwood
wun...@wunderwood.org
Re: Parallel optimize of index on SolrCloud.
Our index has almost 100M documents running on a SolrCloud of 3 shards, and each shard has an index size of about 700GB (for the record, we are not using stored fields - our documents are pretty large). We perform a full indexing every weekend and during the week there are no updates made to the index. Most of the queries that we run are pretty complex, with hundreds of terms using PhraseQuery, BooleanQuery, SpanQuery, wildcards, boosts etc., and take many minutes to execute. A difference of 10-20% is also a big advantage for us.

We have been optimizing the index after indexing for years and it has worked well for us. Every once in a while, we upgrade Solr to the latest version and try without optimizing so that we can save the many hours it takes to optimize such a huge index, but it does not work well.

Kindly provide your suggestion.

Thanks,
Modassar

On Wed, Jul 9, 2014 at 10:47 AM, Walter Underwood wun...@wunderwood.org wrote:

> I seriously doubt that you are required to force merge. How much improvement? And is the big performance cost also OK?
>
> I have worked on search engines that do automatic merges and offer forced merges for over fifteen years. For all that time, forced merges have usually caused problems.
>
> Stop doing forced merges.
>
> wunder
>
> --
> Walter Underwood
> wun...@wunderwood.org