Re: Solr Heap, MMaps and Garbage Collection
New gen should be big enough to handle all allocations that have a lifetime of a single request, considering that you'll have multiple concurrent requests. If new gen routinely overflows, short-lived objects will be promoted into the old gen. Yes, you need to go to CMS.

I have usually seen the hit rates on query result and doc caches be fairly similar, with doc cache somewhat higher. Cache hit rates depend on the number of queries between updates. If you update once per day and get a million queries or so, your hit rates can get pretty good. 70-80% seems typical for doc cache on an infrequently updated index. We stay around 75% on our busiest 4M-doc index.

The query result cache is the most important, because it saves the most work. Ours stays around 20%, but I should spend some time improving that.

Your perm gen size is very big. I think we run with 128 MB.

wunder

On Mar 2, 2014, at 10:54 PM, KNitin wrote:

> Thanks, Walter
>
> Hit rate on the document caches is close to 70-80% and the filter caches
> get a 100% hit rate (since most of our queries filter on the same fields
> but have a different q parameter). The query result cache is not of great
> importance to me since the hit rate there is almost negligible.
>
> Does this mean I need to increase the size of my filter and document
> caches for large indices?
>
> My 25 GB heap usage splits up as follows:
>
> 1. 19 GB - Old Gen (100% pool utilization)
> 2. 3 GB - New Gen (50% pool utilization)
> 3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
> 4. Survivor space is on the order of 300-400 MB and is almost always 100%
> full. (Is this a major issue?)
>
> We are also currently using the Parallel GC collector but are planning to
> move to CMS for shorter stop-the-world GC times. If I increase the filter
> cache and document cache entry sizes, they would also go to the old gen,
> right?
>
> A very naive question: how is increasing young gen going to help if we
> know that Solr is already pushing major caches and other objects to old
> gen because of their nature? My young gen pool utilization is still well
> under 50%.
>
> Thanks
> Nitin
>
> On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wrote:
>
>> An LRU cache will always fill up the old generation. Old objects are
>> ejected, and those are usually in the old generation.
>>
>> Increasing the heap size will not eliminate this. It will make major,
>> stop-the-world collections longer.
>>
>> Increase the new generation size until the rate of old gen increase
>> slows down. Then choose a total heap size to control the frequency (and
>> duration) of major collections.
>>
>> We run with the new generation at about 25% of the heap, so 8 GB total
>> and a 2 GB newgen.
>>
>> A 512-entry cache is very small for query results or docs. We run with
>> 10K or more entries for those. The filter cache size depends on your
>> usage. We have only a handful of different filter queries, so a tiny
>> cache is fine.
>>
>> What is your hit rate on the caches?
>>
>> wunder
>>
>> On Mar 2, 2014, at 7:42 PM, KNitin wrote:
>>
>>> Hi
>>>
>>> I have a very large index for a few collections, and when they are
>>> being queried I see the old gen space close to 100% usage all the
>>> time. The system becomes extremely slow due to GC activity right after
>>> that, and it gets into this cycle very often.
>>>
>>> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and
>>> the rest is given to the OS. I get a lot of hits in the filter, query
>>> result and document caches, and the size of all the caches is around
>>> 512 entries per collection. Are all the caches used by Solr on or off
>>> heap?
>>>
>>> Given this scenario, where GC is the primary bottleneck, what are good
>>> recommended memory settings for Solr? Should I increase the heap
>>> memory (that will only postpone the problem until the heap becomes
>>> full again after a while)? Will memory maps help at all in this
>>> scenario?
>>>
>>> Kindly advise on the best practices.
>>> Thanks
>>> Nitin
>>
>> --
>> Walter Underwood
>> wun...@wunderwood.org
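For readers following this thread, the settings Walter describes (CMS instead of the parallel collector, newgen at ~25% of an 8 GB heap, 128 MB perm gen) map onto JVM startup flags roughly as below. This is a sketch for the Java 6/7-era JVMs current when this thread was written; the sizes are the numbers from the discussion, not tuned recommendations.

```shell
# Illustrative flags only -- sizes follow the numbers in the thread.
# 8 GB heap, 2 GB new generation (~25%), CMS for the old generation,
# and a 128 MB perm gen cap.
java -Xms8g -Xmx8g \
  -XX:NewSize=2g -XX:MaxNewSize=2g \
  -XX:MaxPermSize=128m \
  -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
  -jar start.jar
```

Enabling GC logging (`-verbose:gc -XX:+PrintGCDetails`) alongside these flags is the usual way to verify whether the old-gen growth rate actually slows down after resizing newgen.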
Re: Solr Heap, MMaps and Garbage Collection
Actually, I haven't ever seen a PermGen of 2.8 GB, so you must have a very special use case with Solr. For my little index with 60 million docs and 170 GB index size I gave PermGen 82 MB, and it is only using 50.6 MB for a single VM.

Permanent Generation (PermGen) is completely separate from the heap:

Permanent Generation (non-heap): the pool containing all the reflective data of the virtual machine itself, such as class and method objects. With Java VMs that use class data sharing, this generation is divided into read-only and read-write areas.

Regards
Bernd

On 03.03.2014 07:54, KNitin wrote:

> Thanks, Walter
>
> Hit rate on the document caches is close to 70-80% and the filter caches
> get a 100% hit rate (since most of our queries filter on the same fields
> but have a different q parameter). The query result cache is not of great
> importance to me since the hit rate there is almost negligible.
>
> Does this mean I need to increase the size of my filter and document
> caches for large indices?
>
> My 25 GB heap usage splits up as follows:
>
> 1. 19 GB - Old Gen (100% pool utilization)
> 2. 3 GB - New Gen (50% pool utilization)
> 3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
> 4. Survivor space is on the order of 300-400 MB and is almost always 100%
> full. (Is this a major issue?)
>
> We are also currently using the Parallel GC collector but are planning to
> move to CMS for shorter stop-the-world GC times. If I increase the filter
> cache and document cache entry sizes, they would also go to the old gen,
> right?
>
> A very naive question: how is increasing young gen going to help if we
> know that Solr is already pushing major caches and other objects to old
> gen because of their nature? My young gen pool utilization is still well
> under 50%.
>
> Thanks
> Nitin
>
> On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wrote:
>
>> An LRU cache will always fill up the old generation. Old objects are
>> ejected, and those are usually in the old generation.
>>
>> Increasing the heap size will not eliminate this. It will make major,
>> stop-the-world collections longer.
>>
>> Increase the new generation size until the rate of old gen increase
>> slows down. Then choose a total heap size to control the frequency (and
>> duration) of major collections.
>>
>> We run with the new generation at about 25% of the heap, so 8 GB total
>> and a 2 GB newgen.
>>
>> A 512-entry cache is very small for query results or docs. We run with
>> 10K or more entries for those. The filter cache size depends on your
>> usage. We have only a handful of different filter queries, so a tiny
>> cache is fine.
>>
>> What is your hit rate on the caches?
>>
>> wunder
>>
>> On Mar 2, 2014, at 7:42 PM, KNitin wrote:
>>
>>> Hi
>>>
>>> I have a very large index for a few collections, and when they are
>>> being queried I see the old gen space close to 100% usage all the
>>> time. The system becomes extremely slow due to GC activity right after
>>> that, and it gets into this cycle very often.
>>>
>>> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and
>>> the rest is given to the OS. I get a lot of hits in the filter, query
>>> result and document caches, and the size of all the caches is around
>>> 512 entries per collection. Are all the caches used by Solr on or off
>>> heap?
>>>
>>> Given this scenario, where GC is the primary bottleneck, what are good
>>> recommended memory settings for Solr? Should I increase the heap
>>> memory (that will only postpone the problem until the heap becomes
>>> full again after a while)? Will memory maps help at all in this
>>> scenario?
>>>
>>> Kindly advise on the best practices.
>>> Thanks
>>> Nitin

--
Bernd Fehling                      Bielefeld University Library
Dipl.-Inform. (FH)                 LibTec - Library Technology
Universitätsstr. 25                and Knowledge Management
33615 Bielefeld
Tel. +49 521 106-4060
bernd.fehling(at)uni-bielefeld.de

BASE - Bielefeld Academic Search Engine - www.base-search.net
Re: Solr Heap, MMaps and Garbage Collection
Thanks, Walter

Hit rate on the document caches is close to 70-80% and the filter caches get a 100% hit rate (since most of our queries filter on the same fields but have a different q parameter). The query result cache is not of great importance to me since the hit rate there is almost negligible.

Does this mean I need to increase the size of my filter and document caches for large indices?

My 25 GB heap usage splits up as follows:

1. 19 GB - Old Gen (100% pool utilization)
2. 3 GB - New Gen (50% pool utilization)
3. 2.8 GB - Perm Gen (I am guessing this is because of interned strings)
4. Survivor space is on the order of 300-400 MB and is almost always 100% full. (Is this a major issue?)

We are also currently using the Parallel GC collector but are planning to move to CMS for shorter stop-the-world GC times. If I increase the filter cache and document cache entry sizes, they would also go to the old gen, right?

A very naive question: how is increasing young gen going to help if we know that Solr is already pushing major caches and other objects to old gen because of their nature? My young gen pool utilization is still well under 50%.

Thanks
Nitin

On Sun, Mar 2, 2014 at 9:31 PM, Walter Underwood wrote:

> An LRU cache will always fill up the old generation. Old objects are
> ejected, and those are usually in the old generation.
>
> Increasing the heap size will not eliminate this. It will make major,
> stop-the-world collections longer.
>
> Increase the new generation size until the rate of old gen increase slows
> down. Then choose a total heap size to control the frequency (and
> duration) of major collections.
>
> We run with the new generation at about 25% of the heap, so 8 GB total
> and a 2 GB newgen.
>
> A 512-entry cache is very small for query results or docs. We run with
> 10K or more entries for those. The filter cache size depends on your
> usage. We have only a handful of different filter queries, so a tiny
> cache is fine.
>
> What is your hit rate on the caches?
>
> wunder
>
> On Mar 2, 2014, at 7:42 PM, KNitin wrote:
>
>> Hi
>>
>> I have a very large index for a few collections, and when they are
>> being queried I see the old gen space close to 100% usage all the time.
>> The system becomes extremely slow due to GC activity right after that,
>> and it gets into this cycle very often.
>>
>> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and
>> the rest is given to the OS. I get a lot of hits in the filter, query
>> result and document caches, and the size of all the caches is around
>> 512 entries per collection. Are all the caches used by Solr on or off
>> heap?
>>
>> Given this scenario, where GC is the primary bottleneck, what are good
>> recommended memory settings for Solr? Should I increase the heap memory
>> (that will only postpone the problem until the heap becomes full again
>> after a while)? Will memory maps help at all in this scenario?
>>
>> Kindly advise on the best practices.
>> Thanks
>> Nitin
Re: stopwords issue with edismax
As I suggested, you have a couple of fields that do not ignore stop words, so the stop word must be present in at least one of those fields:

(number:of^3.0 | all_code:of^2.0)

The solution would be to remove the "number" and "all_code" fields from qf.

-- Jack Krupansky

-----Original Message-----
From: sureshrk19
Sent: Monday, March 3, 2014 1:05 AM
To: solr-user@lucene.apache.org
Subject: Re: stopwords issue with edismax

Jack,

Thanks for the reply. Yes, your observation is right: I see that stop words are not being ignored at query time.

Say I'm searching for 'bank of america'. I'm expecting 'of' not to be part of the search, but here I see 'of' being sent. The query syntax is the same for the 'OR' and 'AND' operators, and 'OR' returns results as expected; but in my case I want to use 'AND'.

Here is the debug query information...

"parsedquery":"(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank | number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)) DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0)) DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america | number:america^3.0 | party:america^3.0 | all_code:america^2.0 | name:america^5.0)))~3))/no_coord",

"parsedquery_toString":"+(((ent_name:bank^7.0 | all_text:bank | number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0) (number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america | number:america^3.0 | party:america^3.0 | all_code:america^2.0 | name:america^5.0))~3)"

Is there any reason why stop words are not being ignored? I checked schema.xml and the stopword filter is present:

--
View this message in context: http://lucene.472066.n3.nabble.com/stopwords-issue-with-edismax-tp4120339p4120815.html
Sent from the Solr - User mailing list archive at Nabble.com.
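To make Jack's suggestion concrete, a request handler following his advice might look roughly like the sketch below: the `qf` from the parsed query in this thread, minus the two fields (`number`, `all_code`) whose analyzers do not strip stop words. The handler name and `mm` value are illustrative assumptions, not from the thread.

```xml
<!-- Hypothetical sketch: qf with the non-stopword-aware fields removed,
     so "of" no longer survives into a DisjunctionMaxQuery clause. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">ent_name^7.0 name^5.0 party^3.0 all_text</str>
  </lst>
</requestHandler>
```

The alternative is to make `number` and `all_code` use an analyzer chain that includes `solr.StopFilterFactory`, so every field in `qf` drops the same stop words.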
Re: stopwords issue with edismax
Jack,

Thanks for the reply. Yes, your observation is right: I see that stop words are not being ignored at query time.

Say I'm searching for 'bank of america'. I'm expecting 'of' not to be part of the search, but here I see 'of' being sent. The query syntax is the same for the 'OR' and 'AND' operators, and 'OR' returns results as expected; but in my case I want to use 'AND'.

Here is the debug query information...

"parsedquery":"(+((DisjunctionMaxQuery((ent_name:bank^7.0 | all_text:bank | number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0)) DisjunctionMaxQuery((number:of^3.0 | all_code:of^2.0)) DisjunctionMaxQuery((ent_name:america^7.0 | all_text:america | number:america^3.0 | party:america^3.0 | all_code:america^2.0 | name:america^5.0)))~3))/no_coord",

"parsedquery_toString":"+(((ent_name:bank^7.0 | all_text:bank | number:bank^3.0 | party:bank^3.0 | all_code:bank^2.0 | name:bank^5.0) (number:of^3.0 | all_code:of^2.0) (ent_name:america^7.0 | all_text:america | number:america^3.0 | party:america^3.0 | all_code:america^2.0 | name:america^5.0))~3)"

Is there any reason why stop words are not being ignored? I checked schema.xml and the stopword filter is present:

--
View this message in context: http://lucene.472066.n3.nabble.com/stopwords-issue-with-edismax-tp4120339p4120815.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Heap, MMaps and Garbage Collection
An LRU cache will always fill up the old generation. Old objects are ejected, and those are usually in the old generation.

Increasing the heap size will not eliminate this. It will make major, stop-the-world collections longer.

Increase the new generation size until the rate of old gen increase slows down. Then choose a total heap size to control the frequency (and duration) of major collections.

We run with the new generation at about 25% of the heap, so 8 GB total and a 2 GB newgen.

A 512-entry cache is very small for query results or docs. We run with 10K or more entries for those. The filter cache size depends on your usage. We have only a handful of different filter queries, so a tiny cache is fine.

What is your hit rate on the caches?

wunder

On Mar 2, 2014, at 7:42 PM, KNitin wrote:

> Hi
>
> I have a very large index for a few collections, and when they are being
> queried I see the old gen space close to 100% usage all the time. The
> system becomes extremely slow due to GC activity right after that, and it
> gets into this cycle very often.
>
> I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the
> rest is given to the OS. I get a lot of hits in the filter, query result
> and document caches, and the size of all the caches is around 512 entries
> per collection. Are all the caches used by Solr on or off heap?
>
> Given this scenario, where GC is the primary bottleneck, what are good
> recommended memory settings for Solr? Should I increase the heap memory
> (that will only postpone the problem until the heap becomes full again
> after a while)? Will memory maps help at all in this scenario?
>
> Kindly advise on the best practices.
> Thanks
> Nitin
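The cache sizes discussed above live in the `<query>` section of solrconfig.xml. A rough sketch, using the numbers from this thread (512-entry filter cache for a handful of filter queries, 10K entries for query results and docs) — illustrative values, not tuned recommendations:

```xml
<!-- Sizes taken from the discussion above; tune for your own hit rates.
     These are on-heap caches, so larger sizes mean more old-gen pressure. -->
<query>
  <filterCache    class="solr.FastLRUCache" size="512"
                  initialSize="512"  autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache"   size="10240"
                  initialSize="1024" autowarmCount="0"/>
  <documentCache  class="solr.LRUCache"     size="10240"
                  initialSize="1024" autowarmCount="0"/>
</query>
```

Note that every evicted entry is garbage the collector must reclaim, which is exactly why an LRU cache keeps the old generation full, as described above.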
Solr Heap, MMaps and Garbage Collection
Hi

I have a very large index for a few collections, and when they are being queried I see the old gen space close to 100% usage all the time. The system becomes extremely slow due to GC activity right after that, and it gets into this cycle very often.

I have given Solr close to 30 GB of heap on a 65 GB RAM machine, and the rest is given to the OS. I get a lot of hits in the filter, query result and document caches, and the size of all the caches is around 512 entries per collection. Are all the caches used by Solr on or off heap?

Given this scenario, where GC is the primary bottleneck, what are good recommended memory settings for Solr? Should I increase the heap memory (that will only postpone the problem until the heap becomes full again after a while)? Will memory maps help at all in this scenario?

Kindly advise on the best practices.
Thanks
Nitin
Re: SolrCloud: heartbeat succeeding while node has failing SSD?
The heartbeat that keeps the node alive is the connection it maintains with ZooKeeper. We don't currently have anything built in that will actively make sure each node can serve queries and remove it from clusterstate.json if it cannot. If a replica is maintaining its connection with ZooKeeper and, in most cases, if it is accepting updates, it will appear up.

Load balancing should handle the failures, but I guess it depends on how sticky the request failures are.

In the past, I've seen this handled on a different search engine by having a variety of external agent scripts that would occasionally attempt a query, and if things did not go right, kill the process to cause it to start up again (as a supervised process).

I'm not sure what the right long-term feature for Solr is here, but feel free to start a JIRA issue around it. One simple improvement might even be a background thread that periodically checks some local readings and, depending on the results, pulls the node out of the mix as best it can (removes itself from clusterstate.json or simply closes its ZooKeeper connection).

- Mark

http://about.me/markrmiller

On Mar 2, 2014, at 3:42 PM, Gregg Donovan wrote:

> We had a brief SolrCloud outage this weekend when a node's SSD began to
> fail but the node still appeared to be up to the rest of the SolrCloud
> cluster (i.e. still green in clusterstate.json). Distributed queries that
> reached this node would fail, but whatever heartbeat keeps the node in
> clusterstate.json must have continued to succeed.
>
> We eventually had to power the node down to get it removed from
> clusterstate.json.
>
> This is our first foray into SolrCloud, so I'm still somewhat fuzzy on
> what the default heartbeat mechanism is and how we may augment it to be
> sure that the disk is checked as part of the heartbeat and/or we verify
> that it can serve queries.
>
> Any pointers would be appreciated.
>
> Thanks!
>
> --Gregg
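The "external agent" idea Mark describes could be sketched as a watchdog like the one below. Everything here is a hypothetical illustration — the URL, core name, process pattern, thresholds, and the assumption of a supervisor (runit, supervisord, etc.) that restarts the killed process are all placeholders, not anything Solr ships.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: run a cheap query against the local node;
# after repeated failures, kill Solr and let the supervisor restart it.
SOLR_URL="http://localhost:8983/solr/collection1/select?q=*:*&rows=0"
FAILS=0
while true; do
  if curl -sf --max-time 5 "$SOLR_URL" > /dev/null; then
    FAILS=0                 # query succeeded; reset the counter
  else
    FAILS=$((FAILS + 1))    # query failed or timed out
  fi
  if [ "$FAILS" -ge 3 ]; then
    pkill -f start.jar      # supervised process gets restarted
    FAILS=0
  fi
  sleep 30
done
```

The query check matters because, as noted above, a node can hold its ZooKeeper connection (and look green in clusterstate.json) while its disk is failing and every search against it errors out.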
SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param
Hi,

I'm using Solr 4.0 Final (yes, I know I need to upgrade). I'm getting this error:

SEVERE: org.apache.solr.common.SolrException: no field name specified in query and no default specified via 'df' param

I applied this fix: https://issues.apache.org/jira/browse/SOLR-3646 and unfortunately the error persists.

I'm using a multi-shard environment and the error is only happening on one of the shards. I've already updated about half of the other shards with the missing default text in /browse, but the error persists on that one shard.

Can anyone tell me how to make the error go away?

Thanks,

--
View this message in context: http://lucene.472066.n3.nabble.com/SEVERE-org-apache-solr-common-SolrException-no-field-name-specified-in-query-and-no-default-specifiem-tp4120789.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr is NoSQL database or not?
On 3/1/2014 6:53 PM, Jack Krupansky wrote:

> NoSQL? To me it's just a marketing term, like Big Data. Data store? That
> does imply support for persistence, as opposed to mere caching, but mere
> persistence doesn't assure that the store is suitable for use as a System
> of Record, which in my view is a requirement for a true database. So, I
> wouldn't assert that a data store is a database.

I agree, Jack. Our experience has been that we don't actually need everything a true ACID "database" has to offer. In particular, we don't care all that much about the I (isolation) part, since we don't use Solr to store transactional data, just documents, which are loaded by a small number of writers that we coordinate.

If I had to pick the one thing that would make you say "well, um, not really a database," it would be the transactional model: when anyone commits, everyone sees the updates.

-Mike
SolrCloud: heartbeat succeeding while node has failing SSD?
We had a brief SolrCloud outage this weekend when a node's SSD began to fail but the node still appeared to be up to the rest of the SolrCloud cluster (i.e. still green in clusterstate.json). Distributed queries that reached this node would fail, but whatever heartbeat keeps the node in clusterstate.json must have continued to succeed.

We eventually had to power the node down to get it removed from clusterstate.json.

This is our first foray into SolrCloud, so I'm still somewhat fuzzy on what the default heartbeat mechanism is and how we may augment it to be sure that the disk is checked as part of the heartbeat and/or we verify that it can serve queries.

Any pointers would be appreciated.

Thanks!

--Gregg
Re: Cluster state ranges are all null after reboot
Thanks again for the info. Hopefully we'll find some more clues if it continues to occur. The ops team are looking at alternative deployment methods as well, so we might end up avoiding the issue altogether.

Ta,
Greg

On 28 February 2014 02:42, Shalin Shekhar Mangar wrote:

> I think it is just a side-effect of the current implementation that
> the ranges are assigned linearly. You can also verify this by choosing
> a document from each shard and running its uniqueKey against the
> CompositeIdRouter's sliceHash method and verifying that it is included
> in the range.
>
> I couldn't reproduce this, but I didn't try too hard either. If you are
> able to isolate a reproducible example then please do report back.
> I'll spend some time reviewing the related code again to see if I can
> spot the problem.
>
> On Thu, Feb 27, 2014 at 2:19 AM, Greg Pendlebury wrote:
>
>> Thanks Shalin, that code might be helpful... do you know if there is a
>> reliable way to line up the ranges with the shard numbers? When the
>> problem occurred we had 80 million documents already in the index, and
>> could not issue even a basic 'deleteById' call. I'm tempted to assume
>> they are just assigned linearly, since our Test and Prod clusters both
>> appear to work that way now, but I can't be sure whether that is by
>> design or just happenstance of boot order.
>>
>> And no, unfortunately we have not been able to reproduce this issue
>> consistently, despite trying a number of different things such as
>> graceless stop/start and screwing with the underlying WAR file (which
>> is what we thought puppet might be doing). The problem has occurred
>> twice since, but always in our Test environment. The fact that Test has
>> only a single replica per shard is the most likely culprit for me, but
>> as mentioned, even gracelessly killing the last replica in the cluster
>> seems to leave the range set correctly in clusterstate when we test it
>> in isolation.
>>
>> In production (45 JVMs, 15 shards with 3 replicas each) we've never
>> seen the problem, despite a similar number of rollouts for version
>> changes etc.
>>
>> Ta,
>> Greg
>>
>> On 26 February 2014 23:46, Shalin Shekhar Mangar wrote:
>>
>>> If you have 15 shards, and assuming that you've never used shard
>>> splitting, you can calculate the shard ranges by using:
>>>
>>> new CompositeIdRouter().partitionRange(15, new CompositeIdRouter().fullRange())
>>>
>>> This gives me:
>>>
>>> [8000-9110, 9111-a221, a222-b332,
>>> b333-c443, c444-d554, d555-e665,
>>> e666-f776, f777-887, 888-1998,
>>> 1999-2aa9, 2aaa-3bba, 3bbb-4ccb,
>>> 4ccc-5ddc, 5ddd-6eed, 6eee-7fff]
>>>
>>> Have you done any more investigation into why this happened? Anything
>>> strange in the logs? Are you able to reproduce this in a test
>>> environment?
>>>
>>> On Wed, Feb 19, 2014 at 5:16 AM, Greg Pendlebury wrote:
>>>
>>>> We've got a 15-shard cluster spread across 3 hosts. This morning our
>>>> puppet software rebooted them all, and afterwards the 'range' for
>>>> each shard has become null in ZooKeeper. Is there any way to restore
>>>> this value, short of rebuilding a fresh index?
>>>>
>>>> I've read various questions from people with a similar problem,
>>>> although in those cases it is usually a single shard that has become
>>>> null, allowing them to infer what the value should be and manually
>>>> fix it in ZK. In this case I have no idea what the ranges should be.
>>>> This is our test cluster, and checking production I can see that the
>>>> ranges don't appear to be predictable based on the shard number.
>>>>
>>>> I'm also not certain why it even occurred. Our test cluster only has
>>>> a single replica per shard, so when a JVM is rebooted the cluster is
>>>> unavailable... would that cause this? Production has 3 replicas so we
>>>> can do rolling reboots.
>>>
>>> --
>>> Regards,
>>> Shalin Shekhar Mangar.
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: How to best handle search like Dave & David
If you are trying to serve results as users are typing, then you can use EdgeNGramFilter (see https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory).

Let's say you configure your field like this, as shown in the Solr wiki. Then this is what happens at index time for your tokens:

David ---> | LowerCaseTokenizerFactory | ---> david ---> | EdgeNGramFilterFactory | ---> da dav davi david

Dave ---> | LowerCaseTokenizerFactory | ---> dave ---> | EdgeNGramFilterFactory | ---> da dav dave

And at query time, when your user enters 'Dav', it will match both of those tokens. Note that the moment your user types more, say 'davi', it won't match 'Dave', since you are doing edge n-gramming only at index time and not at query time.

You can also do edge n-gramming at query time if you want 'Dave' to match 'David', probably keeping a larger minGramSize (in this case 3) to avoid noise (like, say, 'Dave' matching 'Dana', though with a lower score), but it will be expensive to do n-gramming at query time.

On Fri, Feb 28, 2014 at 3:22 PM, Susheel Kumar <susheel.ku...@thedigitalgroup.net> wrote:

> Hi,
>
> We have name searches on Solr for millions of documents. One user may
> search for "Morrison Dave" while another searches for "Morrison David".
> What's the best way to handle it so that both bring back similar results?
> Adding synonyms is the option we are using right now.
>
> But we may need to add around 50,000+ synonyms for different names; for
> each specific name there can be a couple of synonyms, e.g. for Richard it
> can be Rich, Rick, Richie etc.
>
> Any experience adding so many synonyms, or any other thoughts? Stemming
> may help in a few situations, but not for names like Dave and David.
>
> Thanks,
> Susheel
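The field configuration referenced above did not survive the list archive, but a sketch consistent with the token examples in the mail (LowerCaseTokenizerFactory, index-time-only edge n-grams starting at two characters) would look roughly like this. The field type name and maxGramSize are illustrative assumptions, not the exact wiki config.

```xml
<!-- Reconstructed sketch, assuming minGramSize=2 to match the
     "da dav davi david" example above; name and maxGramSize are made up. -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gramming at query time, per the discussion above -->
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Adding the EdgeNGramFilterFactory to the `type="query"` analyzer as well (with a larger minGramSize) would give the 'Dave'-matches-'David' behavior described at the end of the message, at the query-time cost noted there.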
Re: Date query not returning results only some time
Erick,

Thanks a lot for the detailed explanation. That clarified things for me.

On Sun, Mar 2, 2014 at 10:04 AM, Erick Erickson wrote:

> Well, in M/S setups the master shouldn't be searching at all,
> but that's a nit.
>
> That aside, whether the master has opened a new searcher or not is
> irrelevant to what the slave replicates. What _is_ relevant is whether
> any of the files on disk that comprise the index (i.e. the segment
> files) have changed -- really, whether any of them have been closed or
> merged since the last sync. Imagine it like this (this isn't quite what
> happens, but it's a useful model): the slave says "here's a list of my
> segments, is it the same as the list of closed segments on the master?"
> If the answer is no, a replication is performed. Actually, this is done
> much more efficiently, but that's the idea.
>
> You seem to really be asking about whether searches on the various
> nodes (master + slaves) are consistent. This is one of the problems
> with M/S setups: they can differ by whatever has happened in the
> polling interval.
>
> The state of the master's searchers just doesn't enter the picture.
>
> Glad the problem is solved no matter what.
>
> Erick
>
> On Sat, Mar 1, 2014 at 10:26 PM, Arun Rangarajan wrote:
>
>>> The slave is polling the master after the interval specified in
>>> solrconfig.xml. The slave essentially asks "has anything changed?"
>>> If so, the changes are brought down to the slave.
>>
>> Yes, I understand this, but if the master does not open a new searcher
>> after auto commits (which would indicate that the new index is not
>> quite ready yet) and the master is still using the old index to serve
>> search requests, I would expect the slave to do the same. Or the slave
>> should at least not replicate, or not open a new searcher, until the
>> master has opened a new searcher. But that is just the way I see it
>> and it may be wrong.
>>
>>> What's your polling interval on the slave anyway? Sounds like it's
>>> quite frequent if you notice this immediately after the DIH starts.
>>
>> No, the polling interval is set to 1 hour, but the full import was set
>> to run at 1 AM. I believe a delete followed by a few docs got
>> replicated after the first few auto commits, when the slave probably
>> polled around 1:10 AM, and the slave index had only a few docs for an
>> hour until the next poll, which is why the date query was returning
>> empty results for exactly that one hour. (The full index takes about
>> 1.5 hours to finish.)
>>
>> Anyway, the problem is now solved by specifying "clean=false" in the
>> DIH full-import command.
>>
>> On Sat, Mar 1, 2014 at 9:12 AM, Erick Erickson wrote:
>>
>>> bq: the slave anyway replicates the index after auto commits! (Is
>>> this desired behavior?)
>>>
>>> Absolutely it's desired behavior. The slave is polling the master
>>> after the interval specified in solrconfig.xml. The slave essentially
>>> asks "has anything changed?" If so, the changes are brought down to
>>> the slave. And by definition, commits change the index, especially if
>>> all docs have been deleted.
>>>
>>> What's your polling interval on the slave anyway? Sounds like it's
>>> quite frequent if you notice this immediately after the DIH starts.
>>>
>>> Best,
>>> Erick
>>>
>>> On Fri, Feb 28, 2014 at 9:04 PM, Arun Rangarajan wrote:
>>>
>>>> I believe I figured out what the issue is. Even though we do not
>>>> open a new searcher on the master during full import, the slave
>>>> replicates the index after auto commits anyway! (Is this desired
>>>> behavior?) Since "clean=true", this meant all the docs were deleted
>>>> on the slave and a partial index got replicated! The reason only the
>>>> date query did not return any results is that recently created docs
>>>> have higher doc IDs and we index in ascending order of IDs!
>>>>
>>>> I believe I have two options:
>>>> - as Chris suggested, use "clean=false" so the existing docs are not
>>>> deleted first on the slave. Since we have primary keys, newly added
>>>> docs will overwrite old docs as they get added.
>>>> - disable replication after commits; replicate only after optimize.
>>>>
>>>> Thx all for your help.
>>>>
>>>> On Fri, Feb 28, 2014 at 8:06 PM, Arun Rangarajan wrote:
>>>>
>>>>> Thx, Erick and Chris.
>>>>>
>>>>> This is indeed very strange. Other queries which do not restrict by
>>>>> the date field are returning results, so the index is definitely
>>>>> not empty. Has it got something to do with the date query part,
>>>>> with NOW/DAY or something in here?
>>>>>
>>>>> first_publish_date:[NOW/DAY-33DAYS TO NOW/DAY-3DAYS]
>>>>>
>>>>> For now, I have set up a script to log the number of docs on the
>>>>> slave every minute. Will monitor and report the findings.
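The fix that resolved this thread — a DataImportHandler full import that does not first delete the existing documents — is triggered with the `clean` parameter. The host, core, and handler path below are placeholders; the parameters are standard DIH ones.

```shell
# clean=false keeps existing docs in place during the full import;
# with a uniqueKey defined, re-imported docs overwrite their old versions.
curl "http://master-host:8983/solr/mycore/dataimport?command=full-import&clean=false"
```

With `clean=true` (the default), the import begins with a delete-all, and any auto commit during the import exposes that partially rebuilt index to replication — which is exactly the one-hour window of empty date-query results described above.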
Re: Elevation and core create
Hmmm, you _ought_ to be able to specify a relative path, e.g. solrconfig_slave.xml:solrconfig.xml,x.xml,y.xml. But there's certainly the chance that this is hard-coded in the query elevation component, so I can't say that this'll work with assurance. Best, Erick On Sun, Mar 2, 2014 at 6:14 AM, David Stuart wrote: > Hi, sorry for the cross post, but I got no response in the dev group so assumed > I posted in the wrong place. > > > > I am using Solr 3.6 and am trying to automate the deployment of cores with a > custom elevate file. It is proving to be difficult: most of the files > (schema, stop words etc.) support absolute paths, but elevate seems to need to be in > either a conf directory as a sibling to data or in the data directory itself. > I am able to achieve my goal by having a secondary process that places the > file, but thought I would ask the group just in case I have missed the obvious. > If I move to Solr 4, is it fixed there? I could also go down the route of > extending the SolrCore create function to accept additional params and move > the file into the defined data directory. > > Ideas? > > Thanks for your help > David Stuart > M +44(0) 778 854 2157 > T +44(0) 845 519 5465 > www.axistwelve.com > Axis12 Ltd | The Ivories | 6/18 Northampton Street, London | N1 2HY | UK > > AXIS12 - Enterprise Web Solutions > > Reg Company No. 7215135 > VAT No. 997 4801 60 > > This e-mail is strictly confidential and intended solely for the ordinary > user of the e-mail account to which it is addressed. If you have received > this e-mail in error please inform Axis12 immediately by return e-mail or > telephone. We advise that in keeping with good computing practice the > recipient of this e-mail should ensure that it is virus free. We do not > accept any responsibility for any loss or damage that may arise from the use > of this email or its contents. > > >
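Erick's relative-path suggestion looks like the ReplicationHandler confFiles mechanism. If that route does work for the elevate file, the master-side config would be sketched roughly as below; the elevate.xml entry and file names are illustrative assumptions, not a verified fix for 3.6:

```xml
<!-- Master side, in solrconfig.xml: push config files (including a
     hypothetical elevate.xml) to slaves on commit. Names are illustrative. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">solrconfig_slave.xml:solrconfig.xml,schema.xml,elevate.xml</str>
  </lst>
</requestHandler>
```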
Re: Date query not returning results only some time
Well, in M/S setups the master shouldn't be searching at all, but that's a nit. That aside, whether the master has opened a new searcher or not is irrelevant to what the slave replicates. What _is_ relevant is whether any of the files on disk that comprise the index (i.e. the segment files) have been changed. Really, whether any of them have been closed/merged/whatever since the last sync. Imagine it like this (this isn't quite what happens, but it's a useful model). The slave says "here's a list of my segments, is it the same as the list of closed segments on the master?" If the answer is no, a replication is performed. Actually, this is done much more efficiently, but that's the idea. You seem to be really asking about the whole issue of whether searches on the various nodes (master + slaves) are consistent. This is one of the problems with M/S setups: they can differ by whatever has happened in the polling interval. The state of the master's searchers just doesn't enter the picture. Glad the problem is solved no matter what. Erick On Sat, Mar 1, 2014 at 10:26 PM, Arun Rangarajan wrote: >> The slave is polling the master after the interval specified in > solrconfig.xml. The slave essentially asks "has anything changed?" If so, the > changes are brought down to the slave. > Yes, I understand this, but if the master does not open a new searcher after > auto commits (which would indicate that the new index is not quite ready > yet) and if the master is still using the old index to serve search requests, I > would expect the slave to do the same as well. Or the slave should at least > not replicate or not open a new searcher, until the master opened a new > searcher. But that is just the way I see it and it may be wrong. > >> What's your polling interval on the slave anyway? Sounds like it's quite > frequent if you notice this immediately after the DIH starts. > No, polling interval is set to 1 hour, but the full import was set to run > at 1 AM. 
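Erick's segment-list model of the poll can be sketched in a few lines; this is a toy illustration with made-up function names (real Solr compares index version/generation numbers over HTTP rather than raw segment lists):

```python
# Toy model of the slave's polling decision: the slave replicates whenever
# the master's committed index differs from its own copy. Note that the
# master's searcher state never appears anywhere in the decision.

def should_replicate(slave_version: int, master_version: int) -> bool:
    """True when the on-disk (committed) indexes differ."""
    return master_version != slave_version

def poll(slave_version: int, master_version: int) -> int:
    """One polling cycle; returns the slave's index version afterwards."""
    if should_replicate(slave_version, master_version):
        return master_version  # changed files are fetched; slave catches up
    return slave_version

# An autocommit on the master bumps its committed version, so the next
# poll replicates even though the master never opened a new searcher.
print(poll(5, 7))  # slave catches up to version 7
```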
I believe a delete followed by a few docs got replicated after the > first few auto commits when the slave probably polled around 1:10 AM and > slave index had few docs for an hour before the next polling happened, > which is why the date query was returning empty results for exactly that > one hour. (The full index takes about 1.5 hours to finish.) > > Anyway the problem is now solved by specifying "clean=false" in the DIH > full import command. > > > On Sat, Mar 1, 2014 at 9:12 AM, Erick Erickson wrote: > >> bq: the slave anyway replicates the index after auto commits! (Is this >> desired behavior?) >> >> Absolutely it's desired behavior. The slave is polling the master >> after the interval >> specified in solrconfig.xml. The slave essentially asks "has anything >> changed?" If so, >> the changes are brought down to the slave. And by definition, commits >> change the index, >> especially if all docs have been deleted >> >> What's your polling interval on the slave anyway? Sounds like it's >> quite frequent if you >> notice this immediately after the DIH starts. >> >> Best, >> Erick >> >> On Fri, Feb 28, 2014 at 9:04 PM, Arun Rangarajan >> wrote: >> > I believe I figured out what the issue is. Even though we do not open a >> new >> > searcher on master during full import, the slave anyway replicates the >> > index after auto commits! (Is this desired behavior?) Since "clean=true" >> > this meant all the docs were deleted on slave and a partial index got >> > replicated! The reason only the date query did not return any results is >> > because recently created docs have higher doc IDs and we index by >> ascending >> > order of IDs! 
>> > >> > Thx all for your help. >> > >> > >> > >> > >> > On Fri, Feb 28, 2014 at 8:06 PM, Arun Rangarajan >> > wrote: >> > >> >> Thx, Erick and Chris. >> >> >> >> This is indeed very strange. Other queries which do not restrict by the >> >> date field are returning results, so the index is definitely not empty. >> Has >> >> it got something to do with the date query part, with NOW/DAY or >> something >> >> in here? >> >> first_publish_date:[NOW/DAY-33DAYS TO NOW/DAY-3DAYS] >> >> >> >> For now, I have set up a script to just log the number of docs on the >> >> slave every minute. Will monitor and report the findings. >> >> >> >> On Fri, Feb 28, 2014 at 6:49 PM, Chris Hostetter < >> hossman_luc...@fucit.org >> >> > wrote: >> >> >> >>> >> >>> : This is odd. The full import, I think, deletes the >> >>> : docs in the index when it starts. >> >>> >> >>> Yeah, if you are doing a full-import every day, and you don't want it to >> >>> delete all docs when it starts, you need to specify "clean=false" >> >>> >
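The clean=false fix discussed above amounts to one extra parameter on the DIH request; a sketch (host, port, and core name are placeholders):

```python
# Build the DataImportHandler full-import request with clean=false so the
# import does not delete existing docs before indexing. Host/core made up.
from urllib.parse import urlencode

params = {"command": "full-import", "clean": "false", "commit": "true"}
url = "http://master:8983/solr/mycore/dataimport?" + urlencode(params)
print(url)
```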
Re: SolrCloud plugin
Perhaps you just need StatsComponent? https://cwiki.apache.org/confluence/display/solr/The+Stats+Component On Sun, Mar 2, 2014 at 6:32 AM, Soumitra Kumar wrote: > In general, yes. > > I don't know how SolrCloud serves a distributed query: what does it do on the > shards, and what on the server serving the query? > On Mar 1, 2014 2:58 PM, "Furkan KAMACI" wrote: > >> Hi; >> >> Ok, I see that your aim is different. Do you want to implement something >> similar to the Map/Reduce paradigm? >> >> Thanks; >> Furkan KAMACI >> >> >> 2014-03-02 0:09 GMT+02:00 Soumitra Kumar : >> >> > I want to add a command to calculate the average of some numeric field. How >> do >> > I efficiently do this when data is split across multiple shards? I would >> > like to do the computation on each shard, and then aggregate the results. >> > >> > >> > On Sat, Mar 1, 2014 at 1:51 PM, Furkan KAMACI > > >wrote: >> > >> > > Hi; >> > > >> > > I've written a dashboard for such kind of purposes and I will make it >> > open >> > > source soon. You can get information about SolrCloud via SolrJ or you can >> > interact >> > > with ZooKeeper. Could you explain more about what you want to do? Which >> kind >> > > of results do you want to aggregate for your SolrCloud installation? >> > > >> > > Thanks; >> > > Furkan KAMACI >> > > >> > > >> > > 2014-03-01 23:39 GMT+02:00 Soumitra Kumar : >> > > >> > > > Hello, >> > > > >> > > > I want to write a plugin for a SolrCloud installation. >> > > > >> > > > I could not find where and how to aggregate the results from all >> > shards, >> > > > please give some pointers. >> > > > >> > > > Thanks, >> > > > -Soumitra. >> > > > >> > > >> > >> -- Regards, Shalin Shekhar Mangar.
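For the specific average example, the shard-then-aggregate pattern Soumitra describes (and which StatsComponent gives you out of the box, e.g. stats=true&stats.field=myfield) reduces to combining per-shard partial sums; a minimal sketch with made-up data:

```python
# Each shard computes a cheap partial result (sum, count) for the field;
# the node serving the query combines the partials into the final average,
# instead of shipping raw field values across the network.

def shard_partial(values):
    """What each shard would return for its local docs."""
    return (sum(values), len(values))

def aggregate(partials):
    """What the aggregating node does with the per-shard results."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else 0.0

shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]  # docs spread over 3 shards
partials = [shard_partial(v) for v in shards]
print(aggregate(partials))  # 3.5, same as averaging all six values at once
```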
Elevation and core create
Hi, sorry for the cross post, but I got no response in the dev group so assumed I posted in the wrong place. I am using Solr 3.6 and am trying to automate the deployment of cores with a custom elevate file. It is proving to be difficult: most of the files (schema, stop words etc.) support absolute paths, but elevate seems to need to be in either a conf directory as a sibling to data or in the data directory itself. I am able to achieve my goal by having a secondary process that places the file, but thought I would ask the group just in case I have missed the obvious. If I move to Solr 4, is it fixed there? I could also go down the route of extending the SolrCore create function to accept additional params and move the file into the defined data directory. Ideas? Thanks for your help David Stuart M +44(0) 778 854 2157 T +44(0) 845 519 5465 www.axistwelve.com Axis12 Ltd | The Ivories | 6/18 Northampton Street, London | N1 2HY | UK AXIS12 - Enterprise Web Solutions Reg Company No. 7215135 VAT No. 997 4801 60