RE: Solr working £ Symbol
>> We are using Solr to index our data. The data contains the £ symbol within
>> the text and for currency. When data is exported from the source system the
>> data contains the £ symbol; however, when the data is imported into Solr the
>> £ symbol is converted to .
>>
>> How can we keep the £ symbol as is when importing data?
>
> What tools are you using to look at Solr results? What tools are you using
> to send update data to Solr?

We have our application written in Python, which uses the UTF-8 charset. We
are using the Solr post tool to send data to Solr.

> Solr expects and delivers UTF-8 characters. If the data you're sending to
> Solr is using another character set, Java may not interpret it correctly.

The generated JSON file does show the £ symbol. The post tool, IMHO, will use
the system LANG setting, which is set to 'LANG=en_GB.UTF-8'.

> Conversely, if whatever you're using to look at Solr's results is also not
> expecting/displaying UTF-8, you might not be shown correct characters.

When we check the data using the Solr webapp, there also we cannot see the £
symbol.

Regards,
Mohan

Disclaimer: www.arrkgroup.com/EmailDisclaimer
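One quick sanity check before blaming Solr is to verify the exported JSON really is UTF-8 on the wire. A minimal Python sketch (the field name here is hypothetical, not from the thread):

```python
import json

# Write the export with explicit UTF-8 and unescaped characters,
# then confirm the pound sign survives an encode/decode round trip.
doc = {"id": "1", "price_display": "£9.99"}

payload = json.dumps(doc, ensure_ascii=False)   # keep '£' as a literal character
encoded = payload.encode("utf-8")               # the bytes actually sent to Solr

# '£' is two bytes in UTF-8: 0xC2 0xA3. If a hex dump of the file shows
# a lone 0xA3, it was written as Latin-1 and Solr will mangle it.
assert b"\xc2\xa3" in encoded
assert json.loads(encoded.decode("utf-8"))["price_display"] == "£9.99"
```

If the bytes check out, the next suspects are the post tool's JVM file.encoding and whatever renders the query results.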
Re: SolrCloud replication
Shalin,

Given the earlier response by Erick, wondering when this scenario occurs,
i.e. when the replica node recovers after a time period, wouldn't it
automatically recover all the missed updates by connecting to the leader?

My understanding from the responses so far is as below (assuming a
replication factor of 2 for simplicity purposes):

1. Client sends an update request, which is received by the shard leader.
2. The leader, once it applies the update on its own node, sends the update
   to the unavailable replica node.
3. The leader keeps trying to send the update to the replica node.
4. After a while the leader gives up and communicates to the client (not
   sure what kind of message the client receives in this case?).
5. The replica node recovers, realises that it needs to catch up, and hence
   receives all the missed updates in recovery mode.

Correct me if I am wrong in my understanding. Thnx!!

On 3 May 2018 at 04:10:12, Shalin Shekhar Mangar (shalinman...@gmail.com)
wrote:

The min_rf parameter does not fail indexing. It only tells you how many
replicas received the live update. So if the value is less than what you
wanted then it is up to you to retry the update later.

On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie wrote:
> Hi,
>
> Good Morning!!
>
> In the case of a SolrCloud setup with sharding and replication in place,
> when a document is sent for indexing, what happens when only the shard
> leader has indexed the document, but the replicas failed, for whatever
> reason? Will the document be resent by the leader to the replica shards to
> index the document after some time, or how is this scenario addressed?
>
> Also, given the above context, when I set the value of the min_rf
> parameter to say 2, does that mean the calling application will be
> informed that the indexing failed?

--
Regards,
Shalin Shekhar Mangar.
Re: SolrCloud replication
The min_rf parameter does not fail indexing. It only tells you how many
replicas received the live update. So if the value is less than what you
wanted then it is up to you to retry the update later.

On Wed, May 2, 2018 at 3:33 PM, Greenhorn Techie wrote:
> Hi,
>
> Good Morning!!
>
> In the case of a SolrCloud setup with sharding and replication in place,
> when a document is sent for indexing, what happens when only the shard
> leader has indexed the document, but the replicas failed, for whatever
> reason? Will the document be resent by the leader to the replica shards to
> index the document after some time, or how is this scenario addressed?
>
> Also, given the above context, when I set the value of the min_rf
> parameter to say 2, does that mean the calling application will be
> informed that the indexing failed?

--
Regards,
Shalin Shekhar Mangar.
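Shalin's point, that min_rf only reports the achieved replication factor and leaves retries to the client, can be sketched like this (a minimal illustration; `send_update` is a stand-in for a real Solr client call that returns the achieved rf):

```python
import time

def index_with_retry(send_update, doc, min_rf=2, max_retries=3, backoff=0.01):
    """Retry an update until the achieved replication factor (the 'rf'
    value Solr reports when min_rf is requested) meets the target.
    send_update(doc) is assumed to return that achieved rf as an int."""
    for attempt in range(max_retries):
        achieved_rf = send_update(doc)
        if achieved_rf >= min_rf:
            return achieved_rf                # enough live replicas got it
        time.sleep(backoff * (2 ** attempt))  # replica may still be recovering
    raise RuntimeError("update never reached %d replicas" % min_rf)

# Simulate a replica coming back between attempts: rf=1, then rf=2.
results = iter([1, 2])
assert index_with_retry(lambda d: next(results), {"id": "1"}) == 2
```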
Re: Load balanced Solr cluster not updating leader
On 5/2/2018 6:23 PM, Erick Erickson wrote:
> Perhaps this is: SOLR-11660?

That definitely looks like the problem that Michael describes. And it
indicates that restarting Solr instances after restore is a workaround.

The issue also says something that might indicate that a collection reload
after restore would also fix it, but I can't be sure about that part. If it
works, that would be far less disruptive than a Solr restart.

I've tried to reproduce the issue with the cloud example on 7.3.0, but I
can't get the collection restore to work right and give me two replicas.

Thanks,
Shawn
Re: Load balanced Solr cluster not updating leader
Perhaps this is: SOLR-11660?

On Wed, May 2, 2018 at 4:46 PM, Shawn Heisey wrote:
> On 5/2/2018 3:52 PM, Michael B. Klein wrote:
>> It works ALMOST perfectly. The restore operation reports success, and if
>> I look at the UI, everything looks great in the Cloud graph view. All
>> green, one leader and two other active instances per collection.
>>
>> But once we start updating, we run into problems. The two NON-leaders in
>> each collection get the updates, but the leader never does. Since the
>> instances are behind a round robin load balancer, every third query hits
>> an out-of-date core, with unfortunate (for our near-real-time indexing
>> dependent app) results.
>
> That is completely backwards from what I would expect in a problem
> report. The leader coordinates all indexing, so if the two other
> replicas are getting the updates, that means that at least part of the
> functionality of the leader replica *IS* working.
>
> Side FYI: Unless you're using preferLocalShards=true, Solr will actually
> load balance your load balanced requests. If your external load
> balancer sends queries to replica1, replica1 may forward the request to
> replica3 because of SolrCloud's own internal load balancing. The
> preferLocalShards parameter will keep that from happening *if* the
> machine receiving the query has the replicas required to satisfy the
> query.
>
>> Reloading the collection doesn't seem to help, but if I use the
>> Collections API to DELETEREPLICA the leader of each collection and
>> follow it with an ADDREPLICA, everything syncs up (with a new leader)
>> and stays in sync from there on out.
>>
>> I don't know what to look for in my settings or my logs to diagnose or
>> try to fix this issue. It only affects collections that have been
>> restored from backup. Any suggestions or guidance would be a big help.
>
> I don't know what to look for in the logs either, but the first thing to
> check for is any messages at WARN or ERROR logging levels. These kinds
> of messages should also show up in the admin UI logging tab, but
> recovering the full text of those messages is much easier in the logfile
> than the admin UI.
>
> Have you tried restarting the Solr instances after restoring the
> collection? This shouldn't be required, but at this point I'm hoping to
> at least get you limping along, even if it requires steps that are
> obvious indications of a bug.
>
> Since you're running 6.6 and 6.x is in maintenance mode, it's not likely
> that any bugs revealed will be fixed on 6.x, but maybe we can track it
> down and see if it's still a problem in 7.x. How much pain will it
> cause you to get upgraded?
>
> Also FYI: Two zookeeper servers is actually LESS fault tolerant than
> only having one, because if either server goes down, quorum is lost.
> You need at least three for fault tolerance.
>
> Thanks,
> Shawn
Re: Load balanced Solr cluster not updating leader
On 5/2/2018 3:52 PM, Michael B. Klein wrote:
> It works ALMOST perfectly. The restore operation reports success, and if I
> look at the UI, everything looks great in the Cloud graph view. All green,
> one leader and two other active instances per collection.
>
> But once we start updating, we run into problems. The two NON-leaders in
> each collection get the updates, but the leader never does. Since the
> instances are behind a round robin load balancer, every third query hits
> an out-of-date core, with unfortunate (for our near-real-time indexing
> dependent app) results.

That is completely backwards from what I would expect in a problem report.
The leader coordinates all indexing, so if the two other replicas are
getting the updates, that means that at least part of the functionality of
the leader replica *IS* working.

Side FYI: Unless you're using preferLocalShards=true, Solr will actually
load balance your load balanced requests. If your external load balancer
sends queries to replica1, replica1 may forward the request to replica3
because of SolrCloud's own internal load balancing. The preferLocalShards
parameter will keep that from happening *if* the machine receiving the
query has the replicas required to satisfy the query.

> Reloading the collection doesn't seem to help, but if I use the
> Collections API to DELETEREPLICA the leader of each collection and follow
> it with an ADDREPLICA, everything syncs up (with a new leader) and stays
> in sync from there on out.
>
> I don't know what to look for in my settings or my logs to diagnose or
> try to fix this issue. It only affects collections that have been restored
> from backup. Any suggestions or guidance would be a big help.

I don't know what to look for in the logs either, but the first thing to
check for is any messages at WARN or ERROR logging levels. These kinds of
messages should also show up in the admin UI logging tab, but recovering
the full text of those messages is much easier in the logfile than the
admin UI.

Have you tried restarting the Solr instances after restoring the
collection? This shouldn't be required, but at this point I'm hoping to at
least get you limping along, even if it requires steps that are obvious
indications of a bug.

Since you're running 6.6 and 6.x is in maintenance mode, it's not likely
that any bugs revealed will be fixed on 6.x, but maybe we can track it down
and see if it's still a problem in 7.x. How much pain will it cause you to
get upgraded?

Also FYI: Two zookeeper servers is actually LESS fault tolerant than only
having one, because if either server goes down, quorum is lost. You need at
least three for fault tolerance.

Thanks,
Shawn
Re: Too many commits
On 5/2/2018 11:45 AM, Patrick Recchia wrote:
> Is there any logging I can turn on to know when a commit happens and/or
> when a segment is flushed?

The normal INFO-level logging that Solr ships with will log all commits.
It probably doesn't log segment flushes unless they happen as a result of a
commit, though. The infoStream logging would have that information.

Your autoCommit settings are ensuring that commitWithin is never going to
actually cause a commit. Your interval for autoCommit is 60000 (one
minute), commitWithin is 500000 (a little over eight minutes). The
autoCommit has openSearcher set to true, so there will always be a commit
with a new searcher occurring within one minute after an update is sent,
and commitWithin will never be needed.

Here's what I think I would try: On autoCommit, set openSearcher to false.
If you want less than an eight-minute window for document visibility,
reduce commitWithin to 120000. Increase ramBufferSizeMB to 256 or 512,
which might require an increase in heap size as well. Instead of using
commitWithin, you could configure autoSoftCommit with a maxTime of 120000.

Here's some additional info about commits:

https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

The title says "SolrCloud" but the concepts are equally applicable when not
running in cloud mode.

Thanks,
Shawn
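The interaction Shawn describes is easy to verify with arithmetic. The list archive has garbled the intervals, but going by the parentheticals they are 60000 ms (one minute) for autoCommit and 500000 ms (a little over eight minutes) for commitWithin. A small sketch:

```python
def first_visible_commit(auto_commit_ms, commit_within_ms):
    """Return which mechanism makes an update searchable first, assuming
    autoCommit has openSearcher=true so its commits open a new searcher."""
    if auto_commit_ms <= commit_within_ms:
        return ("autoCommit", auto_commit_ms)
    return ("commitWithin", commit_within_ms)

# With the settings from the thread, commitWithin never fires first:
assert first_visible_commit(60000, 500000) == ("autoCommit", 60000)

# With openSearcher=false, autoCommit no longer affects visibility
# (modeled here as an infinite interval), so a 120000 ms commitWithin
# would govern when documents become searchable instead.
assert first_visible_commit(float("inf"), 120000) == ("commitWithin", 120000)
```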
Re: Way for DataImportHandler to use bind variables
On 5/2/2018 1:03 PM, Mike Konikoff wrote:
> Is there a way to configure the DataImportHandler to use bind variables
> for the entity queries? To improve database performance.

Can you clarify where these variables would come from and precisely what
you want to do?

From what I can tell, you're talking about ? placeholders in a
PreparedStatement. Is that correct? This works well for situations where
you are writing JDBC code, but DIH is a configuration-based setup where the
user cannot write the JDBC code.

The only DIH-related code where PreparedStatement or prepareStatement
appears is in a test for DIH, not in DIH code itself. I don't think DIH has
any support for what you want, but until you clarify exactly what your
intent is, I can't say for sure.

Thanks,
Shawn
Re: Faceting question
On 5/2/2018 2:56 PM, Weffelmeyer, Stacie wrote:
> Question on faceting. We have a dynamicField that we want to facet
> on. Below is the field and the type of information that field generates.
>
> cid:image001.png@01D3E22D.DE028870

This image is not available. This mailing list will almost always strip
attachments from email that it receives.

> "customMetadata":["{\"controlledContent\":{\"metadata\":{\"programs\":[\"program1\"],\"departments\":[\"department1\"],\"locations\":[\"location1\"],\"functions\":[\"function1\"],\"customTags\":[\"customTag1\",\"customTag2\"],\"corporate\":false,\"redline\":false},\"who\":{\"lastUpdateDate\":\"2018-04-26T14:35:02.268Z\",\"creationDate\":\"2018-04-26T14:35:01.445Z\",\"createdBy\":38853},\"clientOwners\":[38853],\"clientLastUpdateDate\":\"2018-04-25T21:15:06.000Z\",\"clientCreationDate\":\"2018-04-25T20:58:34.000Z\",\"clientContentId\":\"DOC-8030\",\"type\":{\"applicationId\":2574,\"code\":\"WI\",\"name\":\"Work Instruction\",\"id\":\"5ac3d4d111570f0047a8ceb9\"},\"status\":\"active\",\"version\":1}}"],

I do not know what this is. It looks a little like JSON. But if it's JSON,
there are a lot of escaped quotes in it, and I don't really know what I'm
looking at.

> It will always have customMetadata.controlledContent.metadata
>
> Then from metadata, it could be anything, which is why it is a
> dynamicField.
>
> In this example there is
>
> customMetadata.controlledContent.metadata.programs
> customMetadata.controlledContent.metadata.departments
> customMetadata.controlledContent.metadata.locations

Solr does not have the concept of a nested data type. So how are you
getting from all that text above to period-delimited strings in a
hierarchy? If you're using some kind of custom plugin for Solr to have it
support something it doesn't do out of the box, you're probably going to
need to talk to the author of that plugin.

Solr's dynamicField support is only dynamic in the sense that the precise
field name is not found in the schema. The field name is dynamic. When it
comes to what's IN the field, it doesn't matter whether it's a dynamic
field or not.

> If I enable faceting, it will do so with the field customMetadata. But
> it doesn't help because it separates every space as a term. But ideally I
> want to facet on customMetadata.controlledContent.metadata. Doing so
> brings back no facets.
>
> Is this possible? How can we best accomplish this?

We will need to understand exactly what you are indexing, what's in your
schema, the exact query requests you are sending, and what you are
expecting back.

Thanks,
Shawn
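One common way to get from a JSON blob like the one quoted above to facetable period-delimited field names is to flatten it client-side before indexing. This is a hedged sketch of that pre-processing step, not anything Solr does out of the box:

```python
import json

def flatten(obj, prefix=""):
    """Flatten nested dicts into {'a.b.c': value} pairs so each leaf can
    be indexed as its own (dynamic) Solr field and faceted on directly."""
    fields = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            fields.update(flatten(value, path))
        else:
            fields[path] = value
    return fields

# A trimmed-down version of the blob from the thread:
raw = '{"controlledContent": {"metadata": {"programs": ["program1"]}}}'
fields = flatten({"customMetadata": json.loads(raw)})
assert fields["customMetadata.controlledContent.metadata.programs"] == ["program1"]
```

Each flattened key would then need a matching dynamicField pattern (e.g. a string type for the leaf values) for faceting to behave as Stacie expects.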
Re: Indexing throughput
On 5/2/2018 10:58 AM, Greenhorn Techie wrote:
> The current hardware profile for our production cluster is 20 nodes, each
> with 24 cores and 256GB memory. Data being indexed is very structured in
> nature and is about 30 columns or so, out of which half of them are
> categorical with a defined list of values. The expected peak indexing
> throughput is to be about *50000* documents per second (expected to be
> done at off-peak hours so that search requests will be minimal during
> this time) and the average throughput around *10000* documents (normal
> business hours).
>
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect the performance of indexing apart
> from the above hardware characteristics? I understand that its very
> difficult to provide any guidance unless a prototype is done. But
> wondering what are the considerations and dependencies we need to be
> aware of and whether our throughput expectations are realistic or not.

50000 docs per second is not a slow indexing rate. It has been achieved,
and as Erick noted, surpassed by a very large margin. Whether you can get
there with your planned hardware on your index is not a question that I can
answer. If I had to guess, I think that as long as the source system can
push the data that fast, it SHOULD be possible to create an indexing system
that can do it.

The important thing to do for fast indexing with Solr is to have a lot of
threads or processes indexing all at the same time. Indexing with a single
thread will not achieve the fastest possible performance.

Since you're planning SolrCloud, you should put some effort into having
your indexing system be aware of your cluster state and the shard routing
so that it can send indexing requests directly to shard leaders. Indexing
is faster if Solr doesn't need to forward requests. The SolrJ client named
"CloudSolrClient" is always aware of the clusterstate. So if you can use
that, updates can always be sent to the leaders.

Thanks,
Shawn
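Shawn's "lots of threads, batched updates" advice can be sketched in a few lines. This is an illustrative skeleton only: `send_batch` stands in for a real Solr client call, and with SolrJ's CloudSolrClient each batch would additionally be routed straight to the shard leaders:

```python
from concurrent.futures import ThreadPoolExecutor

def index_in_parallel(docs, send_batch, batch_size=1000, threads=8):
    """Split docs into batches and index the batches concurrently.
    send_batch(batch) is a stand-in for a real Solr update call; it is
    assumed to return the number of documents it indexed."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(send_batch, batches))

# Using len() as a dummy sender: 2500 docs -> batches of 1000, 1000, 500.
indexed = index_in_parallel([{"id": str(i)} for i in range(2500)], len)
assert indexed == 2500
```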
Re: Introducing a stopword in a query causes ExtendedDismaxQueryParser to produce a radically different parsed query
This is a problem that we've noted too. This blog post discusses the
underlying cause:

https://opensourceconnections.com/blog/2018/02/20/edismax-and-multiterm-synonyms-oddities/

Hope that helps

On Wed, May 2, 2018 at 3:07 PM Chris Wilt wrote:
> I began with a 7.2.1 Solr instance using the techproducts sample data.
> Next, I added "a" as a stopword (there were originally no stopwords).
>
> I tried two queries: "x a b" and "x b". Here are the raw query parameters:
>
> q=x b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0
> features^10.0&defType=edismax
>
> and
>
> q=x a b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0
> features^10.0&defType=edismax
>
> The idea is that I want different weights for the different fields, and I
> want to be able to take the score of each term from its best field, i.e.
> score the "x" from its match against the "cat" field and the "b" against
> the "features" field.
>
> When I have "x b" I get this behavior exactly, with the parsed query as
> follows:
>
> +(((name:x)^0.75 | manu:x | (features:x)^10.0 | (cat:x)^3.0)
> ((name:b)^0.75 | manu:b | (features:b)^10.0 | (cat:b)^3.0))
>
> When I use "x a b" I instead get:
>
> +((name:x name:b)^0.75 | (manu:x manu:b) | (features:x features:b)^10.0 |
> (cat:x cat:a cat:b)^3.0)
>
> With the "x a b" query, suppose document 1 matches "x" in "features" and
> matches "b" in "cat". This document will get a single score based upon
> either its "x" or its "b", but the score will not be the sum, as would
> have been the case had the query been "x b".
>
> How do I get the edismax parser to behave the same for queries with
> stopwords as it does without stopwords, keeping the behavior constant for
> queries with no stopwords?
>
> I tried using the stopwords parameter, but I get the same results with
> that parameter taking the value of true or false. I also tried using the
> tie parameter, but the tie parameter seems to change a max to a sum (it
> sums up the scores of each field for each query term, rather than taking
> the max of all fields for how well they match a query term).

--
CTO, OpenSource Connections
Author, Relevant Search
http://o19s.com/doug
Load balanced Solr cluster not updating leader
Hi all,

I've encountered a reproducible and confusing issue with our Solr 6.6
cluster. (Updating to 7.x is an option, but not an immediate one.)

This is in our staging environment, running on AWS. To save money, we scale
our entire stack down to zero instances every night and spin it back up
every morning. Here's the process:

SCALE DOWN:
1) Commit & optimize all collections.
2) Back up each collection to a shared volume (using the Collections API).
3) Spin down all (3) Solr instances.
4) Spin down all (2) zookeeper instances.

SPIN UP:
1) Spin up zookeeper instances; wait for the instances to find each other
   and the ensemble to stabilize.
2) Spin up Solr instances; wait for them all to stabilize and for zookeeper
   to recognize them as live nodes.
3) Restore each collection (using the Collections API).

It works ALMOST perfectly. The restore operation reports success, and if I
look at the UI, everything looks great in the Cloud graph view. All green,
one leader and two other active instances per collection.

But once we start updating, we run into problems. The two NON-leaders in
each collection get the updates, but the leader never does. Since the
instances are behind a round robin load balancer, every third query hits an
out-of-date core, with unfortunate (for our near-real-time indexing
dependent app) results.

Reloading the collection doesn't seem to help, but if I use the Collections
API to DELETEREPLICA the leader of each collection and follow it with an
ADDREPLICA, everything syncs up (with a new leader) and stays in sync from
there on out.

I don't know what to look for in my settings or my logs to diagnose or try
to fix this issue. It only affects collections that have been restored from
backup. Any suggestions or guidance would be a big help.

Thanks,
Michael

--
Michael B. Klein
Lead Developer, Repository Development and Administration
Northwestern University Libraries
Faceting question
Hi,

Question on faceting. We have a dynamicField that we want to facet on.
Below is the field and the type of information that field generates.

[cid:image001.png@01D3E22D.DE028870]

"customMetadata":["{\"controlledContent\":{\"metadata\":{\"programs\":[\"program1\"],\"departments\":[\"department1\"],\"locations\":[\"location1\"],\"functions\":[\"function1\"],\"customTags\":[\"customTag1\",\"customTag2\"],\"corporate\":false,\"redline\":false},\"who\":{\"lastUpdateDate\":\"2018-04-26T14:35:02.268Z\",\"creationDate\":\"2018-04-26T14:35:01.445Z\",\"createdBy\":38853},\"clientOwners\":[38853],\"clientLastUpdateDate\":\"2018-04-25T21:15:06.000Z\",\"clientCreationDate\":\"2018-04-25T20:58:34.000Z\",\"clientContentId\":\"DOC-8030\",\"type\":{\"applicationId\":2574,\"code\":\"WI\",\"name\":\"Work Instruction\",\"id\":\"5ac3d4d111570f0047a8ceb9\"},\"status\":\"active\",\"version\":1}}"],

It will always have customMetadata.controlledContent.metadata

Then from metadata, it could be anything, which is why it is a
dynamicField. In this example there is:

customMetadata.controlledContent.metadata.programs
customMetadata.controlledContent.metadata.departments
customMetadata.controlledContent.metadata.locations
etc.

If I enable faceting, it will do so with the field customMetadata. But it
doesn't help because it separates every space as a term. But ideally I want
to facet on customMetadata.controlledContent.metadata. Doing so brings back
no facets.

Is this possible? How can we best accomplish this?

Thank you,
Stacie Weffelmeyer
World Wide Technology, Inc.
Re: Median Date
All,

Percentiles only work with numbers, not dates. If I use the ms function, I
can get the number of milliseconds between NOW and the import date. Then we
can use that result in calculating the median age of the documents using
percentiles:

rows=0&stats=true&stats.field={!tag=piv1 percentiles='50' func}ms(NOW,importDate)&facet=true&facet.pivot={!stats=piv1}status

I hope this helps someone else :) Also, let me know if there's a better way
to do this.

Cheers,
Jim

On Tuesday, May 1, 2018 03:27:10 PM PDT, Jim Freeby wrote:

All,

We have a dateImported field in our schema. I'd like to generate a
statistic showing the median dateImported (actually we want the median age
of the documents, based on the dateImported value).

I have other stats that calculate the median value of numbers (like price).
This was achieved with something like:

rows=0&stats=true&stats.field={!tag=piv1 percentiles='50'}price&facet=true&facet.pivot={!stats=piv1}status

I have not found a way to calculate the median dateImported. The mean
works, but we need the median.

Any help would be appreciated.

Cheers,
Jim
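The list archive strips text around `=` signs, so Jim's parameter string arrives mangled. Assembled as a dict, with the parameter names being a best-effort reconstruction (stats.field with a percentiles local param, pivoted on `status`), the median-age request looks roughly like:

```python
from urllib.parse import urlencode

# Best-effort reconstruction of the request from the thread: 50th
# percentile over ms(NOW,importDate), attached to a pivot on 'status'.
params = {
    "rows": 0,
    "stats": "true",
    "stats.field": "{!tag=piv1 percentiles='50' func}ms(NOW,importDate)",
    "facet": "true",
    "facet.pivot": "{!stats=piv1}status",
}
query = urlencode(params)

# urlencode percent-escapes the local-params syntax for the wire:
assert "percentiles%3D%2750%27" in query
```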
RE: User queries end up in filterCache if faceting is enabled
Hello,

Anyone here able to reproduce this oddity? It shows up in all our
collections once we enable the stats page to show filterCache entries. Is
this normal? Am I completely missing something?

Thanks,
Markus

-----Original message-----
> From: Markus Jelsma
> Sent: Tuesday 1st May 2018 17:32
> To: Solr-user
> Subject: User queries end up in filterCache if faceting is enabled
>
> Hello,
>
> We noticed the number of entries in the filterCache to be higher than we
> expected. Using showItems="1024", something unexpected was listed as
> entries of the filterCache: the complete Query.toString() of our user
> queries, massive entries, a lot of them.
>
> We also spotted entries for all the fields we facet on, even though we
> don't use them as filters, but that is caused by facet.method=enum, and
> should be expected, right?
>
> Now, the user query entries are not expected. In the simplest setup,
> searching for something and only enabling the facet engine with facet=true
> causes it to appear in the cache as an entry. The following queries:
>
> http://localhost:8983/solr/search/select?q=content_nl:nog&facet=true
> http://localhost:8983/solr/search/select?q=*:*&facet=true
>
> become listed as:
>
> CACHE.searcher.filterCache.item_*:*:
> org.apache.solr.search.BitDocSet@70051ee0
>
> CACHE.searcher.filterCache.item_content_nl:nog:
> org.apache.solr.search.BitDocSet@13150cf6
>
> This is on 7.3, but 7.2.1 does this as well.
>
> So, should I expect this? Can I disable this? Bug?
>
> Thanks,
> Markus
Re: Solr Heap usage
Thanks Shawn for the inputs, which will definitely help us to scale our
cluster better.

Regards

On 2 May 2018 at 18:15:12, Shawn Heisey (apa...@elyograg.org) wrote:

On 5/1/2018 5:33 PM, Greenhorn Techie wrote:
> Wondering what are the considerations to be aware to arrive at an optimal
> heap size for Solr JVM? Though I did discuss this on the IRC, I am still
> unclear on how Solr uses the JVM heap space. Are there any pointers to
> understand this aspect better?

I'm one of the people you've been chatting with on IRC. I also wrote the
wiki page that Susheel has recommended to you.

> Given that Solr requires an optimally configured heap, so that the
> remaining unused memory can be used for OS disk cache, I wonder how to
> best configure Solr heap. Also, on the IRC it was discussed that having
> 31GB of heap is better than having 32GB due to Java's internal usage of
> heap. Can anyone guide further on heap configuration please?

With the index size you mentioned on IRC, it's very difficult to project
how much heap you're going to need. Actually setting up a system, putting
data on it, and firing real queries at it may be the only way to be sure.

The only concrete advice I can give you with the information available is
this: Install as much memory as you can. It is extremely unlikely that you
would ever have too much memory when you're dealing with terabyte-scale
indexes.

Heavy indexing (which you have mentioned as a requirement in another
thread) will tend to require a larger heap.

Thanks,
Shawn
Re: Indexing throughput
Thanks Walter and Erick for the valuable suggestions. We shall try out
various values for shards, as well as the other tuning metrics I discussed
in various threads earlier.

Kind Regards

On 2 May 2018 at 18:24:31, Erick Erickson (erickerick...@gmail.com) wrote:

I've seen 1.5 M docs/second. Basically the indexing throughput is gated by
two things:

1> the number of shards. Indexing throughput essentially scales up
reasonably linearly with the number of shards.

2> the indexing program that pushes data to Solr. Before thinking Solr is
the bottleneck, check how fast your ETL process is pushing docs.

This pre-supposes using SolrJ and CloudSolrClient for the final push to
Solr. This pre-buckets the updates and sends the updates for each shard to
the shard leader, thus reducing the amount of work Solr has to do.

If you use SolrJ, you can easily do <2> above by just commenting out the
single call that pushes the docs to Solr in your program.

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood wrote:
> We have a similar sized cluster, 32 nodes with 36 processors and 60 GB
> RAM each (EC2 c4.8xlarge). The collection is 24 million documents with
> four shards. The cluster is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million
> documents per minute with 64 threads. We do not use the cloud-smart SolrJ
> client. We just send all the batches to the load balancer and let Solr
> sort it out.
>
> You are looking for 3 million documents per minute. You will just have
> to test that.
>
> I haven't tested it, but indexing should speed up linearly with the
> number of shards, because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes,
>> each with 24 cores and 256GB memory. Data being indexed is very
>> structured in nature and is about 30 columns or so, out of which half of
>> them are categorical with a defined list of values. The expected peak
>> indexing throughput is to be about *50000* documents per second
>> (expected to be done at off-peak hours so that search requests will be
>> minimal during this time) and the average throughput around *10000*
>> documents (normal business hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve
>> the desired throughput? What factors affect the performance of indexing
>> apart from the above hardware characteristics? I understand that its
>> very difficult to provide any guidance unless a prototype is done. But
>> wondering what are the considerations and dependencies we need to be
>> aware of and whether our throughput expectations are realistic or not.
>>
>> Thanks
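The asterisked figures in the quoted message have lost digits in the archive; taking the peak target as 50000 documents per second, the numbers line up exactly with Walter's "3 million documents per minute":

```python
target_per_second = 50000          # assumed peak target from the thread
per_minute = target_per_second * 60
assert per_minute == 3_000_000     # Walter's "3 million documents per minute"

# Walter's measured rate: ~1 million docs/minute with 64 threads, so
# roughly a 3x gap to close via more shards and/or more indexing threads.
measured_per_minute = 1_000_000
assert per_minute / measured_per_minute == 3.0
```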
Introducing a stopword in a query causes ExtendedDismaxQueryParser to produce a radically different parsed query
I began with a 7.2.1 Solr instance using the techproducts sample data.
Next, I added "a" as a stopword (there were originally no stopwords).

I tried two queries: "x a b" and "x b". Here are the raw query parameters:

q=x b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0 features^10.0&defType=edismax

and

q=x a b&fl=id,score,price&sort=score desc&qf=name^0.75 manu cat^3.0 features^10.0&defType=edismax

The idea is that I want different weights for the different fields, and I
want to be able to take the score of each term from its best field, i.e.
score the "x" from its match against the "cat" field and the "b" against
the "features" field.

When I have "x b" I get this behavior exactly, with the parsed query as
follows:

+(((name:x)^0.75 | manu:x | (features:x)^10.0 | (cat:x)^3.0)
((name:b)^0.75 | manu:b | (features:b)^10.0 | (cat:b)^3.0))

When I use "x a b" I instead get:

+((name:x name:b)^0.75 | (manu:x manu:b) | (features:x features:b)^10.0 |
(cat:x cat:a cat:b)^3.0)

With the "x a b" query, suppose document 1 matches "x" in "features" and
matches "b" in "cat". This document will get a single score based upon
either its "x" or its "b", but the score will not be the sum, as would
have been the case had the query been "x b".

How do I get the edismax parser to behave the same for queries with
stopwords as it does without stopwords, keeping the behavior constant for
queries with no stopwords?

I tried using the stopwords parameter, but I get the same results with
that parameter taking the value of true or false. I also tried using the
tie parameter, but the tie parameter seems to change a max to a sum (it
sums up the scores of each field for each query term, rather than taking
the max of all fields for how well they match a query term).
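The scoring difference Chris describes can be illustrated with a toy max/sum model. The field scores below are made-up numbers, and this only mimics the shape of the two parsed queries, not Lucene's actual scoring:

```python
# Per-field scores for a hypothetical document that matches "x" in
# 'features' and "b" in 'cat' (all other field scores zero). Made-up.
scores = {
    "x": {"name": 0.0, "manu": 0.0, "features": 4.0, "cat": 0.0},
    "b": {"name": 0.0, "manu": 0.0, "features": 0.0, "cat": 2.0},
}

# Shape of the "x b" parse: a dismax (max over fields) per term, summed.
per_term = sum(max(fields.values()) for fields in scores.values())

# Shape of the "x a b" parse: per-field sums over terms, then one max.
field_names = ["name", "manu", "features", "cat"]
per_field = max(sum(scores[t][f] for t in scores) for f in field_names)

assert per_term == 6.0   # both matches contribute to the score
assert per_field == 4.0  # only the single best field contributes
```

This is why a document matching different terms in different fields loses score once the stopword flips edismax into the per-field query shape.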
Way for DataImportHandler to use bind variables
Is there a way to configure the DataImportHandler to use bind variables for
the entity queries? This is to improve database performance.

Thanks,
Mike
Re: Too many commits
You can turn on "infostream", but that is _very_ voluminous. The regular Solr logs at INFO level should show commits, though. On Wed, May 2, 2018 at 10:45 AM, Patrick Recchia wrote: > Shawn, > thank you very much for your answer. > > > On Wed, May 2, 2018 at 6:27 PM, Shawn Heisey wrote: >> On 5/2/2018 4:54 AM, Patrick Recchia wrote: >> > I'm seeing way too many commits on our solr cluster, and I don't know >> why. >> >> Are you sure there are commits happening? Do you have logs actually >> saying that a commit is occurring? The creation of a new segment does >> not necessarily mean a commit happened -- this can happen even without a >> commit. >> > > You're right, I assumed a new segment would be created only as part of a > commit; but I realize now that there can be other situations. > > Is there any logging I can turn on to know when a commit happens and/or > when a segment is flushed? > > I would be very interested in that > I've already enabled InfoStream logging from the IndexWriter, but have > found nothing yet there to help me understand that > > > >> > - IndexConfig is set to autoCommit every minute: >> > >> > ${solr.autoCommit.maxTime:6} < >> > openSearcher>true >> > >> > (solr.autoCommit.maxTime is not set) >> >> It's recommended to set openSearcher to false on autoCommit. Do you >> have autoSoftCommit configured? >> > > autoSoftCommit is left at its default '-1' (which means infinity, I > suppose). > > > >> >> > There is nothing else customized (when it comes to IndexWriter, at least) >> > within solrconfig.xml >> > >> > The data is sent without commit, but with commitWithin=50 ms. >> > >> > All that said, I would have expected a rate of about 1 segment created >> per >> > minute; of about 100MB. >> >> One of the events that can cause a new segment to be flushed is the ram >> buffer filling up. Solr defaults to a ramBufferSizeMB value of 100. 
>> But that does not translate to a segment size of 100MB -- it's merely >> the size of the ram buffer that Lucene uses for all the work related to >> building a segment. A segment resulting from a full memory buffer is >> going to be smaller than the buffer. I do not know how MUCH smaller, or >> what causes variations in that size. >> >> The general advice is to leave the buffer size alone. But with the high >> volume you've got, you might want to increase it so segments are not >> flushed as frequently. Be aware that increasing it will have an impact >> on how much heap memory gets used. Every Solr core (shard replica in >> SolrCloud terminology) that does indexing is going to need one of these >> ram buffers. >> > > I will definitely investigate this ramBufferSizeMB. > And, see through lucene code when a segment is flushed. > > Again, many thanks. > Patrick
Re: Too many commits
Shawn, thank you very much for your answer. On Wed, May 2, 2018 at 6:27 PM, Shawn Heisey wrote: > On 5/2/2018 4:54 AM, Patrick Recchia wrote: > > I'm seeing way too many commits on our solr cluster, and I don't know > why. > > Are you sure there are commits happening? Do you have logs actually > saying that a commit is occurring? The creation of a new segment does > not necessarily mean a commit happened -- this can happen even without a > commit. > You're right, I assumed a new segment would be created only as part of a commit; but I realize now that there can be other situations. Is there any logging I can turn on to know when a commit happens and/or when a segment is flushed? I would be very interested in that I've already enabled InfoStream logging from the IndexWriter, but have found nothing yet there to help me understand that > > - IndexConfig is set to autoCommit every minute: > > > > ${solr.autoCommit.maxTime:6} < > > openSearcher>true > > > > (solr.autoCommit.maxTime is not set) > > It's recommended to set openSearcher to false on autoCommit. Do you > have autoSoftCommit configured? > autoSoftCommit is left at its default '-1' (which means infinity, I suppose). > > > There is nothing else customized (when it comes to IndexWriter, at least) > > within solrconfig.xml > > > > The data is sent without commit, but with commitWithin=50 ms. > > > > All that said, I would have expected a rate of about 1 segment created > per > > minute; of about 100MB. > > One of the events that can cause a new segment to be flushed is the ram > buffer filling up. Solr defaults to a ramBufferSizeMB value of 100. > But that does not translate to a segment size of 100MB -- it's merely > the size of the ram buffer that Lucene uses for all the work related to > building a segment. A segment resulting from a full memory buffer is > going to be smaller than the buffer. I do not know how MUCH smaller, or > what causes variations in that size. 
> > The general advice is to leave the buffer size alone. But with the high > volume you've got, you might want to increase it so segments are not > flushed as frequently. Be aware that increasing it will have an impact > on how much heap memory gets used. Every Solr core (shard replica in > SolrCloud terminology) that does indexing is going to need one of these > ram buffers. > I will definitely investigate this ramBufferSizeMB. And, see through lucene code when a segment is flushed. Again, many thanks. Patrick
Re: Shard size variation
You can always increase the maximum segment size. For large indexes that should reduce the number of segments. But watch your indexing stats; I can't predict the consequences of bumping it to 100G, for instance. I'd _expect_ bursty I/O when those large segments start to be created or merged. You'll be interested in LUCENE-7976 (Solr 7.4?), especially (probably) the idea of increasing the segment sizes and/or a related JIRA that allows you to tweak how aggressively Solr merges segments that have deleted docs. NOTE: that JIRA has the consequence that _by default_ an optimize with no parameters respects the maximum segment size, which is a change from current behavior. Finally, expungeDeletes may be useful as that too will respect max segment size, again after LUCENE-7976 is committed. Best, Erick On Wed, May 2, 2018 at 9:22 AM, Michael Joyner wrote: > The main reason we go this route is that after awhile (with default > settings) we end up with hundreds of shards and performance of course drops > abysmally as a result. By using a stepped optimize a) we don't run into the > we-need-3x+-head-room issue, b) the optimize performance penalty during > optimize is less than the performance penalty of hundreds of shards not > being optimized. > > BTW, as we use a batched insert/update cycle [once daily] we only do > optimize to a segment of 1 after a complete batch has been run. Though > during the batch we reduce segment counts down to a max of 16 every 250K > insert/updates to prevent the large segment count performance penalty. > > > On 04/30/2018 07:10 PM, Erick Erickson wrote: >> >> There's really no good way to purge deleted documents from the index >> other than to wait until merging happens. 
>> >> Optimize/forceMerge and expungeDeletes both suffer from the problem >> that they create massive segments that then stick around for a very >> long time, see: >> >> https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ >> >> Best, >> Erick >> >> On Mon, Apr 30, 2018 at 1:56 PM, Michael Joyner >> wrote: >>> >>> Based on experience, 2x head room is room is not always enough, sometimes >>> not even 3x, if you are optimizing from many segments down to 1 segment >>> in a >>> single go. >>> >>> We have however figured out a way that can work with as little as 51% >>> free >>> space via the following iteration cycle: >>> >>> public void solrOptimize() { >>> int initialMaxSegments = 256; >>> int finalMaxSegments = 1; >>> if (isShowSegmentCounter()) { >>> log.info("Optimizing ..."); >>> } >>> try (SolrClient solrServerInstance = getSolrClientInstance()){ >>> for (int segments=initialMaxSegments; >>> segments>=finalMaxSegments; segments--) { >>> if (isShowSegmentCounter()) { >>> System.out.println("Optimizing to a max of >>> "+segments+" >>> segments."); >>> } >>> solrServerInstance.optimize(true, true, segments); >>> } >>> } catch (SolrServerException | IOException e) { >>> throw new RuntimeException(e); >>> >>> } >>> } >>> >>> >>> On 04/30/2018 04:23 PM, Walter Underwood wrote: You need 2X the minimum index size in disk space anyway, so don’t worry about keeping the indexes as small as possible. Worry about having enough headroom. If your indexes are 250 GB, you need 250 GB of free space. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On Apr 30, 2018, at 1:13 PM, Antony A wrote: > > Thanks Erick/Deepak. > > The cloud is running on baremetal (128 GB/24 cpu). > > Is there an option to run a compact on the data files to make the size > equal on both the clouds? I am trying find all the options before I add > the > new fields into the production cloud. 
> > Thanks > AA > > On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson > > wrote: > >> Anthony: >> >> You are probably seeing the results of removing deleted documents from >> the shards as they're merged. Even on replicas in the same _shard_, >> the size of the index on disk won't necessarily be identical. This has >> to do with which segments are selected for merging, which are not >> necessarily coordinated across replicas. >> >> The test is if the number of docs on each collection is the same. If >> it is, then don't worry about index sizes. >> >> Best, >> Erick >> >> On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel >> wrote: >>> >>> Could you please also give the machine details of the two clouds you >>> are >>> running? >>> >>> >>> >>> Deepak >>> "The greatness of a
Re: Indexing throughput
I've seen 1.5 M docs/second. Basically the indexing throughput is gated by two things: 1> the number of shards. Indexing throughput essentially scales up reasonably linearly with the number of shards. 2> the indexing program that pushes data to Solr. Before thinking Solr is the bottleneck, check how fast your ETL process is pushing docs. This pre-supposes using SolrJ and CloudSolrClient for the final push to Solr. This pre-buckets the updates and sends the updates for each shard to the shard leader, thus reducing the amount of work Solr has to do. If you use SolrJ, you can easily do <2> above by just commenting out the single call that pushes the docs to Solr in your program. Speaking of which, it's definitely best to batch the updates, see: https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/ Best, Erick On Wed, May 2, 2018 at 10:07 AM, Walter Underwoodwrote: > We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb RAM > each > (EC2 C4.8xlarge). The collection is 24 million documents with four shards. > The cluster > is Solr 6.6.2. All storage is SSD EBS. > > We built a simple batch loader in Java. We get about one million documents > per minute > with 64 threads. We do not use the cloud-smart SolrJ client. We just send all > the > batches to the load balancer and let Solr sort it out. > > You are looking for 3 million documents per minute. You will just have to > test that. > > I haven’t tested it, but indexing should speed up linearly with the number of > shards, > because those are indexing in parallel. > > wunder > Walter Underwood > wun...@wunderwood.org > http://observer.wunderwood.org/ (my blog) > >> On May 2, 2018, at 9:58 AM, Greenhorn Techie >> wrote: >> >> Hi, >> >> The current hardware profile for our production cluster is 20 nodes, each >> with 24cores and 256GB memory. Data being indexed is very structured in >> nature and is about 30 columns or so, out of which half of them are >> categorical with a defined list of values. 
The expected peak indexing >> throughput is to be about *5* documents per second (expected to be done >> at off-peak hours so that search requests will be minimal during this time) >> and the average throughput around *1* documents (normal business >> hours). >> >> Given the hardware profile, is it realistic and practical to achieve the >> desired throughput? What factors affect the performance of indexing apart >> from the above hardware characteristics? I understand that its very >> difficult to provide any guidance unless a prototype is done. But wondering >> what are the considerations and dependencies we need to be aware of and >> whether our throughput expectations are realistic or not. >> >> Thanks >
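The advice above to batch updates (see the lucidworks link) can be sketched as a simple client-side chunker; the batch size of 1000 is a common starting point, not a tuned value, and the actual HTTP/SolrJ call is left out:

```python
def batches(docs, size=1000):
    # Yield fixed-size slices; each slice becomes one update request
    # instead of one request per document.
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

docs = [{"id": str(n)} for n in range(2500)]
sizes = [len(b) for b in batches(docs)]
print(sizes)
```

Each yielded batch would then be posted to the update handler (or passed to CloudSolrClient.add() in SolrJ) in a single call, ideally from several threads in parallel.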
Re: Solr Heap usage
On 5/1/2018 5:33 PM, Greenhorn Techie wrote: > Wondering what are the considerations to be aware to arrive at an optimal > heap size for Solr JVM? Though I did discuss this on the IRC, I am still > unclear on how Solr uses the JVM heap space. Are there any pointers to > understand this aspect better? I'm one of the people you've been chatting with on IRC. I also wrote the wiki page that Susheel has recommended to you. > Given that Solr requires an optimally configured heap, so that the > remaining unused memory can be used for OS disk cache, I wonder how to best > configure Solr heap. Also, on the IRC it was discussed that having 31GB of > heap is better than having 32GB due to Java’s internal usage of heap. Can > anyone guide further on heap configuration please? With the index size you mentioned on IRC, it's very difficult to project how much heap you're going to need. Actually setting up a system, putting data on it, and firing real queries at it may be the only way to be sure. The only concrete advice I can give you with the information available is this: Install as much memory as you can. It is extremely unlikely that you would ever have too much memory when you're dealing with terabyte-scale indexes. Heavy indexing (which you have mentioned as a requirement in another thread) will tend to require a larger heap. Thanks, Shawn
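A concrete way to apply the "31GB rather than 32GB" advice from the IRC discussion is the SOLR_HEAP variable in solr.in.sh. A minimal sketch; the exact cutoff depends on the JVM, the point being to stay under roughly 32GB so compressed ordinary object pointers remain enabled:

```shell
# solr.in.sh -- keep the heap just under the compressed-oops threshold
SOLR_HEAP="31g"
echo "$SOLR_HEAP"
```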
Re: Indexing throughput
We have a similar sized cluster, 32 nodes with 36 processors and 60 Gb RAM each (EC2 C4.8xlarge). The collection is 24 million documents with four shards. The cluster is Solr 6.6.2. All storage is SSD EBS. We built a simple batch loader in Java. We get about one million documents per minute with 64 threads. We do not use the cloud-smart SolrJ client. We just send all the batches to the load balancer and let Solr sort it out. You are looking for 3 million documents per minute. You will just have to test that. I haven’t tested it, but indexing should speed up linearly with the number of shards, because those are indexing in parallel. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) > On May 2, 2018, at 9:58 AM, Greenhorn Techie> wrote: > > Hi, > > The current hardware profile for our production cluster is 20 nodes, each > with 24cores and 256GB memory. Data being indexed is very structured in > nature and is about 30 columns or so, out of which half of them are > categorical with a defined list of values. The expected peak indexing > throughput is to be about *5* documents per second (expected to be done > at off-peak hours so that search requests will be minimal during this time) > and the average throughput around *1* documents (normal business > hours). > > Given the hardware profile, is it realistic and practical to achieve the > desired throughput? What factors affect the performance of indexing apart > from the above hardware characteristics? I understand that its very > difficult to provide any guidance unless a prototype is done. But wondering > what are the considerations and dependencies we need to be aware of and > whether our throughput expectations are realistic or not. > > Thanks
Indexing throughput
Hi, The current hardware profile for our production cluster is 20 nodes, each with 24cores and 256GB memory. Data being indexed is very structured in nature and is about 30 columns or so, out of which half of them are categorical with a defined list of values. The expected peak indexing throughput is to be about *5* documents per second (expected to be done at off-peak hours so that search requests will be minimal during this time) and the average throughput around *1* documents (normal business hours). Given the hardware profile, is it realistic and practical to achieve the desired throughput? What factors affect the performance of indexing apart from the above hardware characteristics? I understand that its very difficult to provide any guidance unless a prototype is done. But wondering what are the considerations and dependencies we need to be aware of and whether our throughput expectations are realistic or not. Thanks
Re: Learning to Rank (LTR) with grouping
Figured out that offset is used as part of the grouping patch which I applied (SOLR-8776), in solr/core/src/java/org/apache/solr/handler/component/QueryComponent.java:

+ if (query instanceof AbstractReRankQuery) {
+   topNGroups = cmd.getOffset() + ((AbstractReRankQuery)query).getReRankDocs();
+ } else {
+   topNGroups = cmd.getOffset() + cmd.getLen();

--Ilay -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: Too many commits
On 5/2/2018 4:54 AM, Patrick Recchia wrote: > I'm seeing way too many commits on our solr cluster, and I don't know why. Are you sure there are commits happening? Do you have logs actually saying that a commit is occurring? The creation of a new segment does not necessarily mean a commit happened -- this can happen even without a commit. > - IndexConfig is set to autoCommit every minute: > > ${solr.autoCommit.maxTime:6} < > openSearcher>true > > (solr.autoCommit.maxTime is not set) It's recommended to set openSearcher to false on autoCommit. Do you have autoSoftCommit configured? > There is nothing else customized (when it comes to IndexWriter, at least) > within solrconfig.xml > > The data is sent without commit, but with commitWithin=50 ms. > > All that said, I would have expected a rate of about 1 segment created epr > minute; of about 100MB. One of the events that can cause a new segment to be flushed is the ram buffer filling up. Solr defaults to a ramBufferSizeMB value of 100. But that does not translate to a segment size of 100MB -- it's merely the size of the ram buffer that Lucene uses for all the work related to building a segment. A segment resulting from a full memory buffer is going to be smaller than the buffer. I do not know how MUCH smaller, or what causes variations in that size. The general advice is to leave the buffer size alone. But with the high volume you've got, you might want to increase it so segments are not flushed as frequently. Be aware that increasing it will have an impact on how much heap memory gets used. Every Solr core (shard replica in SolrCloud terminology) that does indexing is going to need one of these ram buffers. Thanks, Shawn
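Shawn's recommendation above (openSearcher=false on autoCommit, with document visibility handled separately) would look roughly like this in solrconfig.xml; the interval values are illustrative, not prescriptive:

```xml
<autoCommit>
  <!-- hard commit: flush to durable storage, do not open a searcher -->
  <maxTime>${solr.autoCommit.maxTime:60000}</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- soft commit: controls when new documents become searchable -->
  <maxTime>${solr.autoSoftCommit.maxTime:120000}</maxTime>
</autoSoftCommit>
```

With this split, hard commits handle durability on their own schedule while soft commits (or commitWithin) govern visibility, avoiding the expensive searcher reopen on every hard commit.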
Re: Shard size variation
The main reason we go this route is that after awhile (with default settings) we end up with hundreds of shards and performance of course drops abysmally as a result. By using a stepped optimize a) we don't run into the we need the 3x+ head room issue, b) optimize performance penalty during optimize is less than the hundreds of shards not being optimized performance penalty. BTW, as we use batched a batch insert/update cycle [once daily] we only do optimize to a segment of 1 after a complete batch has been run. Though during the batch we reduce segment counts down to a max of 16 every 250K insert/updates to prevent the large segment count performance penalty. On 04/30/2018 07:10 PM, Erick Erickson wrote: There's really no good way to purge deleted documents from the index other than to wait until merging happens. Optimize/forceMerge and expungeDeletes both suffer from the problem that they create massive segments that then stick around for a very long time, see: https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/ Best, Erick On Mon, Apr 30, 2018 at 1:56 PM, Michael Joynerwrote: Based on experience, 2x head room is room is not always enough, sometimes not even 3x, if you are optimizing from many segments down to 1 segment in a single go. 
We have however figured out a way that can work with as little as 51% free space via the following iteration cycle: public void solrOptimize() { int initialMaxSegments = 256; int finalMaxSegments = 1; if (isShowSegmentCounter()) { log.info("Optimizing ..."); } try (SolrClient solrServerInstance = getSolrClientInstance()){ for (int segments=initialMaxSegments; segments>=finalMaxSegments; segments--) { if (isShowSegmentCounter()) { System.out.println("Optimizing to a max of "+segments+" segments."); } solrServerInstance.optimize(true, true, segments); } } catch (SolrServerException | IOException e) { throw new RuntimeException(e); } } On 04/30/2018 04:23 PM, Walter Underwood wrote: You need 2X the minimum index size in disk space anyway, so don’t worry about keeping the indexes as small as possible. Worry about having enough headroom. If your indexes are 250 GB, you need 250 GB of free space. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ (my blog) On Apr 30, 2018, at 1:13 PM, Antony A wrote: Thanks Erick/Deepak. The cloud is running on baremetal (128 GB/24 cpu). Is there an option to run a compact on the data files to make the size equal on both the clouds? I am trying find all the options before I add the new fields into the production cloud. Thanks AA On Mon, Apr 30, 2018 at 10:45 AM, Erick Erickson wrote: Anthony: You are probably seeing the results of removing deleted documents from the shards as they're merged. Even on replicas in the same _shard_, the size of the index on disk won't necessarily be identical. This has to do with which segments are selected for merging, which are not necessarily coordinated across replicas. The test is if the number of docs on each collection is the same. If it is, then don't worry about index sizes. Best, Erick On Mon, Apr 30, 2018 at 9:38 AM, Deepak Goel wrote: Could you please also give the machine details of the two clouds you are running? 
Deepak "The greatness of a nation can be judged by the way its animals are treated. Please stop cruelty to Animals, become a Vegan" +91 73500 12833 deic...@gmail.com Facebook: https://www.facebook.com/deicool LinkedIn: www.linkedin.com/in/deicool "Plant a Tree, Go Green" Make In India : http://www.makeinindia.com/home On Mon, Apr 30, 2018 at 9:51 PM, Antony A wrote: Hi Shawn, The cloud is running version 6.2.1. with ClassicIndexSchemaFactory The sum of size from admin UI on all the shards is around 265 G vs 224 G between the two clouds. I created the collection using "numShards" so compositeId router. If you need more information, please let me know. Thanks AA On Mon, Apr 30, 2018 at 10:04 AM, Shawn Heisey wrote: On 4/30/2018 9:51 AM, Antony A wrote: I am running two separate solr clouds. I have 8 shards in each with a total of 300 million documents. Both the clouds are indexing the document from the same source/configuration. I am noticing there is a difference in the size of the collection between them. I am planning to add more shards to see if that helps solve the issue. Has anyone come across similar issue? There's no information here about exactly what you are seeing, what you are expecting to see, and why you believe that what you are seeing is wrong. You did say that there is "a difference in size". That is a very vague
RE: Collection reload leaves dangling SolrCore instances
Sounds just like it, i will check it out! Thanks both! Markus -Original message- > From:Erick Erickson> Sent: Wednesday 2nd May 2018 17:21 > To: solr-user > Subject: Re: Collection reload leaves dangling SolrCore instances > > Markus: > > You may well be hitting SOLR-11882. > > On Wed, May 2, 2018 at 8:18 AM, Shawn Heisey wrote: > > On 5/2/2018 4:40 AM, Markus Jelsma wrote: > >> One of our collections, that is heavy with tons of TokenFilters using > >> large dictionaries, has a lot of trouble dealing with collection reload. I > >> removed all custom plugins from solrconfig, dumbed the schema down and > >> removed all custom filters and replaced a customized decompounder with > >> Lucene's vanilla filter, and the problem still exists. > >> > >> After collection reload a second SolrCore instance appears for each real > >> core in use, each next reload causes the number of instances to grow. The > >> dangling instances are eventually removed except for one or two. When > >> working locally with for example two shards/one replica in one JVM, a > >> single reload eats about 500 MB for each reload. > >> > >> How can we force Solr to remove those instances sooner? Forcing a GC won't > >> do it so it seems Solr itself actively keeps some stale instances alive. > > > > Custom plugins, which you did mention, would be the most likely > > culprit. Those sometimes have bugs where they don't properly close > > resources. Are you absolutely sure that there is no custom software > > loading at all? Removing the jars entirely (not just the config that > > might use the jars) might be required. > > > > Have you been able to get heap dumps and figure out what object is > > keeping the SolrCore alive? > > > > Thanks, > > Shawn > > >
Re: SolrCloud replicaition
That's a pretty open-ended question. The short form is when the replica switches back to "active" (or green on the admin UI) then it's been caught up. This is all about NRT replicas. PULL and TLOG replicas pull the segments from the leader so the idea of "sending a doc to the replica" doesn't really apply. Well, TLOG replicas get a copy of the doc for _their_ tlogs but there is no active indexing going on. Best, Erick On Wed, May 2, 2018 at 8:43 AM, kumar gauravwrote: > Hi Erick > > What will happen after replica recovered ? Is leader continuously > checks status of replica and send again after recovered or replica will > pull document for indexing after recovering ? > > Please clarify this behavior for all of Replica types i.e. NRT, TLOG and > PULL. (i have implemented solr 7.3 ) > > Thanks . > Kumar Gaurav > > > On Wed, May 2, 2018 at 9:04 PM, Erick Erickson > wrote: > >> 1> When the replica fails, the leader tries to resend it, and if the >> resends fail, >> then the follower goes into recovery which will eventually get the >> document >> caught up. >> >> 2> Yes, the client will get a failure indication. >> >> Best, >> Erick >> >> On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie >> wrote: >> > Hi, >> > >> > Good Morning!! >> > >> > In the case of a SolrCloud setup with sharing and replication in place, >> > when a document is sent for indexing, what happens when only the shard >> > leader has indexed the document, but the replicas failed, for whatever >> > reason. Will the document be resent by the leader to the replica shards >> to >> > index the document after sometime or how is scenario addressed? >> > >> > Also, given the above context, when I set the value of min_rf parameter >> to >> > say 2, does that mean the calling application will be informed that the >> > indexing failed? >>
Re: SolrCloud replicaition
Hi Erick What will happen after replica recovered ? Is leader continuously checks status of replica and send again after recovered or replica will pull document for indexing after recovering ? Please clarify this behavior for all of Replica types i.e. NRT, TLOG and PULL. (i have implemented solr 7.3 ) Thanks . Kumar Gaurav On Wed, May 2, 2018 at 9:04 PM, Erick Ericksonwrote: > 1> When the replica fails, the leader tries to resend it, and if the > resends fail, > then the follower goes into recovery which will eventually get the > document > caught up. > > 2> Yes, the client will get a failure indication. > > Best, > Erick > > On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie > wrote: > > Hi, > > > > Good Morning!! > > > > In the case of a SolrCloud setup with sharing and replication in place, > > when a document is sent for indexing, what happens when only the shard > > leader has indexed the document, but the replicas failed, for whatever > > reason. Will the document be resent by the leader to the replica shards > to > > index the document after sometime or how is scenario addressed? > > > > Also, given the above context, when I set the value of min_rf parameter > to > > say 2, does that mean the calling application will be informed that the > > indexing failed? >
Re: SolrCloud replicaition
1> When the replica fails, the leader tries to resend it, and if the resends fail, then the follower goes into recovery which will eventually get the document caught up. 2> Yes, the client will get a failure indication. Best, Erick On Wed, May 2, 2018 at 3:03 AM, Greenhorn Techie wrote: > Hi, > > Good Morning!! > > In the case of a SolrCloud setup with sharing and replication in place, > when a document is sent for indexing, what happens when only the shard > leader has indexed the document, but the replicas failed, for whatever > reason. Will the document be resent by the leader to the replica shards to > index the document after sometime or how is scenario addressed? > > Also, given the above context, when I set the value of min_rf parameter to > say 2, does that mean the calling application will be informed that the > indexing failed?
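As Shalin notes elsewhere in this thread, min_rf does not make the update fail; the client must compare the achieved replication factor reported back by Solr and retry itself. A hedged sketch of that handshake; the response dict is a stub standing in for a parsed Solr JSON response, and the "rf" field location should be verified against your Solr version:

```python
from urllib.parse import urlencode

def needs_retry(response, min_rf=2):
    # Solr reports the achieved replication factor; if it is below
    # what we asked for, the update reached fewer live replicas and
    # the client should re-send it later.
    achieved = response.get("responseHeader", {}).get("rf", 0)
    return achieved < min_rf

params = urlencode({"min_rf": 2, "commitWithin": 5000})
stub = {"responseHeader": {"status": 0, "rf": 1}}
print(params, needs_retry(stub))
```

The point of the pattern: a status of 0 with rf below min_rf is still a "success" at the HTTP level, so the retry logic has to live in the indexing application, not in Solr.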
Re: Too many commits
Two possibilities: 1> you have multiple replicas in the same JVM and are seeing commits happen with all of them. 2> ramBufferSizeMB: when you index docs, segments are flushed when the in-memory structures exceed this limit. Is this perhaps what you're seeing? Best, Erick On Wed, May 2, 2018 at 3:54 AM, Patrick Recchia wrote: > Hello, > > I'm seeing way too many commits on our solr cluster, and I don't know why. > > Here is the landscape: > - Each collection we create (one per day) is created with 10 shards with 2 > replicas each. > - we send live data, 2B records / day. so on average 200M records/shard per > day - for a size of approx 180GB/shard*day. > on peak hours that makes approx 10M records/hour; > - so approx. 15 records/minute. For a size of ~115MB/Minute? > > - IndexConfig is set to autoCommit every minute: > > ${solr.autoCommit.maxTime:6} < > openSearcher>true > > (solr.autoCommit.maxTime is not set) > > There is nothing else customized (when it comes to IndexWriter, at least) > within solrconfig.xml > > The data is sent without commit, but with commitWithin=50 ms. > > All that said, I would have expected a rate of about 1 segment created per > minute; of about 100MB. > > Instead of that, I see a lot of very small segments (between a few KB to a few > MB) with a very high rate. > > And I have no idea why this would happen. > Where can I look to explain such a rate of segments being written? > > > > > > -- > One way of describing a computer is as an electric box which hums. > Never ascribe to malice what can be explained by stupidity > -- > Patrick Recchia > GSM (BE): +32 486 828311 > GSM(IT): +39 347 2300830
Re: Collection reload leaves dangling SolrCore instances
Markus: You may well be hitting SOLR-11882. On Wed, May 2, 2018 at 8:18 AM, Shawn Heiseywrote: > On 5/2/2018 4:40 AM, Markus Jelsma wrote: >> One of our collections, that is heavy with tons of TokenFilters using large >> dictionaries, has a lot of trouble dealing with collection reload. I removed >> all custom plugins from solrconfig, dumbed the schema down and removed all >> custom filters and replaced a customized decompounder with Lucene's vanilla >> filter, and the problem still exists. >> >> After collection reload a second SolrCore instance appears for each real >> core in use, each next reload causes the number of instances to grow. The >> dangling instances are eventually removed except for one or two. When >> working locally with for example two shards/one replica in one JVM, a single >> reload eats about 500 MB for each reload. >> >> How can we force Solr to remove those instances sooner? Forcing a GC won't >> do it so it seems Solr itself actively keeps some stale instances alive. > > Custom plugins, which you did mention, would be the most likely > culprit. Those sometimes have bugs where they don't properly close > resources. Are you absolutely sure that there is no custom software > loading at all? Removing the jars entirely (not just the config that > might use the jars) might be required. > > Have you been able to get heap dumps and figure out what object is > keeping the SolrCore alive? > > Thanks, > Shawn >
Re: SolrCloud Sharding
1> You have to prototype, see: https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ 2> No. It could be done, but it'd take some very careful work. Basically you'd have to merge "adjacent" shards where "adjacent" is measured by the shard range of each replica, then fiddle with the state.json file and hope you get it all right. I'm not sure whether the new autoscaling stuff will handle this or not. 3> Yep. But why bother reducing the number of shards? Agreed, there's a little overhead in having more shards than you need, but you can host multiple replicas in the same JVM so as long as you get satisfactory performance, there's no particularly good reason to merge them that pops to mind. Best, Erick On Wed, May 2, 2018 at 6:22 AM, Greenhorn Techie wrote: > Hi, > > I have few questions on sharding in a SolrCloud setup: > > 1. How to know the optimal number of shards required for a SolrCloud setup? > What are the factors to consider to decide on the value for *numShards* > parameter? > 2. In case if over sharding has been done i.e. if numShards has been set to > a very high value, is there a mechanism to merge multiple shards in a > SolrCloud setup? > 3. In case if no such merge mechanism is available, is reindexing the only > option to set numShards to a new lower value? > > Thnx.
Re: Collection reload leaves dangling SolrCore instances
On 5/2/2018 4:40 AM, Markus Jelsma wrote: > One of our collections, that is heavy with tons of TokenFilters using large > dictionaries, has a lot of trouble dealing with collection reload. I removed > all custom plugins from solrconfig, dumbed the schema down and removed all > custom filters and replaced a customized decompounder with Lucene's vanilla > filter, and the problem still exists. > > After collection reload a second SolrCore instance appears for each real core > in use, each next reload causes the number of instances to grow. The dangling > instances are eventually removed except for one or two. When working locally > with for example two shards/one replica in one JVM, a single reload eats > about 500 MB for each reload. > > How can we force Solr to remove those instances sooner? Forcing a GC won't do > it so it seems Solr itself actively keeps some stale instances alive. Custom plugins, which you did mention, would be the most likely culprit. Those sometimes have bugs where they don't properly close resources. Are you absolutely sure that there is no custom software loading at all? Removing the jars entirely (not just the config that might use the jars) might be required. Have you been able to get heap dumps and figure out what object is keeping the SolrCore alive? Thanks, Shawn
Re: count mismatch: number of records indexed
And if you _do_ have a uniqueKey ("id" by default), subsequent records will overwrite older records with the same key. The tip from Annamaneni is the first thing I'd try though: make sure you've issued a commit. Best, Erick On Wed, May 2, 2018 at 7:09 AM, ANNAMANENI RAVEENDRA wrote: > Possible cases can be > > If you don't have a unique key then there are high chances that you will see > less data > Try a hard commit or check your commit times (hard/soft) > > > On Wed, May 2, 2018 at 9:30 AM Srinivas Kashyap < > srini...@tradestonesoftware.com> wrote: > >> Hi, >> >> I have a standalone solr index server 5.2.1 and have a core with 15 >> fields (all indexed and stored). >> >> Through DIH I'm indexing the data (around 65 million records). The index >> process took 6 hours to complete. But after the completion when I checked >> through the Solr admin query console (*:*), numFound is only 41 thousand >> records. Am I missing some configuration to index all records? >> >> Physical memory: 16GB >> JVM memory: 4GB >> >> Thanks, >> Srinivas >>
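Erick's point about uniqueKey overwrites can be sketched in a few lines of Python (a toy model of the add-by-id semantics, not a Solr client call):

```python
# Documents sent for indexing; two of them share the same uniqueKey "1".
docs = [
    {"id": "1", "name": "first"},
    {"id": "2", "name": "second"},
    {"id": "1", "name": "third"},   # same id as the first doc
]

# Solr's add semantics with a uniqueKey behave like a dict keyed by id:
# the last document sent with a given key wins.
index = {}
for doc in docs:
    index[doc["id"]] = doc

print(len(index))            # 2 -- numFound is smaller than records sent
print(index["1"]["name"])    # third -- the latest version survives
```

If the DIH query can emit the same id for many rows (for example, a join that reuses a parent key), 65 million input rows can easily collapse to 41 thousand unique ids.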
Re: Query regarding solr 7.3.0
Just what it says. Solr/Lucene like lots of file handles; I regularly see several thousand. If you run out of file handles Solr stops working. Ditto processes: Solr in particular spawns a lot of threads, particularly when handling many incoming requests through Jetty. If you exceed the limit, requests fail. The comments about solr.in.sh are only if you want to stop the warning. To really fix the underlying issue you need to talk to your system administrator and up the ulimit values. This is a system-level operation, not something configured in Solr. Best, Erick On Wed, May 2, 2018 at 4:52 AM, Agarwal, Monica (Nokia - IN/Bangalore) wrote: > Hi , > > I am trying to upgrade solr from 7.1.0 to 7.3.0 . > > While trying to start the solr process the below warnings are observed: > > *** [WARN] *** Your open file limit is currently 1024. > It should be set to 65000 to avoid operational disruption. > If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false > in your profile or solr.in.sh > *** [WARN] *** Your Max Limit is currently 1024. > It should be set to 65000 to avoid operational disruption. > If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false > in your profile or solr.in.sh > > Could anyone of you help me in understanding these warnings if it could lead > to some issues. > Also if I need to do any configuration changes in solr.in.sh file. > > Regards, > Monica > >
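The limits Erick mentions can also be inspected from inside a process; here is a sketch using Python's standard resource module (Unix only — the 65000 figure is simply the threshold from Solr's startup warning, not a hard requirement):

```python
import resource

# Soft/hard limits on open file descriptors (what "open file limit" checks).
soft_files, hard_files = resource.getrlimit(resource.RLIMIT_NOFILE)
# Soft/hard limits on processes/threads (what `ulimit -u` reports).
soft_procs, hard_procs = resource.getrlimit(resource.RLIMIT_NPROC)

print(f"open files: soft={soft_files}, hard={hard_files}")
print(f"processes:  soft={soft_procs}, hard={hard_procs}")

SOLR_RECOMMENDED = 65000  # value from Solr's warning message
if 0 < soft_files < SOLR_RECOMMENDED:
    print("open-file soft limit is below Solr's recommended 65000")
```

Raising the limits themselves is still the system administrator's job (e.g. via /etc/security/limits.conf on Linux); this only verifies what the running process actually received.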
Query regarding solr 7.3.0
Hi , I am trying to upgrade solr from 7.1.0 to 7.3.0 . While trying to start the solr process the below warnings are observed: *** [WARN] *** Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption. If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh *** [WARN] *** Your Max Limit is currently 1024. It should be set to 65000 to avoid operational disruption. If you no longer wish to see this warning, set SOLR_ULIMIT_CHECKS to false in your profile or solr.in.sh Could anyone of you help me in understanding these warnings if it could lead to some issues. Also if I need to do any configuration changes in solr.in.sh file. Regards, Monica
Re: Solr working £ Symbol
On 5/2/2018 3:13 AM, Mohan Cheema wrote: > We are using Solr to index our data. The data contains £ symbol within the > text and for currency. When data is exported from the source system data > contains £ symbol, however, when the data is imported into the Solr £ symbol > is converted to �. > > How can we keep the £ symbol as is when importing data? What tools are you using to look at Solr results? What tools are you using to send update data to Solr? Solr expects and delivers UTF-8 characters. If the data you're sending to Solr is using another character set, Java may not interpret it correctly. Conversely, if whatever you're using to look at Solr's results is also not expecting/displaying UTF-8, you might not be shown correct characters. Thanks, Shawn
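Both failure modes Shawn describes are easy to reproduce with the pound sign itself (U+00A3 is one byte, 0xA3, in Latin-1 but two bytes, 0xC2 0xA3, in UTF-8):

```python
pound = "\u00a3"                         # the pound sign

utf8_bytes = pound.encode("utf-8")       # b'\xc2\xa3'
latin1_bytes = pound.encode("latin-1")   # b'\xa3'

# Latin-1 bytes read as UTF-8: 0xA3 on its own is not a valid UTF-8
# sequence, so the reader substitutes U+FFFD -- the diamond/question-mark
# replacement character seen in the original mail.
print(latin1_bytes.decode("utf-8", errors="replace"))  # '\ufffd'

# The reverse mistake -- UTF-8 bytes read as Latin-1 -- yields mojibake
# ('Â£') rather than a replacement character.
print(utf8_bytes.decode("latin-1"))
```

Seeing U+FFFD in the admin UI therefore suggests the bytes were already non-UTF-8 by the time Solr received them; checking the post tool's JVM encoding (e.g. passing -Dfile.encoding=UTF-8) would be a reasonable next step.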
Re: count mismatch: number of records indexed
Possible cases can be: If you don't have a unique key then there are high chances that you will see less data. Try a hard commit or check your commit times (hard/soft). On Wed, May 2, 2018 at 9:30 AM Srinivas Kashyap < srini...@tradestonesoftware.com> wrote: > Hi, > > I have a standalone solr index server 5.2.1 and have a core with 15 > fields (all indexed and stored). > > Through DIH I'm indexing the data (around 65 million records). The index > process took 6 hours to complete. But after the completion when I checked > through the Solr admin query console (*:*), numFound is only 41 thousand > records. Am I missing some configuration to index all records? > > Physical memory: 16GB > JVM memory: 4GB > > Thanks, > Srinivas >
Is it normal for BlendedInfixLookupFactory to not show terms?
BlendedInfixLookupFactory is not returning terms, but returns the whole field value. If I change to FuzzyLookupFactory it works fine. Am I doing something wrong? The suggester configuration (its XML tags were stripped in the archive; only the values survive) was: default / BlendedInfixLookupFactory / position_linear / DocumentDictionaryFactory / weight / text_suggest / language / textSuggest / true -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
count mismatch: number of records indexed
Hi, I have a standalone solr index server 5.2.1 and have a core with 15 fields (all indexed and stored). Through DIH I'm indexing the data (around 65 million records). The index process took 6 hours to complete. But after the completion when I checked through the Solr admin query console (*:*), numFound is only 41 thousand records. Am I missing some configuration to index all records? Physical memory: 16GB JVM memory: 4GB Thanks, Srinivas
SolrCloud Sharding
Hi, I have few questions on sharding in a SolrCloud setup: 1. How to know the optimal number of shards required for a SolrCloud setup? What are the factors to consider to decide on the value for *numShards* parameter? 2. In case if over sharding has been done i.e. if numShards has been set to a very high value, is there a mechanism to merge multiple shards in a SolrCloud setup? 3. In case if no such merge mechanism is available, is reindexing the only option to set numShards to a new lower value? Thnx.
Re: Query Regarding Solr Garbage Collection
A very high rate of indexing documents could cause heap usage to go high (all the temporary objects being created live in JVM memory, and at a very high rate heap utilization may climb). Caches that are not sized/set correctly would also result in high JVM usage: as searches happen, they keep filling the caches and thus the JVM heap. Other factors like sorting/faceting etc. also require JVM memory, and deep paging could even cause the JVM to run out of memory/OOM. Thnx On Tue, May 1, 2018 at 6:18 PM, Greenhorn Techie wrote: > Hi, > > Following the https://wiki.apache.org/solr/SolrPerformanceFactors article, > I understand that Garbage Collection might be triggered due to significant > increase in JVM heap usage unless a commit is performed. Given this > background, I am curious to understand the reasons / factors that > contribute to increased heap usage of Solr JVM, which would thus force a > Garbage Collection cycle. > > Especially, what are the factors that contribute to heap usage increase > during indexing time and what factors contribute during search/query time? > > Thanks >
Re: Solr Heap usage
Take a look at https://wiki.apache.org/solr/SolrPerformanceProblems. The section "How much heap space do I need?" talks about that. Caches also live in the JVM heap, so take a look at how much you need/are allocating for the different caches. Thnx On Tue, May 1, 2018 at 7:33 PM, Greenhorn Techie wrote: > Hi, > > Wondering what are the considerations to be aware to arrive at an optimal > heap size for Solr JVM? Though I did discuss this on the IRC, I am still > unclear on how Solr uses the JVM heap space. Are there any pointers to > understand this aspect better? > > Given that Solr requires an optimally configured heap, so that the > remaining unused memory can be used for OS disk cache, I wonder how to best > configure Solr heap. Also, on the IRC it was discussed that having 31GB of > heap is better than having 32GB due to Java’s internal usage of heap. Can > anyone guide further on heap configuration please? > > Thanks >
Autocomplete returning shingles
I need to use autocomplete with edismax (ngrams, edgegrams) to return shingled suggestions. The field value "new york city" needs to return, on query "ne": "new", "new york", "new york city". With the suggester this is easy, but I'm forced to use edismax because I need to apply multiple filter queries. What is the best approach to deal with this? Any suggestions are appreciated. -- Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
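The desired behavior can be modeled in plain Python (this only sketches the token stream a shingle-plus-edge-ngram analysis chain would produce; the field and analyzer names are up to your schema):

```python
def leading_shingles(text, max_words=3):
    """Word shingles anchored at the start of the field value:
    'new york city' -> ['new', 'new york', 'new york city']."""
    words = text.split()
    return [" ".join(words[:n]) for n in range(1, min(max_words, len(words)) + 1)]

def suggest(field_value, typed_prefix):
    """Return every shingle matching the typed prefix, the way edge
    n-grams built over shingles would match at query time."""
    return [s for s in leading_shingles(field_value) if s.startswith(typed_prefix)]

print(suggest("new york city", "ne"))
# ['new', 'new york', 'new york city']
```

In schema terms that would roughly mean a copyField analyzed with a shingle filter followed by an edge-ngram filter at index time, queried through edismax alongside the usual fq parameters; treat that chain as a starting point to verify, not a drop-in config.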
Too many commits
Hello, I'm seeing way too many commits on our solr cluster, and I don't know why. Here is the landscape: - Each collection we create (one per day) is created with 10 shards with 2 replicas each. - we send live data, 2B records / day. so on average 200M records/shard per day - for a size of approx 180GB/shard*day. on peak hours that makes approx 10M records/hour; - so approx. 15 records/minute. For a size of ~115MB/minute? - IndexConfig is set to autoCommit every minute: <autoCommit> <maxTime>${solr.autoCommit.maxTime:60000}</maxTime> <openSearcher>true</openSearcher> </autoCommit> (solr.autoCommit.maxTime is not set) There is nothing else customized (when it comes to IndexWriter, at least) within solrconfig.xml The data is sent without commit, but with commitWithin=50 ms. All that said, I would have expected a rate of about 1 segment created per minute; of about 100MB. Instead of that, I see a lot of very small segments (between a few KB to a few MB) with a very high rate. And I have no idea why this would happen. Where can I look to explain such a rate of segments being written? -- One way of describing a computer is as an electric box which hums. Never ascribe to malice what can be explained by stupidity -- Patrick Recchia GSM (BE): +32 486 828311 GSM(IT): +39 347 2300830
Collection reload leaves dangling SolrCore instances
Hello, One of our collections, that is heavy with tons of TokenFilters using large dictionaries, has a lot of trouble dealing with collection reload. I removed all custom plugins from solrconfig, dumbed the schema down and removed all custom filters and replaced a customized decompounder with Lucene's vanilla filter, and the problem still exists. After collection reload a second SolrCore instance appears for each real core in use, each next reload causes the number of instances to grow. The dangling instances are eventually removed except for one or two. When working locally with for example two shards/one replica in one JVM, a single reload eats about 500 MB for each reload. How can we force Solr to remove those instances sooner? Forcing a GC won't do it so it seems Solr itself actively keeps some stale instances alive. Many thanks, Markus
SolrCloud replicaition
Hi, Good Morning!! In the case of a SolrCloud setup with sharding and replication in place, when a document is sent for indexing, what happens when only the shard leader has indexed the document, but the replicas failed, for whatever reason? Will the document be resent by the leader to the replica shards to index after some time, or how is this scenario addressed? Also, given the above context, when I set the value of the min_rf parameter to say 2, does that mean the calling application will be informed that the indexing failed?
Solr working £ Symbol
Hi There, We are using Solr to index our data. The data contains £ symbol within the text and for currency. When data is exported from the source system data contains £ symbol, however, when the data is imported into the Solr £ symbol is converted to �. How can we keep the £ symbol as is when importing data? Note: When a file is viewed using less the pound symbol is displayed as and when viewed in vi editor it shows up properly. Regards, Mohan Disclaimer: www.arrkgroup.com/EmailDisclaimer
Re: Regarding LTR feature
Hi Alessandro, Thanks for responding. Let me take a step back and tell you the problem I have been facing with this. So one of the features in my LTR model is: { "store" : "my_feature_store", "name" : "in_aggregated_terms", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" : "{!func}scale(query({!payload_score f=aggregated_terms func=max v=${query}}),0,100)" } } So now with this feature, if I apply an FQ in Solr it will scale the values for all the documents irrespective of the FQ filter. But if I change the feature to something like this: { "store" : "my_feature_store", "name" : "in_aggregated_terms", "class" : "org.apache.solr.ltr.feature.SolrFeature", "params" : { "q" : "{!func}scale(query({!field f=aggregated_terms v=${query}}),0,100)" } } then it scales properly with the FQ as well. As for the verification: I simply check the results returned; in case 1, after applying the FQ filter, the feature score doesn't scale up to its maximum value of 100, which I think is because it scales over all the documents and returns only the subset with the FQ filter applied. Alternatively, is there any way I can scale these values at normalization time with a customized class which iterates over only the re-ranked documents? Thanks a lot in advance. Looking forward to hearing back from you soon.
Regards, Prateek On 2018/04/30 11:58:44, Alessandro Benedetti wrote: > Hi Prateek, > with query and FQ Solr is expected to score a document only if that document > is a match of all the FQ results intersected with the query results [1]. > Then re-ranking happens, so effectively, only the top K intersected > documents will be re-ranked. > > If you are curious about the code, this can be debugged running a variation > of org.apache.solr.ltr.TestLTRWithFacet#testRankingSolrFacet (introducing > filter queries) and setting the breakpoint somewhere around: > org/apache/solr/ltr/LTRRescorer.java:181 > > Can you elaborate how you have verified that it is currently not working like > that? > I am familiar with LTR code and I would be surprised to see this different > behavior > > [1] https://lucidworks.com/2017/11/27/caching-and-filters-and-post-filters/ > > -- > Alessandro Benedetti > Search Consultant, R&D Software Engineer, Director > Sease Ltd. - www.sease.io > -- > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
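Regarding the last question in Prateek's mail — scaling only over the re-ranked documents — the arithmetic of such a custom normalization step can be sketched like this (values are illustrative; this is not the Solr LTR API, just the min-max computation a custom normalizer would apply to the top-K feature scores):

```python
def scale_over_topk(scores, lo=0.0, hi=100.0):
    """Min-max scale the raw feature scores of the top-K re-ranked docs
    into [lo, hi], instead of scaling over the whole index."""
    mn, mx = min(scores), max(scores)
    if mx == mn:               # flat scores: avoid division by zero
        return [lo] * len(scores)
    return [lo + (s - mn) * (hi - lo) / (mx - mn) for s in scores]

# Hypothetical raw payload scores of the top-5 docs that survived the fq:
raw = [0.2, 1.4, 0.9, 1.4, 0.5]
print(scale_over_topk(raw))
# The best surviving docs now reach 100.0 even though the index-wide
# maximum was removed by the filter.
```

This is exactly the difference Prateek observes: scale() inside the feature query normalizes against the global extremes, while a post-re-ranking normalizer sees only the filtered subset.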