Solr cloud clusterstate.json update query ?
Could you clarify the following questions? 1. Is there a way to avoid all the nodes simultaneously going into recovery state when bulk indexing happens? Is there an API to disable replication on one node for a while? 2. We recently changed the host name on nodes in solr.xml, but the old host entries still exist in clusterstate.json marked as active, even though live_nodes has the correct information. Who updates clusterstate.json if a node goes down ungracefully, without notifying its down state? Thanks, Sai Sreenivas K
Alternate ways to facet spatial data
Hello all, I've just started using Solr for spatial queries and it looks great so far. I've mostly been investigating importing a large amount of point data, then indexing and searching it. I've discovered the facet.heatmap functionality, which is great, but I would like to ask whether it is possible to get slightly different results from this. Essentially, rather than a heatmap I would like either a polygon per cluster (might be too much computation?) or a point per cluster (a centroid would be great; the centre of the grid cell would be OK), coupled with the point count. Is this currently possible using faceting, or does it seem like a workable feature I could implement? Cheers, James Sewell, PostgreSQL Team Lead / Solutions Architect, Level 2, 50 Queen St, Melbourne VIC 3000
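A point-per-cell approximation can already be derived client-side from the facet.heatmap response: each cell of the counts_ints2D grid has a known bounding box, so you can emit the cell centre together with its count. A rough sketch (the helper name is mine, and it assumes Solr's row order is north-to-south, with null/None standing in for all-zero rows):

```python
def heatmap_cells_to_points(counts, min_x, max_x, min_y, max_y):
    """Convert a facet.heatmap counts_ints2D grid into (center_x, center_y,
    count) tuples, one per non-empty cell. Assumes the first row is the
    northernmost row, as Solr returns it; empty rows may arrive as None."""
    rows = len(counts)
    cols = max((len(r) for r in counts if r), default=0)
    if cols == 0:
        return []
    cell_w = (max_x - min_x) / cols
    cell_h = (max_y - min_y) / rows
    points = []
    for i, row in enumerate(counts):
        if not row:
            continue  # all-zero rows are returned as null
        for j, count in enumerate(row):
            if count:
                cx = min_x + (j + 0.5) * cell_w
                cy = max_y - (i + 0.5) * cell_h  # rows count down from max_y
                points.append((cx, cy, count))
    return points
```

This gives the centre of the grid cell, not a true centroid of the points inside it; a true centroid would need either the raw points or a custom faceting component.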
Re: Solr 5.1.0 Cloud and Zookeeper
Thank you very much for your answer. I installed ZooKeeper 3.4.6 on my Debian (Wheezy) system, and it's working well. The only problem I have is that I'm looking for an init script but cannot find one. I'm also trying to adapt the script in Debian's zookeeperd package, but I'm having some problems. Does anyone know of a working init script for ZooKeeper on Debian? 2015-05-05 15:30 GMT+02:00 Mark Miller markrmil...@gmail.com: A bug fix version difference probably won't matter. It's best to use the same version everyone else uses and the one our tests use, but it's very likely 3.4.5 will work without a hitch. - Mark
SolrCloud indexing
Hi all, I have 3 nodes and 3 shards, but looking at the SolrCloud admin I see that all the leaders are on the same node. If I understood the Solr documentation correctly (https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud): When a document is sent to a machine for indexing, the system first determines if the machine is a replica or a leader. If the machine is a replica, the document is forwarded to the leader for processing. If the machine is a leader, SolrCloud determines which shard the document should go to, forwards the document to the leader for that shard, indexes the document for this shard, and forwards the index notation to itself and any replicas. So I have 3 nodes, with 3 shards and 2 replicas of each shard. http://picpaste.com/pics/Screen_Shot_2015-05-05_at_15.19.54-Xp8uztpt.1430832218.png Does this mean that all the indexing is done by the leaders on one node? If so, how do I distribute the indexing (the shard leaders) across nodes? -- Vincenzo D'Amore email: v.dam...@gmail.com skype: free.dev mobile: +39 349 8513251
Solr Wordpress - one server or two?
I'm thinking of taking Solr for a test drive and will probably keep it if it works as I'm hoping, so I'd like to get it as right as possible the first time out. I'm running Wordpress on Ubuntu with PHP and MariaDB 10. The server is a 7-core, 4GB Azure VM. The database is 4GB. The data itself is mainly docs, PDFs, and app descriptions/images from iTunes and the Google Play Store. I have two questions: 1. Should I put Solr on the same server that hosts my site and DB, or create a second VM just for Solr? I'm mainly looking for speed here. 2. If I install on a second server, should I use Tomcat instead of Apache2? Any advice is much appreciated! Rob
Solr 5.1.0 Cloud and Zookeeper
Hi. I read on https://cwiki.apache.org/confluence/display/solr/Setting+Up+an+External+ZooKeeper+Ensemble that Solr needs to use the same ZooKeeper version it owns (at the moment 3.4.6). Debian Jessie has ZooKeeper 3.4.5 (https://packages.debian.org/jessie/zookeeper). Are you sure that this version won't work with Solr 5.1.0? Thank you very much for your help! Bye
Re: Solr 5.1.0 Cloud and Zookeeper
A bug fix version difference probably won't matter. It's best to use the same version everyone else uses and the one our tests use, but it's very likely 3.4.5 will work without a hitch. - Mark
Re: Multiple index.timestamp directories using up disk space
Worrying about data loss makes sense. If I understand the way Solr behaves, the new directory should only contain missing/changed segments. Since our application is extremely write heavy, with lots of inserts and deletes, almost every segment is touched even during a short window, so for our deployment it appears that every segment is copied over when replicas get out of sync. Thanks for clarifying this behaviour of SolrCloud so we can put external steps in place to resolve this situation when it arises. -Rishi

-Original Message- From: Mark Miller markrmil...@gmail.com Sent: Tue, Apr 28, 2015 10:52 am Subject: Re: Multiple index.timestamp directories using up disk space: If copies of the index are not eventually cleaned up, I'd file a JIRA to address the issue. Those directories should be removed over time. At times there will have to be a couple around at the same time, and others may take a while to clean up. - Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote: SolrCloud does need up to twice the amount of disk space as your usual index size during replication. Amongst other things, this ensures you have a full copy of the index at any point. There's no way around this; I would suggest you provision the additional disk space needed.

On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote: Hi All, We are seeing this problem with Solr 4.6 and Solr 4.10.3. For some reason, SolrCloud tries to recover and creates a new index directory (e.g. index.20150420181214550) while keeping the older index as-is. This creates an issue where the disk fills up and the shard never ends up recovering. Usually this requires manual intervention: bouncing the instance and wiping the disk clean to allow a clean recovery. Any ideas on how to prevent Solr from creating multiple copies of the index directory? Thanks, Rishi.
Re: Finding out optimal hash ranges for shard split
It looks like it's not possible to find out the optimal hash ranges for a split before you actually split it. So the only way out is to keep splitting the large sub-shards?
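For reference, the default SPLITSHARD behaviour is a plain bisection of the parent shard's signed 32-bit hash range; it is not informed by the actual distribution of document hashes, which is why an uneven sub-shard can only be split again (though I believe newer versions let you pass an explicit `ranges` parameter to SPLITSHARD if you can compute better split points yourself). A sketch of the default even split (the helper name is mine):

```python
def split_range(lower, upper, parts=2):
    """Split a shard's hash range [lower, upper] (inclusive, signed 32-bit
    for the compositeId router) into `parts` contiguous sub-ranges, mirroring
    what SPLITSHARD does by default: an even split, blind to doc hashes."""
    size = upper - lower + 1
    step = size // parts
    ranges = []
    start = lower
    for i in range(parts):
        # last sub-range absorbs any remainder so the union covers everything
        end = upper if i == parts - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

For the full compositeId range this bisects at zero: `split_range(-2**31, 2**31 - 1)` yields `[(-2147483648, -1), (0, 2147483647)]`.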
Re: SolrCloud indexing
bq: Does it mean that all the indexing is done by the leaders in one node? No. The raw document is forwarded from the leader to the replicas, and it's indexed on all the nodes. The leader has a little bit of extra work to do routing the docs, but that's it; it shouldn't be a problem with 3 shards. bq: If so, how do I distribute the indexing (the shard leaders) across nodes? You don't really need to bother, I don't think, especially if you don't see significantly higher CPU utilization on the leader. If you absolutely MUST distribute leadership, see the Collections API and the REBALANCELEADERS and BALANCESHARDUNIQUE commands (Solr 5.1 only), but frankly I wouldn't worry about it unless and until you have a demonstrated need. Best, Erick
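For reference, the two Collections API calls Erick mentions look roughly like this (host and collection name are placeholders; BALANCESHARDUNIQUE spreads the preferredLeader property across nodes, and REBALANCELEADERS then tries to make those replicas the actual leaders):

```text
# Assign the preferredLeader property evenly across nodes
http://host:8983/solr/admin/collections?action=BALANCESHARDUNIQUE&collection=mycollection&property=preferredLeader

# Make the preferred leaders the actual leaders
http://host:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycollection
```

Check the Collections API reference for your Solr version before relying on these, since both commands are new in the 5.x line.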
Re: Solr 5.0 - uniqueKey case insensitive ?
Well, "working fine" may be a bit of an overstatement. That has never been officially supported, so it just happened to work in 3.6. As Chris points out, if you're using SolrCloud then this will _not_ work, as routing happens early in the process, i.e. before the analysis chain gets the token, so various copies of the doc will exist on different shards. Best, Erick

On Mon, May 4, 2015 at 4:19 PM, Bruno Mannina bmann...@free.fr wrote: Hello Chris, yes, I confirm that on my Solr 3.6 it has worked fine for several years, and each doc added with the same code is updated, not added. To be clearer: I receive docs with a field named "pn", it's the uniqueKey, and it is always in uppercase, so I must define in my schema.xml:

<field name="id" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
<field name="pn" type="text_general" multiValued="true" indexed="true" stored="false"/>
...
<uniqueKey>id</uniqueKey>
...
<copyField source="id" dest="pn"/>

But the application that uses Solr already exists, so it queries the "pn" field, not "id"; I cannot change that. And in each doc I receive there is no "id" field, just a "pn" field; I cannot change that either. So there is a problem, no? I must import an "id" field and query a "pn" field, but I only have a "pn" field for import...

On 05/05/2015 01:00, Chris Hostetter wrote: : On Solr 3.6, I defined a string_ci field like this:
: <fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
:   <analyzer>
:     <tokenizer class="solr.KeywordTokenizerFactory"/>
:     <filter class="solr.LowerCaseFilterFactory"/>
:   </analyzer>
: </fieldType>
:
: <field name="pn" type="string_ci" multiValued="false" indexed="true" required="true" stored="true"/>

I'm really surprised that field would have worked for you (reliably) as a uniqueKey field even in Solr 3.6. The best practice for something like what you describe has always (going back to Solr 1.x) been to use a copyField to create a case-insensitive copy of your uniqueKey for searching.
if, for some reason, you really want case-insensitive *updates* (so a doc with id "foo" overwrites a doc with id "FOO"), then the only reliable way to make something like that work is to do the lowercasing in an UpdateProcessor, to ensure it happens *before* the docs are distributed to the correct shard, so that the correct existing doc is overwritten (even if you aren't using SolrCloud). -Hoss http://www.lucidworks.com/
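Hoss's point about doing the lowercasing *before* distribution can be illustrated with a toy router. Solr's compositeId router actually uses MurmurHash3 over the id; `zlib.crc32` below is just a stand-in to show that any such hash is case-sensitive, so "FOO" and "foo" can be routed to different shards unless the id is normalized first:

```python
import zlib

def shard_for(doc_id, num_shards):
    # crc32 as a stand-in for Solr's MurmurHash3: the point is only that
    # hashing the raw id is case-sensitive, not the exact hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

# "FOO" and "foo" hash differently, so without lowercasing before routing
# the two "same" docs may land (and live) on different shards:
assert zlib.crc32(b"FOO") != zlib.crc32(b"foo")

# Lowercasing in an update processor, before the routing decision,
# makes both spellings route identically:
assert shard_for("FOO".lower(), 3) == shard_for("foo", 3)
```

This is why a LowerCaseFilter in the analysis chain is too late: analysis runs on the shard that already received the doc, after routing has happened.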
Re: Solr cloud clusterstate.json update query ?
About 1: This shouldn't be happening, so I wouldn't concentrate on disabling replication first. The most common reason is that you have a short ZooKeeper timeout and the replicas go into a stop-the-world garbage collection that exceeds the timeout. So the first thing to do is to see whether that's happening. Here are a couple of good places to start: http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/ http://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr About 2: A partial answer is that ZK does a keep-alive type thing, and if the Solr nodes it knows about don't reply, it marks them as down. Best, Erick
Re: SolrCloud collection properties
_What_ properties? Details matter. And how do you do this now? Assuming you do this with separate conf directories, these are then just configsets in ZooKeeper, and you can have as many of them as you want. The problem is that each of them is a complete set of schema and config files; AFAIK the configset is the finest granularity you have OOB. Best, Erick
Re: Solr cloud clusterstate.json update query ?
About 2: live_nodes entries under ZooKeeper are ephemeral nodes (see ZooKeeper's ephemeral nodes), so once the connection from the Solr zkClient to ZooKeeper is lost, these nodes disappear automatically. AFAIK, clusterstate.json is updated by the Overseer based on messages published to a queue in ZooKeeper by the Solr zkClients. In the case where a Solr node dies ungracefully, I am not sure how this event is reflected in clusterstate.json. Can someone shed some light on ungraceful Solr shutdown and the consequent status update in clusterstate.json? I guess there would be some way, because all nodes in a cluster decide the cluster state based on the watched clusterstate.json node; they will not be watching live_nodes to update their state. Gopal
Re: Solr Wordpress - one server or two?
On 5/5/2015 6:11 AM, Robg50 wrote: I'm thinking of taking SOLR for a test drive and will probably keep it if it works as I'm hoping so I'd like to get it as right as possible the first time out. I'm running Wordpress on Ubuntu with php and Mariadb 10. The server is a 7 core, 4gb, Azure VM. The database is 4gb. The data itself is mainly docs, pdfs, and app descriptions/images from iTunes and Google Play Store. I have two questions: 1. Should I put SOLR on the same server that hosts my site and db or create a second VM just for SOLR? I'm looking for speed here, mainly. If I install on a 2nd server should I use Tomcat instead of Apache2? It's nearly impossible to answer your question with the information provided. We can make an educated guess if we have more info, but it would be a guess ... the only way to actually know is to prototype and try it. https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/ Also, be aware of general advice regarding memory and Solr performance. Unless your index is tiny, which probably won't be the case with a 4GB source database, 4GB of RAM may be too small for even a dedicated Solr machine. If the machine is NOT dedicated, the likelihood of 4GB being enough RAM is even smaller: http://wiki.apache.org/solr/SolrPerformanceProblems I can say something about your last question ... Solr won't run in Apache. It requires a Java servlet container. As long as I've been using it, it has shipped with an example that includes Jetty, but in 4.x and earlier versions there was a separately available war file for deployment in another app like Tomcat. The 5.x versions are shipped as a complete package with startup scripts that start Jetty. You *can* still find the .war and put it in another container like Tomcat, but we are discouraging that approach even more strongly than we did for past versions. Thanks, Shawn
SolrCloud collection properties
Hi, we are trying to migrate from Solr 4.10 to SolrCloud 4.10. I understand that SolrCloud uses collections as an abstraction over cores. What I am missing is a way to store collection-specific properties in ZooKeeper. Using property.foo=bar in CREATE URLs just sets core-specific properties, which are not distributed, e.g. if I migrate a shard from one node to another. How do I define collection-specific properties (to be used in solrconfig.xml and schema.xml) which get distributed with the collection to all nodes? Why do I want that? Currently we have different cores whose structure is identical but which each have some specific properties. I would like to have a single configuration for them in ZooKeeper, from which I want to create different collections that differ only in the value of some properties. Markus
Re: Solr Exception The remote server returned an error: (400) Bad Request.
Thanks for the answer, but I don't think that's going to solve my problem. For instance, if I copy this query into the Chrome browser: http://localhost:8080/solr48/person/select?q=CoreD:25 I get this error response (status 400): "undefined field CoreD". If I use wget from Linux: wget http://localhost:8080/solr48/person/select?q=CoreD:25 I get "ERROR: 400 Bad Request". Is there any reason why I am not getting the same error? Thanks
Re: Multiple index.timestamp directories using up disk space
On 5/5/2015 7:29 AM, Rishi Easwaran wrote: Worried about data loss makes sense. If I get the way solr behaves, the new directory should only have missing/changed segments. I guess since our application is extremely write heavy, with lot of inserts and deletes, almost every segment is touched even during a short window, so it appears like for our deployment every segment is copied over when replicas get out of sync. Once a segment is written, it is *NEVER* updated again. This aspect of Lucene indexes makes Solr replication more efficient. The ids of deleted documents are written to separate files specifically for tracking deletes. Those files are typically quite small compared to the index segments. Any new documents are inserted into new segments. When older segments are merged, the information in all of those segments is copied to a single new segment (minus documents marked as deleted), and then the old segments are erased. Optimizing replaces the entire index, and each replica of the index would be considered different, so an index recovery that happens after optimization might copy the whole thing. If you are seeing a lot of index recoveries during normal operation, chances are that your Solr servers do not have enough resources, and the resource that has the most impact on performance is memory. The amount of memory required for good Solr performance is higher than most people expect. It's a normal expectation that programs require memory to run, but Solr has an additional memory requirement that often surprises them -- the need for a significant OS disk cache: http://wiki.apache.org/solr/SolrPerformanceProblems Thanks, Shawn
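Shawn's description can be condensed into a toy model of Lucene's write-once segments (the class and method names here are illustrative, not Lucene's API): new docs always go into new segments, deletes are recorded separately, and only a merge rewrites data.

```python
class ToyIndex:
    """Toy model of write-once segments: an existing segment file is never
    modified; deletes are tracked alongside it, and merges copy live docs
    into a brand-new segment."""
    def __init__(self):
        self.segments = []   # each segment is an immutable set of doc ids
        self.deleted = set() # deleted ids, kept in small separate files

    def flush(self, docs):
        self.segments.append(frozenset(docs))  # new docs -> new segment

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # segment files themselves stay untouched

    def merge(self):
        # merging copies live docs into one new segment, dropping deletions,
        # then the old segments are erased
        live = {d for seg in self.segments for d in seg} - self.deleted
        self.segments = [frozenset(live)]
        self.deleted.clear()

    def live_docs(self):
        return {d for seg in self.segments for d in seg} - self.deleted
```

In this model it is easy to see why replication is cheap for a read-mostly index (old segments match byte-for-byte) but expensive after heavy merging or an optimize, when most segment files are new.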
Solr/ Solr Cloud meetup at Aol
Hi All, Aol is hosting a meetup in Dulles, VA. The topic this time is Solr / SolrCloud. http://www.meetup.com/Code-Brew/events/53217/ Thanks, Rishi.
Re: Multiple index.timestamp directories using up disk space
On 5/5/2015 1:15 PM, Rishi Easwaran wrote: Thanks for clarifying lucene segment behaviour. We don't trigger optimize externally, could it be internal solr optimize? Is there a setting/ knob to control when optimize occurs. Optimize never happens automatically, but *merging* does. An optimize is nothing more than a forced merge down to one segment. There is a merge policy, consulted anytime a new segment is created, that decides whether any automatic merges need to take place and what segments will be merged. That merge policy can be configured in solrconfig.xml. The behaviour we see multiple huge directories for the same core. Till we figure out what's going on, the only option we are left with it is to clean up the entire index to free up disk space, and allow a replica to sync from scratch. If multiple index directories exist after replication, there was either a problem that prevented the rename and deletion of the directories (common on Windows, less common on UNIX variants like Linux), or you're running into a bug. Unless you are performing maintenance or a machine goes down, index recovery (replication) should *not* be happening during normal operation of a SolrCloud cluster. Frequent index recoveries usually mean that there's a performance problem. Solr performs better on bare metal than on virtual machines. Thanks, Shawn
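As a concrete illustration of the merge policy Shawn mentions, automatic merging is configured in the indexConfig section of solrconfig.xml. The element names below are for Solr 4.x's TieredMergePolicy (the default) and the values are illustrative only; check them against your version's reference guide:

```xml
<indexConfig>
  <!-- TieredMergePolicy is the default merge policy; these knobs control
       how aggressively segments are merged (illustrative values). -->
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicy>
</indexConfig>
```

Note that no merge policy setting will produce a forced single-segment optimize on its own; that only happens via an explicit optimize/forceMerge request.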
Re: Limit Results By Score?
: We have implemented a custom scoring function and also need to limit the : results by score. How could we go about that? Alternatively, can we : suppress the results early using some kind of custom filter? In general, limiting by score is a bad idea for all of the reasons outlined here... https://wiki.apache.org/lucene-java/ScoresAsPercentages ...If you have defined a custom scoring function, then many of those issues may not apply, and you can use the frange parser to filter documents which do not have a score in a given range... https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser How exactly you use frange with your custom score depends on how you implement it -- if you implemented it directly as a ValueSource (w/ a ValueSourceParser) then you can just call that function directly. If you've implemented it as a custom similarity on regular query structures, you can still use frange and just wrap your query using the query() function. Either way, you can use the frange parser as part of a filter query to limit the results based on the score of your function -- independent of what your main query / sort are. If, in the latter case, you want to match and sort documents based on the same query, you can still do that using local params to refer to the query in both places... q=your_custom_query&sort=score desc&fq={!frange l=90}query($q) -Hoss http://www.lucidworks.com/
Re: Schema API: add-field-type
Hi Steve, responses inline below: On Apr 29, 2015, at 6:50 PM, Steven White swhite4...@gmail.com wrote: Hi Everyone, When I pass the following: http://localhost:8983/solr/db/schema/fieldtypes?wt=xml I see this (as one example):

<lst>
  <str name="name">date</str>
  <str name="class">solr.TrieDateField</str>
  <str name="precisionStep">0</str>
  <str name="positionIncrementGap">0</str>
  <arr name="fields">
    <str>last_modified</str>
  </arr>
  <arr name="dynamicFields">
    <str>*_dts</str>
    <str>*_dt</str>
  </arr>
</lst>

See how there are "fields" and "dynamicFields"? However, when I look in schema.xml, I see this:

<fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>

See how there is nothing about fields and dynamic fields. Now, when I look further into the schema.xml, I see they are coming from:

<field name="last_modified" type="date" indexed="true" stored="true"/>
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
<dynamicField name="*_dts" type="date" indexed="true" stored="true" multiValued="true"/>

So it all makes sense. Does this mean the response of fieldtypes includes fields and dynamicFields as syntactic sugar to let me know of the relationship this field type has, or is there more to it? It’s FYI: this is the full list of fields and dynamic fields that use the given field type. The reason why I care about this question is because I'm using Solr's Schema API (see: https://cwiki.apache.org/confluence/display/solr/Schema+API) to make changes to my schema. Per this link: https://cwiki.apache.org/confluence/display/solr/Schema+API#SchemaAPI-AddaNewFieldType it shows how to add a field type via "add-field-type", but there is no mention of fields or dynamic fields in this API. My assumption is that fields and dynamic fields need not be part of this API; instead it is done via "add-field" and "add-dynamic-field", thus what I see in the XML of the fieldtypes response is just syntactic sugar. Did I get all this right?
Yes, as you say, to add (dynamic) fields after adding a field type, you must use the “add-field” and “add-dynamic-field” commands. Note that you can do so in a single request if you like, as long as “add-field-type” is ordered before any referencing “add-field”/“add-dynamic-field” command. To be clear, the “add-field-type” command does not support passing in a set of fields and/or dynamic fields to be added with the new field type. Steve
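A single bulk request of the kind Steve describes might look like the following, POSTed to /solr/<collection>/schema (the type and field names here are made up for illustration; note the "add-field-type" command is ordered before the "add-field" that references it):

```json
{
  "add-field-type": {
    "name": "example_date_type",
    "class": "solr.TrieDateField",
    "precisionStep": "0",
    "positionIncrementGap": "0"
  },
  "add-field": {
    "name": "example_date",
    "type": "example_date_type",
    "indexed": true,
    "stored": true
  }
}
```

Commands in a bulk Schema API request are processed in order, which is what makes the single-request form safe here.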
Re: Editing the Solr Wiki
you should be good to go, thanks (in advance) for helping out with your edits. : http://www.manning.com/turnbull/. I have already set up an account with : the username NicoleButterfield. Many thanks in advance for your help -Hoss http://www.lucidworks.com/
Re: Multiple index.timestamp directories using up disk space
Hi Shawn, Thanks for clarifying lucene segment behaviour. We don't trigger optimize externally, could it be internal solr optimize? Is there a setting/ knob to control when optimize occurs. Thanks for pointing it out, will monitor memory closely. Though doubt memory is an issue, these are top tier machines with 144GB RAM supporting 12x4GB JVM's. Out of which 9 JVM's are running in cloud mode writing to SSD, should be enough memory leftover for OS cache. The behaviour we see multiple huge directories for the same core. Till we figure out what's going on, the only option we are left with it is to clean up the entire index to free up disk space, and allow a replica to sync from scratch. Thanks, Rishi. -Original Message- From: Shawn Heisey apa...@elyograg.org To: solr-user solr-user@lucene.apache.org Sent: Tue, May 5, 2015 10:55 am Subject: Re: Multiple index.timestamp directories using up disk space On 5/5/2015 7:29 AM, Rishi Easwaran wrote: Worried about data loss makes sense. If I get the way solr behaves, the new directory should only have missing/changed segments. I guess since our application is extremely write heavy, with lot of inserts and deletes, almost every segment is touched even during a short window, so it appears like for our deployment every segment is copied over when replicas get out of sync. Once a segment is written, it is *NEVER* updated again. This aspect of Lucene indexes makes Solr replication more efficient. The ids of deleted documents are written to separate files specifically for tracking deletes. Those files are typically quite small compared to the index segments. Any new documents are inserted into new segments. When older segments are merged, the information in all of those segments is copied to a single new segment (minus documents marked as deleted), and then the old segments are erased. 
Optimizing replaces the entire index, and each replica of the index would be considered different, so an index recovery that happens after optimization might copy the whole thing.

If you are seeing a lot of index recoveries during normal operation, chances are that your Solr servers do not have enough resources, and the resource that has the most impact on performance is memory. The amount of memory required for good Solr performance is higher than most people expect. It's a normal expectation that programs require memory to run, but Solr has an additional memory requirement that often surprises people -- the need for a significant OS disk cache: http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks, Shawn
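The write-once segment behaviour Shawn describes can be sketched as a toy model. This is illustrative Python only — the class and method names are invented and this is not Lucene code — but it shows why new documents never touch old segments, why deletes are cheap side files, and why a merge (or optimize) rewrites data on disk:

```python
class Segment:
    """A write-once batch of documents; never modified after creation."""
    def __init__(self, docs):
        self.docs = dict(docs)   # doc_id -> content, frozen once written
        self.deleted = set()     # small side structure of deleted ids

class ToyIndex:
    def __init__(self):
        self.segments = []

    def add_batch(self, docs):
        # New documents always land in a brand-new segment.
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        # A delete only marks the id in the tiny side structure;
        # the segment's document data itself stays untouched.
        for seg in self.segments:
            if doc_id in seg.docs:
                seg.deleted.add(doc_id)

    def merge_all(self):
        # A merge copies live docs into one new segment and drops the
        # old ones -- this is the step that rewrites data on disk.
        live = {}
        for seg in self.segments:
            for doc_id, content in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = content
        self.segments = [Segment(live)]
        return live
```

In this model a write-heavy workload with constant merges touches most segments quickly, which matches Rishi's observation that nearly the whole index gets copied during recovery.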
Re: Multiple index.timestamp directories using up disk space
Yes, data loss is the concern. If the recovering replica is not able to retrieve the files from the leader, it at least has an older copy. Also, the entire index is not fetched from the leader, only the segments which have changed. The replica initially gets the file list from the leader, checks it against what it has, and then downloads the difference -- then moves it to the main index. Note that this process can fail sometimes (say due to I/O errors, or due to a problem with the leader itself), in which case the replica drops all accumulated files from the leader and starts from scratch. If that happens, it needs to look back at its old index again to figure out what it needs to download on the next attempt. Maybe, with a fair number of assumptions which should usually hold good, you could still come up with a mechanism to drop existing files first, but those assumptions won't hold in case of serious issues with the cloud, and you could end up losing data. That's worse than using a bit more disk space!

On 4 May 2015 11:56, Rishi Easwaran rishi.easwa...@aol.com wrote:

Thanks for the responses Mark and Ramkumar. The question I had was: why does Solr need 2 copies at any given time, leading to 2x disk space usage? This information doesn't seem to be published anywhere, which makes HW estimation almost impossible for a large-scale deployment. Even if the copies are temporary, this becomes really expensive, especially when using SSDs in production, when the complex holds over 400TB of indexes running 1000s of SolrCloud shards. If a Solr follower has decided that it needs to replicate from the leader and capture a full copy snapshot, why can't it delete the old information and replicate from scratch, not requiring more disk space? Is the concern data loss (a case where both leader and follower lose data)? Thanks, Rishi.
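The file-list diff that Ramkumar describes can be sketched as follows. This is my own illustrative Python, not actual Solr code (real replication also compares file sizes and stages downloads in a temporary directory before swapping them in); here each file list is simplified to a filename-to-checksum map:

```python
def files_to_fetch(leader_files, local_files):
    """Return the leader's files that the replica is missing or holds
    stale copies of; only these need to be downloaded.

    leader_files / local_files: dict of filename -> checksum.
    """
    return {
        name for name, checksum in leader_files.items()
        if local_files.get(name) != checksum
    }
```

Because segments are write-once, an unchanged file has an identical checksum on both sides and is skipped — which is exactly why a mostly-stable index replicates cheaply, while a heavily churned one ends up re-downloading almost everything.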
-----Original Message-----
From: Mark Miller markrmil...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tue, Apr 28, 2015 10:52 am
Subject: Re: Multiple index.timestamp directories using up disk space

If copies of the index are not eventually cleaned up, I'd file a JIRA to address the issue. Those directories should be removed over time. At times there will have to be a couple around at the same time, and others may take a while to clean up. - Mark

On Tue, Apr 28, 2015 at 3:27 AM Ramkumar R. Aiyengar andyetitmo...@gmail.com wrote:

SolrCloud does need up to twice the amount of disk space as your usual index size during replication. Amongst other things, this ensures you have a full copy of the index at any point. There's no way around this; I would suggest you provision the additional disk space needed.

On 20 Apr 2015 23:21, Rishi Easwaran rishi.easwa...@aol.com wrote:

Hi All, We are seeing this problem with Solr 4.6 and Solr 4.10.3. For some reason, SolrCloud tries to recover and creates a new index directory (e.g. index.20150420181214550) while keeping the older index as is. This creates an issue where the disk space fills up and the shard never ends up recovering. Usually this requires a manual intervention of bouncing the instance and wiping the disk clean to allow for a clean recovery. Any ideas on how to prevent Solr from creating multiple copies of the index directory? Thanks, Rishi.
Re: Slow highlighting on Solr 5.0.0
I'm seeing the same with Solr 5.1.0 after upgrading from 4.10.2. Here are my timings:

4.10.2: process: 1432.0, highlight: 723.0
5.1.0: process: 9570.0, highlight: 8790.0

schema.xml and solrconfig.xml are available at https://github.com/NatLibFi/NDL-VuFind-Solr/tree/master/vufind/biblio/conf. A couple of jstack outputs taken while the query was executing are available at http://pastebin.com/eJrEy2Wb

Any suggestions would be appreciated. Or would it make sense to just file a JIRA issue? --Ere

On 3.3.2015 at 0.48, Matt Hilt wrote:

Short form: While testing Solr 5.0.0 within our staging environment, I noticed that highlight-enabled queries are much slower than I saw with 4.10. Are there any obvious reasons why this might be the case? As far as I can tell, nothing has changed with the default highlight search component or its parameters.

A little more detail: The bulk of the collection config set was stolen from the basic 4.X example config set. I changed my schema.xml and solrconfig.xml just enough to get 5.0 to create a new collection (removed non-trie fields, some other deprecated response handler definitions, etc.). I can provide my version of the solr.HighlightComponent config, but it is identical to the sample_techproducts_configs example in 5.0. Are there any other config files I could provide that might be useful?

Numbers on "much slower": I indexed a very small subset of my data into the new collection and used the /select interface to do a simple debug query. Solr 4.10 gives the following pertinent info:

response: { numFound: 72628, ... }
debug: {
  timing: {
    time: 95,
    process: {
      time: 94,
      query: { time: 6 },
      highlight: { time: 84 },
      debug: { time: 4 }
    }
  }
}

Whereas Solr 5.0 is:

response: { numFound: 1093, ... }
debug: {
  timing: {
    time: 6551,
    process: {
      time: 6549,
      query: { time: 0 },
      highlight: { time: 6524 },
      debug: { time: 25 }
    }
  }
}

-- Ere Maijala, Kansalliskirjasto / The National Library of Finland
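When comparing such debug=true outputs, a tiny helper makes the regression easy to quantify. This is my own sketch, not part of Solr — it just reads the "timing" section of the debug response and reports what fraction of processing time went to the highlight component:

```python
def highlight_share(timing):
    """Fraction of total process time spent in the highlight component,
    given the 'timing' dict from a debug=true Solr response."""
    process = timing["process"]
    return process["highlight"]["time"] / process["time"]

# The two timing sections quoted above, as Python dicts:
solr_4_10 = {"time": 95, "process": {"time": 94,
             "query": {"time": 6}, "highlight": {"time": 84},
             "debug": {"time": 4}}}
solr_5_0 = {"time": 6551, "process": {"time": 6549,
            "query": {"time": 0}, "highlight": {"time": 6524},
            "debug": {"time": 25}}}
```

For both versions highlighting dominates process time, but in 5.0 the absolute highlight time blew up from 84 ms to 6524 ms on a far smaller result set, which points at the highlighter itself rather than query execution.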
Proximity searching in percentage
Hi, I would like to check: how do we implement character proximity searching in terms of a percentage of the word's length, instead of a fixed edit distance in characters? For example, with a proximity of 20%, a word with 5 characters would have an edit distance of 1, and a word with 10 characters would automatically have an edit distance of 2. Will Solr be able to do that for us? Regards, Edwin
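As far as I know, Lucene's fuzzy operator only accepts a fixed edit distance (word~1 or word~2, capped at 2 edits), so a percentage-based distance has to be computed client-side per term before the query is built. A minimal sketch assuming the 20% ratio from the question:

```python
import math

def fuzzy_term(word, pct=0.2, max_edits=2):
    """Turn a word into a Solr/Lucene fuzzy term whose edit distance is
    pct of its length, capped at Lucene's maximum of 2 edits."""
    edits = min(max_edits, int(math.floor(len(word) * pct)))
    return f"{word}~{edits}" if edits > 0 else word
```

So a 5-character word becomes `word~1` and a 10-character word becomes `word~2`, while very short words get no fuzziness at all — which is usually what you want, since one edit in a 3-letter word changes a third of it.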
Solr Exception The remote server returned an error: (400) Bad Request.
Hi, I am having some difficulty working out which exception I am getting on my client for some queries. Malformed queries always come back to my SolrNet client as "The remote server returned an error: (400) Bad Request." Internally, Solr is actually logging issues like "undefined field fieldName". Do you have any idea about getting more detailed info into the HTTP response? Thanks, Sergio

--
View this message in context: http://lucene.472066.n3.nabble.com/Solr-Exception-The-remote-server-returned-an-error-400-Bad-Request-tp4203889.html
Sent from the Solr - User mailing list archive at Nabble.com.
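In my experience Solr does put the detail in the body of the 400 response (with wt=json it is an "error" object containing "msg" and "code"), so the fix on the client side is to read the response body instead of only the status line. A minimal sketch of parsing such a body — the sample mirrors the log message quoted above, and the exact structure may vary by Solr version and response writer:

```python
import json

def solr_error_detail(response_body):
    """Extract the human-readable message and code from a Solr JSON
    error body, e.g. the body of an HTTP 400 response."""
    data = json.loads(response_body)
    err = data.get("error", {})
    return err.get("msg", "unknown error"), err.get("code")

# Sample 400 body (structure as produced by the JSON response writer):
body = '{"error": {"msg": "undefined field fieldName", "code": 400}}'
```

In SolrNet/.NET terms the equivalent is catching the WebException and reading its response stream rather than just the exception message, but I have not verified how SolrNet surfaces that.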
Limit Results By Score?
Hi, We have implemented a custom scoring function and also need to limit the results by score. How could we go about that? Alternatively, can we suppress the results early using some kind of custom filter? --Johannes -- Dr. Johannes Ruscheinski Universitätsbibliothek Tübingen - IT-Abteilung - Wilhelmstr. 32, 72074 Tübingen Tel: +49 7071 29-72820 FAX: +49 7071 29-5069 Email: johannes.ruschein...@uni-tuebingen.de
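One commonly suggested approach (worth verifying against your Solr version) is a frange filter that evaluates the main query as a function and keeps only documents whose score clears a lower bound. A sketch of building such a request client-side — note that raw Lucene scores are not normalized across queries, so a fixed threshold is fragile:

```python
from urllib.parse import urlencode

def score_limited_params(user_query, min_score):
    """Build Solr request params that filter results by score:
    query($q) yields each document's score for the main query, and
    frange's l= keeps docs at or above the lower bound."""
    return {
        "q": user_query,
        "fq": f"{{!frange l={min_score}}}query($q)",
    }

params = score_limited_params("title:solr", 0.5)
query_string = urlencode(params)  # ready to append to /select?
```

The alternative mentioned in the question — suppressing results early via a custom filter or PostFilter — avoids the double evaluation of the query, but requires writing a Solr plugin in Java.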
Re: Solr Exception The remote server returned an error: (400) Bad Request.
Take a look at the query parameters and use debug and/or explain: https://wiki.apache.org/solr/CommonQueryParameters Also, perhaps change the parser from the default one to the less stringent dismax. Hard to say what fits your case as I don't know it, but those two are the best starting points I know of.

Regards, LAFK

2015-05-05 10:39 GMT+02:00 marotosg marot...@gmail.com:

Hi, I am having some difficulties knowing which one is the exception I am having on my client for some queries. Queries malformed are always coming back to my solrNet client as The remote server returned an error: (400) Bad Request.. Internally Solr is actually printing the log issues like undefined field fieldName. Do your have any idea about getting more detailed info into the http response? Thanks, Sergio