Re: prefix length in fuzzy search solr 4.10.1
ok, thanks for the answer. best regards, Elisabeth 2014-10-31 22:04 GMT+01:00 Jack Krupansky j...@basetechnology.com: No, but it is a reasonable request, as a global default, a collection-specific default, a request-specific default, and on an individual fuzzy term. -- Jack Krupansky -Original Message- From: elisabeth benoit Sent: Thursday, October 30, 2014 6:07 AM To: solr-user@lucene.apache.org Subject: prefix length in fuzzy search solr 4.10.1 Hello all, Is there a parameter in the solr 4.10.1 api allowing the user to set the prefix length in fuzzy search? Best regards, Elisabeth
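For background on what prefix length means here: Lucene's fuzzy matching can require the first N characters of a term to match exactly, so only the tail is matched by edit distance, which sharply cuts the number of candidate terms examined. A toy Python sketch of the idea (`fuzzy_match` is an illustrative helper, not a Solr or Lucene API):

```python
def fuzzy_match(term, candidate, max_edits=2, prefix_length=2):
    """Match candidate against term within max_edits, requiring an exact
    prefix of prefix_length characters first (toy illustration only)."""
    if candidate[:prefix_length] != term[:prefix_length]:
        return False  # prefix mismatch: skip the expensive edit-distance check
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(candidate) + 1))
    for i, a in enumerate(term, 1):
        cur = [i]
        for j, b in enumerate(candidate, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1] <= max_edits

print(fuzzy_match("paris", "parys"))  # one edit, prefix "pa" matches
print(fuzzy_match("paris", "maris"))  # prefix differs, rejected early
```

On a large index the prefix requirement is what keeps fuzzy queries from expanding into huge numbers of terms, which is why a configurable default would be useful.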
Re: Consul instead of ZooKeeper anyone?
Hello Greg, Consul and Zookeeper are quite similar in what they offer with respect to what SolrCloud needs. Service discovery, watches on distributed cluster state, and updates of configuration could all be handled through Consul. Plus, Consul offers built-in capabilities for multi-datacenter scenarios and encryption. Also, the capability to query Consul via DNS, i.e., without any client-side library requirements, is quite compelling. One could integrate Java, C/C++, C#/.NET, Python, Ruby and other types of clients without much effort. The largest benefit, however, I would see for the zoo of services around Solr. At least in my experience, SolrCloud for serious applications is never deployed by itself. There will be numerous services for data collection, semantic processing, log management, monitoring, administration, reporting and user front-ends around the core SolrCloud. This zoo is hard to manage, and especially the coordination of configuration and cluster consistency is difficult. Consul could help here as it comes from the more operations-type level of managing an elastic set of services in data centers. So, after singing the praises, why have I not started using Consul then? :-) First and foremost: Zookeeper from the Hadoop/Apache ecosystem is already integrated with SolrCloud. Ripping it out and replacing it with something similar but not quite the same would require significant effort, esp. for testing this thoroughly. My clients are not willing to pay for basic groundwork. Second: Consul looks nice, but the documentation leaves many questions open. Once you start setting it up, there will be questions where you have to dive into the code for answers. Consul does not give me the same mature impression as Zookeeper. So, I am still using our own service management framework for the zoo of services in typical search clouds. Consul is young, however, and may evolve. 
The version is 0.4.1, and I don't use anything with a zero in front to manage a serious customer infrastructure. Would you trust a customer's 50-100 TB of source data to a set of SolrClouds based on a 0.x Consul? ;-) Third: Consul lacks decent integration with log management. In any distributed environment, you don't just want to keep a snapshot of the moment, but rather a possibly long history of state changes and statistics, so there is a chance not just to monitor, but also to act. In that respect, we would need more cloud-management recipes integrated, without having to pull out the entire Puppet or Chef stack that will come with its own view of the world. That again is a topic of maturity and being fit for real-life requirements. I would love to see Consul evolve into that type of lightweight cloud management with basic services integrated. But: some way to go still. There are other issues, but these are the major ones from my perspective. So, the concept is nice, Hashimoto et al. are known to be creative heads, and therefore I will keep watching what's happening there, but I won't use Consul for any real customer projects yet - not even the part that is not SolrCloud-dependent. Best regards, --Jürgen On 01.11.2014 00:08, Greg Solovyev wrote: I am investigating a project to make SolrCloud run on Consul instead of ZooKeeper. So far, my research revealed no such efforts, but I wanted to check with this list to make sure I am not going to be reinventing the wheel. Has anyone attempted using Consul instead of ZK to coordinate SolrCloud nodes? Thanks, Greg
Re: Sharding configuration
On 30 Oct 2014 23:46, Erick Erickson erickerick...@gmail.com wrote: This configuration deals with all the replication, NRT processing, self-repair when nodes go up and down and all that, but since there's no second trip to get the docs from shards your query performance won't be affected. More or less. I vaguely recall that you still would need to add a shortCircuit parameter to the URL in such a case to avoid a second trip. I might be wrong here, but I do recall wondering why that wasn't the default. And using SolrCloud with a single shard will essentially scale linearly as you add nodes for queries. Best, Erick On Thu, Oct 30, 2014 at 8:29 AM, Anca Kopetz anca.kop...@kelkoo.com wrote: Hi, You are right, it is a mistake in my phrase, for the tests with 4 shards / 4 instances, the latency was worse (therefore *bigger*) than for the tests with one shard. In our case, the query rate is high. Thanks, Anca On 10/30/2014 03:48 PM, Shawn Heisey wrote: On 10/30/2014 4:32 AM, Anca Kopetz wrote: We did some tests with 4 shards / 4 different tomcat instances on the same server and the average latency was smaller than the one when having only one shard. We also tested with shards spread across different servers, and the performance results were also worse. It seems that the sharding does not make any difference for our index in terms of latency gains. That statement is confusing, because if latency goes down, that's good, not worse. If you're going to put multiple shards on one server, it should be done with one solr/tomcat instance, not multiple. One instance is perfectly capable of dealing with many shards, and has a lot less overhead. The SolrCloud collection create command would need the maxShardsPerNode parameter. In order to see a gain in performance from multiple shards per server, the server must have a lot of CPUs and the query rate must be fairly low. 
If the query rate is high, then all the CPUs will be busy just handling simultaneous queries, so putting multiple shards per server will probably slow things down. When query rate is low, multiple CPUs can handle each shard query simultaneously, speeding up the overall query. Thanks, Shawn
Re: Sharding configuration
On 30 Oct 2014 14:49, Shawn Heisey apa...@elyograg.org wrote: In order to see a gain in performance from multiple shards per server, the server must have a lot of CPUs and the query rate must be fairly low. If the query rate is high, then all the CPUs will be busy just handling simultaneous queries, so putting multiple shards per server will probably slow things down. When query rate is low, multiple CPUs can handle each shard query simultaneously, speeding up the overall query. Except that your query latency isn't always CPU-bound; there's a significant IO-bound portion as well. I wouldn't go so far as to say that with large query volumes you shouldn't use multiple shards -- it finally comes down to how many shards a machine can handle under peak load, and it could depend on CPU/IO/GC pressure. We have multiple shards on a machine under heavy query load, for example. The only real way is to benchmark this and see. Thanks, Shawn
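As the last message says, the only real way to settle shards-per-machine questions is to benchmark under realistic load. A minimal sketch of measuring latency percentiles under concurrent load; `run_query` is a stand-in for a real Solr request, and the request count and concurrency are arbitrary:

```python
import concurrent.futures
import time

def benchmark(run_query, n_requests=200, concurrency=8):
    """Fire n_requests through a thread pool and report latency percentiles (ms)."""
    def timed(_):
        t0 = time.perf_counter()
        run_query()
        return (time.perf_counter() - t0) * 1000.0

    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, range(n_requests)))

    def pct(p):  # nearest-rank percentile over the sorted latencies
        return latencies[min(len(latencies) - 1, int(p / 100.0 * len(latencies)))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

# Stand-in query; replace the lambda with an HTTP request to your Solr endpoint.
stats = benchmark(lambda: time.sleep(0.001))
print(stats)
```

Running this against each candidate layout (one shard per box, multiple shards per box) while ramping `concurrency` toward peak load gives the CPU/IO/GC picture the thread is describing.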
Re: How to update SOLR schema from continuous integration environment
In all honesty, incrementally updating resources of a production server is a rather frightening proposition. Parallel testing is always a better way to go - bring up any changes in a parallel system for testing and then do an atomic swap - redirection of requests from the old server to the new server and then retire the old server only after the new server has had enough time to burn in and get past any infant mortality problems. That's production. Testing and dev? Who needs the hassle; just tear the old server down and bring up the new server from scratch with all resources updated from the get-go. Oh, and the starting point would be keeping your full set of config and resource files under source control so that you can carefully review changes before they are pushed, can compare different revisions, and can easily back out a revision with confidence rather than winging it. That said, a lot of production systems these days are not designed for parallel operation and swapping out parallel systems, especially for cloud and cluster systems. In these cases the reality is more of a rolling update, where one node at a time is taken down, updated, brought up, tested, brought back into production, tested some more, and only after enough burn in time do you move to the next node. This rolling update may also force you to sequence or stage your changes so that old and new nodes are at least relatively compatible. So, the first stage would update all nodes, one at a time, to the intermediate compatible change, and only when that rolling update of all nodes is complete would you move up to the next stage of the update to replace the intermediate update with the final update. And maybe more than one intermediate stage is required for more complex updates. 
Some changes might involve upgrading Java jars as well, in a way that might cause nodes to give incompatible results, in which case you may need to stage or sequence your Java changes as well, so that you don't make the final code change until you have verified that all nodes have compatible intermediate code that is compatible with both old nodes and new nodes. Of course, it all depends on the nature of the update. For example, adding more synonyms may or may not be harmless with respect to whether existing index data becomes invalidated and each node needs to be completely reindexed, or if query-time synonyms are incompatible with index-time synonyms. Ditto for just about any analysis chain changes - they may be harmless, they may require full reindexing, they may simply not work for new data (i.e., a synonym is added in response to late-breaking news or an addition to a taxonomy) until nodes are updated, or maybe some queries become slightly or somewhat inaccurate until the update/reindex is complete. So, you might want to have two stages of test system - one to just do a raw functional test of the changes, like whether your new synonyms work as expected or not, and then the pre-production stage which would be updated using exactly the same process as the production system, such as a rolling update or staged rolling update as required. The closer that pre-production system is run to the actual production, the greater the odds that you can have confidence that the update won't compromise the production system. The pre-production test system might have, say, 10% of the production data and be only 10% the size of the production system. In short, for smaller clusters having parallel systems with an atomic swap/redirection is probably simplest, while for larger clusters an incremental rolling update with thorough testing on a pre-production test cluster is the way to go. 
-- Jack Krupansky -Original Message- From: Faisal Mansoor Sent: Saturday, November 1, 2014 12:10 AM To: solr-user@lucene.apache.org Subject: How to update SOLR schema from continuous integration environment Hi, How do people usually update Solr configuration files from a continuous integration environment like TeamCity or Jenkins? We have multiple development and testing environments and use WebDeploy and AwsDeploy type tools to remotely deploy code multiple times a day. To update Solr, I wrote a simple node server which accepts a conf folder over http, updates the specified core's conf folder and restarts the Solr service. Does there exist a standard tool for this use case? I know about the schema REST API, but I want to update all the files in the conf folder rather than just updating a single file or adding or removing synonyms piecemeal. Here is the link for the node server I mentioned if anyone is interested. https://github.com/faisalmansoor/UpdateSolrConfig Thanks, Faisal
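The staged rolling update described above can be sketched as a driver loop. All four hooks here (`remove`, `update`, `smoke_test`, `restore`) are hypothetical placeholders for your load-balancer drain, config push, and health checks, not a real deployment API:

```python
def rolling_update(nodes, stages, remove, update, smoke_test, restore):
    """Apply each stage to every node in turn, only advancing to the next
    stage once all nodes carry the current (compatible) intermediate config."""
    for stage in stages:
        for node in nodes:
            remove(node)              # take the node out of rotation
            update(node, stage)       # push this stage's config to the node
            if not smoke_test(node):  # verify before re-enabling traffic
                raise RuntimeError(f"stage {stage!r} failed on {node}")
            restore(node)             # put the node back into rotation

# Dry run with no-op hooks, recording what would be pushed where.
applied = []
rolling_update(
    nodes=["solr1", "solr2"],
    stages=["intermediate-conf", "final-conf"],
    remove=lambda n: None,
    update=lambda n, s: applied.append((n, s)),
    smoke_test=lambda n: True,
    restore=lambda n: None,
)
print(applied)
```

The two-stage list is the point: every node gets the intermediate, compatible config before any node gets the final one, exactly as the email describes.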
RE: How to update SOLR schema from continuous integration environment
http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing -Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Saturday, November 01, 2014 9:46 AM To: solr-user@lucene.apache.org Subject: Re: How to update SOLR schema from continuous integration environment
Re: Ideas for debugging poor SolrCloud scalability
Erick, Just to make sure I am thinking about this right: batching will certainly make a big difference in performance, but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Right now in my load tests, I'm not actually that concerned about the absolute performance numbers; instead I'm just trying to figure out why relative performance (no matter how bad it is since I am not batching) does not go up with more Solr nodes. Once I get that part figured out and we are seeing more writes per sec when we add nodes, then I'll turn on batching in the client to see what kind of additional performance gain that gets us. Cheers, Ian On Fri, Oct 31, 2014 at 3:43 PM, Peter Keegan peterlkee...@gmail.com wrote: Yes, I was inadvertently sending them to a replica. When I sent them to the leader, the leader reported (1000 adds) and the replica reported only 1 add per document. So, it looks like the leader forwards the batched jobs individually to the replicas. On Fri, Oct 31, 2014 at 3:26 PM, Erick Erickson erickerick...@gmail.com wrote: Internally, the docs are batched up into smaller buckets (10 as I remember) and forwarded to the correct shard leader. I suspect that's what you're seeing. Erick On Fri, Oct 31, 2014 at 12:20 PM, Peter Keegan peterlkee...@gmail.com wrote: Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports (1000 adds) in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually (12 adds). Why do the batches appear to be broken up? Peter On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson erickerick...@gmail.com wrote: NP, just making sure. I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations, if (1) you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. 
Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000. And just to check: you're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc; that's totally an anti-pattern. Doubly so for optimizes. Since you showed us your solrconfig autocommit settings I'm assuming not, but want to be sure. (2) Use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there. But you'll want to batch in this case too. On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose ianr...@fullstory.com wrote: Hi Erick - Thanks for the detailed response and apologies for my confusing terminology. I should have said WPS (writes per second) instead of QPS, but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote QPS I was referring to writes. It seems clear at this point that I should wrap up the code to do smart routing rather than choose Solr nodes randomly, and then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here, so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings. Cheers, - Ian p.s. To clarify why we are rolling our own smart router code, we use Go over here rather than Java. Although if we still get bad performance with our custom Go router, I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation. 
On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson erickerick...@gmail.com wrote: I'm really confused: bq: I am not issuing any queries, only writes (document inserts) bq: It's clear that once the load test client has ~40 simulated users bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right QPS is usually used to mean Queries Per Second, which is different from the statement that I am not issuing any queries. And what do the number of users have to do with inserting documents? You also state: In many cases, CPU on the solr servers is quite low as well So let's talk about indexing first. Indexing should scale nearly linearly as long as (1) you are routing your docs to the correct leader, which
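The batching advice in this thread amounts to buffering documents client-side and shipping them in groups rather than one per request; a sketch using the batch size of 1,000 suggested above (the document shape is an arbitrary example):

```python
def batches(docs, size=1000):
    """Yield successive batches so each update request carries `size` docs
    instead of one, amortizing per-request overhead."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # trailing partial batch

docs = ({"id": str(i)} for i in range(2500))
sizes = [len(b) for b in batches(docs)]
print(sizes)  # [1000, 1000, 500]
```

Each yielded batch would then go out as a single update request, with commits left to the server's autoCommit settings as discussed earlier in the thread.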
Re: How to update SOLR schema from continuous integration environment
Nice pictures, but that preso does not even begin to answer the question. With master/slave replication, I do schema migration in two ways, depending on whether a field is added or removed. Adding a field: 1. Update the schema on the slaves. A defined field with no data is not a problem. 2. Update the master. 3. Reindex to populate the field and wait for replication. 4. Update the request handlers or clients to use the new field. Removing a field is the opposite. I haven't tried lately, but Solr used to have problems with a field that was in the index but not in the schema. 1. Update the request handlers and clients to stop using the field. 2. Reindex without any data for the field that will be removed, wait for replication. 3. Update the schema on the master and slaves. I have not tried to automate this for continuous deployment. It isn't a big deal for a single server test environment. It is the prod deployment that is tricky. wunder Walter Underwood wun...@wunderwood.org http://observer.wunderwood.org/ On Nov 1, 2014, at 7:29 AM, Will Martin wmartin...@gmail.com wrote: http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing
Missing log entries with log4j log rotation
There appear to be large blocks of time missing in my solr logfiles created with slf4j-log4j and rotated using the log4j config: End of solr.log.1: INFO - 2014-10-31 12:52:25.073; Start of solr.log: INFO - 2014-11-01 02:27:27.404; End of solr.log.2: INFO - 2014-10-29 06:30:32.661; Start of solr.log.1: INFO - 2014-10-30 07:01:34.241; Queries happen at a fairly constant low level and updates happen once a minute, so I know for sure that there is activity during the missing blocks of time. I need to investigate a problem that occurred during the time that is not logged, which means I have nothing to investigate. This is the log4j configuration that I'm using: http://apaste.info/9vC These are the logging jars that I have in jetty's lib/ext: -rw-r--r-- 1 ncindex ncindex 16515 Apr 11 2014 jcl-over-slf4j-1.7.6.jar -rw-r--r-- 1 ncindex ncindex 4959 Apr 11 2014 jul-to-slf4j-1.7.6.jar -rw-r--r-- 1 ncindex ncindex 489883 Apr 11 2014 log4j-1.2.17.jar -rw-r--r-- 1 ncindex ncindex 28688 Apr 11 2014 slf4j-api-1.7.6.jar -rw-r--r-- 1 ncindex ncindex 8869 Apr 11 2014 slf4j-log4j12-1.7.6.jar Is this a bug, or have I done something wrong in my config? Should I be putting this on the log4j mailing list instead of here? My best guess about how this is happening is that an entire logfile is getting deleted during rotation. Thanks, Shawn
Re: Missing log entries with log4j log rotation
On 11/1/2014 11:45 AM, Shawn Heisey wrote: Is this a bug, or have I done something wrong in my config? Should I be putting this on the log4j mailing list instead of here? My best guess about how this is happening is that an entire logfile is getting deleted during rotation. I did find this blog post describing a similar problem with a different Appender: http://vivekagarwal.wordpress.com/2008/02/09/missing-log4j-log-files-with-dailyrollingfileappender-when-they-should-roll-over/ I'm not running on Windows, I'm on Linux, which normally does not have problems with renaming files even when they are open. My logfiles where I redirect stdout and stderr from Jetty don't show anything related, and I don't see anything like the error mentioned in any of the surviving logfiles from log4j. Thanks, Shawn
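For comparison, a minimal log4j 1.2 configuration using the size-triggered RollingFileAppender, which renames solr.log to solr.log.1 (and so on) when a size threshold is hit rather than on a date boundary like the DailyRollingFileAppender implicated in the linked post. The file path, size, and backup count here are assumptions; the conversion pattern matches the "INFO - 2014-10-31 12:52:25.073;" format shown in the logs above:

```properties
# log4j 1.2 sketch: size-based rotation via RollingFileAppender
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=logs/solr.log
log4j.appender.file.MaxFileSize=50MB
log4j.appender.file.MaxBackupIndex=9
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%-5p - %d{yyyy-MM-dd HH:mm:ss.SSS}; %m%n
```

Whether this avoids the gaps depends on the root cause; if the date-based appender was deleting whole files at rollover, a size-based appender sidesteps that path.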
Re: Ideas for debugging poor SolrCloud scalability
bq: but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Not really. You've stated that you're not driving Solr very hard in your tests; therefore you're waiting on I/O, and therefore your tests just aren't going to scale linearly with the number of shards. This is a simplification, but: your network utilization is pretty much irrelevant. I send a packet somewhere; that somewhere does some stuff and sends me back an acknowledgement, and while I'm waiting, the network is getting no traffic. If the network traffic was in the 90% range that would be different, so it's a good thing to monitor. Really, use a leader-aware client and rack enough clients together that you're driving Solr hard. Then double the number of shards. Then rack enough _more_ clients to drive Solr at the same level. In this case I'll go out on a limb and predict near 2x throughput increases. One additional note, though: when you add _replicas_ to shards, expect to see a drop in throughput that may be quite significant, 20-40% anecdotally... Best, Erick On Sat, Nov 1, 2014 at 9:23 AM, Shawn Heisey apa...@elyograg.org wrote: On 11/1/2014 9:52 AM, Ian Rose wrote: Just to make sure I am thinking about this right: batching will certainly make a big difference in performance, but it should be more or less a constant factor no matter how many Solr nodes you are using, right? Right now in my load tests, I'm not actually that concerned about the absolute performance numbers; instead I'm just trying to figure out why relative performance (no matter how bad it is since I am not batching) does not go up with more Solr nodes. Once I get that part figured out and we are seeing more writes per sec when we add nodes, then I'll turn on batching in the client to see what kind of additional performance gain that gets us. The basic problem I see with your methodology is that you are sending an update request and waiting for it to complete before sending another. 
No matter how big the batches are, this is an inefficient use of resources. If you send many such requests at the same time, then they will be handled in parallel. Lucene (and by extension, Solr) has the thread synchronization required to keep multiple simultaneous update requests from stomping on each other and corrupting the index. If you have enough CPU cores, such handling will *truly* be in parallel, otherwise the operating system will just take turns giving each thread CPU time. This results in a pretty good facsimile of parallel operation, but because it splits the available CPU resources, isn't as fast as true parallel operation. Thanks, Shawn
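Shawn's point about overlapping requests can be sketched as follows: keep several update requests in flight at once instead of waiting for each to complete. `send_batch` here is a stand-in for the real HTTP update call, and the in-flight limit is an arbitrary example:

```python
import concurrent.futures

def parallel_index(batches, send_batch, max_in_flight=4):
    """Send update batches concurrently instead of one request at a time,
    so the client is not idle while the server processes each request."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        # map preserves input order in its results
        return list(pool.map(send_batch, batches))

sent = parallel_index(
    batches=[["doc1", "doc2"], ["doc3"], ["doc4", "doc5"]],
    send_batch=lambda b: len(b),  # stand-in: pretend-send, return batch size
)
print(sent)  # [2, 1, 2]
```

Combined with batching, this is what keeps enough simultaneous work on the server for the CPU cores to be used in parallel, as described above.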
RE: How to update SOLR schema from continuous integration environment
Well, yes. But since there haven't been any devops approaches yet, we really aren't talking about Continuous Delivery. Continually delivering builds into production is old hat, and Jack nailed the canonical manners in which it has been done. It really depends on whether an org is investing in the full Agile lifecycle. A piece at a time is common.

One possible devops approach, once you get near full test automation:

: Jenkins builds the target
: chef does due diligence on dependencies
: chef pulls the build over
: chef configures the build once it is installed
: chef takes the machine out of the load balancer's rotation
: chef puts the machine back in once it is launched and sanity tested (by chef)

(or puppet, or any others I'm not familiar with)

If you substitute Jack's plan, you get pretty much the same thing; except that by using devops tools you introduce a little thing called idempotency.

-Original Message- From: Walter Underwood [mailto:wun...@wunderwood.org] Sent: Saturday, November 01, 2014 12:25 PM To: solr-user@lucene.apache.org Subject: Re: How to update SOLR schema from continuous integration environment

Nice pictures, but that preso does not even begin to answer the question.

With master/slave replication, I do schema migration in two ways, depending on whether a field is added or removed.

Adding a field:
1. Update the schema on the slaves. A defined field with no data is not a problem.
2. Update the master.
3. Reindex to populate the field and wait for replication.
4. Update the request handlers or clients to use the new field.

Removing a field is the opposite. I haven't tried lately, but Solr used to have problems with a field that was in the index but not in the schema.
1. Update the request handlers and clients to stop using the field.
2. Reindex without any data for the field that will be removed, wait for replication.
3. Update the schema on the master and slaves.

I have not tried to automate this for continuous deployment.
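Under a managed schema (available since Solr 4.4), the add-a-field sequence above can be sketched against Solr's Schema API; with a hand-edited schema.xml you would push the file and reload cores instead. This is a hedged sketch, not Walter's actual tooling: the host names, collection name, port, and field are all hypothetical.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical collection and port.
SCHEMA_URL = "http://{host}:8983/solr/mycollection/schema"

def add_field_command(name, field_type, stored=True):
    """Build the Schema API payload that defines one new field."""
    return {"add-field": {"name": name, "type": field_type, "stored": stored}}

def post_schema_change(host, command):
    """POST a schema command to one node."""
    req = Request(SCHEMA_URL.format(host=host),
                  data=json.dumps(command).encode("utf-8"),
                  headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)

def migrate_add_field(slaves, master, name, field_type, reindex,
                      post=post_schema_change):
    """Apply the four-step add-a-field migration in order."""
    cmd = add_field_command(name, field_type)
    for slave in slaves:   # 1. slaves first: a defined field with no data is harmless
        post(slave, cmd)
    post(master, cmd)      # 2. then the master
    reindex()              # 3. reindex to populate; replication carries the data out
    # 4. only now switch request handlers / clients over to the new field
```

Injecting the `post` and `reindex` callables keeps the step ordering testable without a live cluster, which is most of what a CI pipeline needs to verify.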
It isn't a big deal for a single-server test environment. It is the prod deployment that is tricky.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Nov 1, 2014, at 7:29 AM, Will Martin wmartin...@gmail.com wrote:

http://www.thoughtworks.com/insights/blog/enabling-continuous-delivery-enterprises-testing

-Original Message- From: Jack Krupansky [mailto:j...@basetechnology.com] Sent: Saturday, November 01, 2014 9:46 AM To: solr-user@lucene.apache.org Subject: Re: How to update SOLR schema from continuous integration environment

In all honesty, incrementally updating resources of a production server is a rather frightening proposition. Parallel testing is always a better way to go: bring up any changes in a parallel system for testing and then do an atomic swap (redirection of requests from the old server to the new server), retiring the old server only after the new server has had enough time to burn in and get past any infant-mortality problems. That's production. Testing and dev? Who needs the hassle; just tear the old server down and bring up the new server from scratch with all resources updated from the get-go.

Oh, and the starting point would be keeping your full set of config and resource files under source control so that you can carefully review changes before they are pushed, can compare different revisions, and can easily back out a revision with confidence rather than winging it.

That said, a lot of production systems these days are not designed for parallel operation and swapping out parallel systems, especially for cloud and cluster systems. In these cases the reality is more of a rolling update, where one node at a time is taken down, updated, brought up, tested, brought back into production, tested some more, and only after enough burn-in time do you move to the next node. This rolling update may also force you to sequence or stage your changes so that old and new nodes are at least relatively compatible.
So the first stage would update all nodes, one at a time, to the intermediate compatible change, and only when that rolling update of all nodes is complete would you move up to the next stage, replacing the intermediate update with the final update. And maybe more than one intermediate stage is required for more complex updates.

Some changes might involve upgrading Java jars as well, in a way that might cause nodes to give incompatible results, in which case you may need to stage or sequence your Java changes too, so that you don't make the final code change until you have verified that all nodes have intermediate code that is compatible with both old nodes and new nodes.

Of course, it all depends on the nature of the update. For example, adding more synonyms may or may not be harmless with respect to whether existing index data becomes invalidated and each node needs to be completely reindexed, or if query-time synonyms are
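The staged rolling update described above reduces to a small driver loop: finish each intermediate-compatible stage on every node before starting the next stage. This is a sketch under stated assumptions; `update_node` and `healthy` are hypothetical stand-ins for whatever deployment tooling (chef, puppet, plain scripts) is in use.

```python
def rolling_update(nodes, stages, update_node, healthy):
    """Roll each intermediate-compatible stage across every node before
    starting the next stage. update_node(node, stage) is expected to take
    the node out of rotation, update it, and bring it back; healthy(node)
    is the post-update sanity test."""
    for stage in stages:
        for node in nodes:            # one node at a time within a stage
            update_node(node, stage)
            if not healthy(node):     # stop rather than spread a bad build
                raise RuntimeError(
                    f"{node} failed after stage {stage!r}; halting rollout")
```

The breadth-first order (all nodes through stage N before any node sees stage N+1) is exactly what keeps old and new nodes mutually compatible during the rollout.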
Re: How to update SOLR schema from continuous integration environment
You do that with schema changes and I'll watch your site crash.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/

On Nov 1, 2014, at 8:31 PM, Will Martin wmartin...@gmail.com wrote:

Well, yes. But since there haven't been any devops approaches yet, we really aren't talking about Continuous Delivery. Continually delivering builds into production is old hat, and Jack nailed the canonical manners in which it has been done. It really depends on whether an org is investing in the full Agile lifecycle. A piece at a time is common.

One possible devops approach, once you get near full test automation:

: Jenkins builds the target
: chef does due diligence on dependencies
: chef pulls the build over
: chef configures the build once it is installed
: chef takes the machine out of the load balancer's rotation
: chef puts the machine back in once it is launched and sanity tested (by chef)

(or puppet, or any others I'm not familiar with)

If you substitute Jack's plan, you get pretty much the same thing; except that by using devops tools you introduce a little thing called idempotency.