How to insert documents into different indexes
Hi, I've set up a Solr instance with multiple cores to be able to use different indexes for different applications. The point I'm struggling with is: how do I insert documents into the index running on a specific core? Any clue appreciated. best -- tomw t...@ubilix.com
Re: How to insert documents into different indexes
Hello! Just use the update handler that is specific to a given core. For example, if you have two cores named core1 and core2, you should use the following addresses (if you didn't change the default configuration): /solr/core1/update/ and /solr/core2/update/ -- Regards, Rafał Kuć Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch Hi, I've set up a Solr instance with multiple cores to be able to use different indexes for different applications. The point I'm struggling with is: how do I insert documents into the index running on a specific core? Any clue appreciated. best
Re: How to insert documents into different indexes
Just use the update handler that is specific to a given core. For example if you have two cores named core1 and core2, you should use the following addresses (if you didn't change the default configuration): /solr/core1/update/ and /solr/core2/update/ Thanks, that seems to work. Life can be so simple. Unfortunately this case isn't mentioned in any of the sections covering updates in the wiki.
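For readers finding this thread later, the per-core update addresses follow a simple pattern. A minimal sketch (the host and port are the example-jetty defaults and may differ in your setup):

```python
# Build core-specific update URLs for a multi-core Solr setup.
# Host/port are the example defaults; adjust for your install.
base = "http://localhost:8983/solr"

def update_url(core):
    """Return the update handler address for the given core."""
    return f"{base}/{core}/update"

for core in ("core1", "core2"):
    print(update_url(core))
```

A document POSTed (with curl, post.jar, or SolrJ) to /solr/core1/update lands only in core1's index; nothing else changes versus a single-core setup.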
Re: More references for configuring Solr
Hi, here are some resources: http://wiki.apache.org/solr/ (Solr wiki) http://lucene.apache.org/solr/books.html (books published on Solr) then go googling on a specific topic. But reading a book first might not be a bad idea. -- Dmitry On Sat, Nov 10, 2012 at 1:15 PM, FARAHZADI, EMAD emad.farahz...@netapp.com wrote: Dear Sir or Madam, I want to use Solr for my final project in university, in the part of searching and indexing. I'd appreciate it if you could send me more resources or documentation about Solr. Regards Emad Farahzadi Professional Services Consultant NetApp Middle-East Office: +971 4 4466203 Cell: +971 50 9197237 NetApp MEA (Middle East Africa) Office No. 214 Building 2, 2nd Floor Dubai Internet City P.O. Box 500199 Dubai, U.A.E. -- Regards, Dmitry Kan
Re: How to insert documents into different indexes
On 11 November 2012 15:06, tomw t...@ubilix.com wrote: [...] Thanks, that seems to work. Life can be so simple. Unfortunately this case isn't mentioned in any of the sections covering updates in the wiki. While this could be made clearer, it should not be very difficult to guess at the update URL for a specific core in a multi-core setup from the examples in http://wiki.apache.org/solr/CoreAdmin http://wiki.apache.org/solr/DataImportHandler#Full_Import_Example also mentions multiple cores in passing. Regards, Gora
Re: custom request handler
Only slaves are public facing and they are read only, with limited query request handlers defined. The above approach is to prevent abusive / inappropriate queries by clients. A query component sounds interesting - would this be implemented through an interface so it could be separate from Solr, or would it be subclassing a base component? cheers lee c On 9 November 2012 17:24, Amit Nithian anith...@gmail.com wrote: Lee, I guess my question was: if you are trying to prevent the big bad world from doing stuff they aren't supposed to in Solr, how are you going to prevent the big bad world from POSTing a delete-all query? Or restrict them from hitting the admin console, looking at the schema.xml, solrconfig.xml? I guess the question here is who is the big bad world? The internet at large, or employees/colleagues in your organization? If it's the internet at large then I'd totally decouple this from Solr b/c I want to be 100% sure that the *only* thing that the internet has access to is a GET on /select with some restrictions, and this could be done in many places, but it's not clear that coupling this to Solr is the place to do it. If the big bad world is just within your organization and you want some basic protections around what they can and can't see, then what you are saying is reasonable to me. Also, perhaps another option is to consider a query component rather than creating a subclass of the request handler, as a query component promotes more re-use and flexibility. You could make the necessary parameter changes in the prepare() method and just make sure that this safe-parameter component comes before the query component in the list of components for a handler, and you should be fine. Cheers! Amit On Fri, Nov 9, 2012 at 5:39 AM, Lee Carroll lee.a.carr...@googlemail.com wrote: Hi Amit I did not do this via a servlet filter as I wanted the solr devs to be concerned with solr config and keep them out of any concerns of the container. 
By specifying declarative data in a request handler, that would be enough to produce a service URI for an application. Or have I missed a point? We have several cores with several apps, all with different data query needs - maybe 20 request handlers needed to support this, with active development ongoing. Basically I want it to be easy for devs to create a specific request handler suited to their needs. I thought a servlet filter, developed and maintained every time, would be overkill. Again, though, I may have missed a point / over-emphasised a difficulty? Are you saying my custom request handler is too tightly bound to Solr? So the parameters my apps talk are not de-coupled enough from Solr? Lee C On 7 November 2012 19:49, Amit Nithian anith...@gmail.com wrote: Why not do this in a ServletFilter? Alternatively, I'd just write a front-end application servlet to do this so that you don't firewall your internal admins off from accessing the core Solr admin pages. I guess you could solve this using some form of security but I don't know this well enough. If I were to restrict access to certain parts of Solr, I'd do this outside of Solr itself and do it in a servlet or a filter, inspecting the parameters. It's easy to create a modifiable parameters class and populate that with acceptable parameters before the Solr filter operates on it. HTH Amit
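For what Lee describes - declarative, app-specific handlers kept inside Solr config - one option is the invariants list in solrconfig.xml: parameters pinned there cannot be overridden by clients. A sketch only; the handler name and parameter values here are made up for illustration:

```xml
<!-- solrconfig.xml: a restricted, app-specific query handler -->
<requestHandler name="/app1search" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- clients cannot override anything declared here -->
    <str name="defType">edismax</str>
    <str name="fl">id,title</str>
    <str name="rows">20</str>
  </lst>
</requestHandler>
```

Combined with exposing only such handlers on the public-facing slaves, the devs stay inside Solr config and never touch container-level filters.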
Internal Vs. External ZooKeeper
OK, I can't find a definitive answer on this. The wiki says not to use the embedded ZooKeeper servers for production. But my question is: why not? Basically, what are the reasons and circumstances that make you better off using an external ZooKeeper ensemble? Thanks... Nick
Re: Internal Vs. External ZooKeeper
Production typically implies high availability, and in a distributed system the goal is that the overall cluster integrity and performance should not be compromised just because a few worker nodes go down. Solr nodes do a lot of complex operations and are quite prone to running into issues that compromise their integrity and require that they be taken down, restarted, etc. In fact, taking down a bunch of Solr worker nodes should not be a big deal (unless they are all of the nodes/replicas from a single shard/slice), while taking down a bunch of zookeepers could be catastrophic to maintaining the integrity of the zookeeper ensemble. (OTOH, if every Solr node is also a zookeeper node, a bunch of Solr nodes would generally be less than a quorum, so maybe that is not an absolute issue per se.) Zookeeper nodes are categorically distinct in terms of their importance to maintaining the integrity and availability of the overall cluster. They are special in that sense. And they are special because they are maintaining the integrity of the cluster's configuration information. Even for large clusters their number will be relatively few compared to the many worker nodes (replicas), so zookeeper nodes need to be protected from the vagaries that can disrupt and take Solr nodes down, not the least of which is incoming traffic. I'm not sure what the implications would be if you had a large cluster and, because Zookeeper was embedded, you had a large number of zookeepers. Any of the inter-zookeeper operations would take longer and could be compromised by even a single busy/overloaded/dead Solr node. OTOH, the Zookeeper ensemble design is supposed to be able to handle a fair number of missing zookeeper nodes. OTOH, if high availability is not a requirement for a production cluster (use case?), then non-embedded zookeepers are certainly an annoyance. Maybe you could think of embedded zookeeper like every employee having their manager sitting right next to them all the time. 
How could that be anything but a bad idea in terms of maximizing worker output - and distracting/preventing managers from focusing on their own work? -- Jack Krupansky -Original Message- From: Nick Chase Sent: Sunday, November 11, 2012 7:12 AM To: solr-user@lucene.apache.org Subject: Internal Vs. External ZooKeeper [...]
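Jack's quorum point can be made concrete with a little arithmetic; this is just the standard ZooKeeper majority rule, not Solr-specific code:

```python
# ZooKeeper needs a strict majority of ensemble members to stay up.
def quorum(ensemble_size):
    """Smallest number of nodes that forms a majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many nodes can fail before the ensemble loses quorum."""
    return ensemble_size - quorum(ensemble_size)

for n in (3, 5, 7):
    print(f"ensemble={n} quorum={quorum(n)} tolerates={tolerated_failures(n)} failures")
```

So a 3-node ensemble survives one failure and a 5-node ensemble two; embedding ZK in every Solr node of a large cluster grows the ensemble, and with it the number of nodes that must acknowledge every configuration write.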
Re: Preventing accepting queries while custom QueryComponent starts up?
Is the issue here that the Solr node is continuously live with the load balancer, so that the moment during startup that Solr can respond to anything, the load balancer will be sending it traffic, and that this can occur while Solr is still warming up? First, shouldn't we be encouraging people to have an app layer between Solr and the outside world? If so, the app layer should simply not respond to traffic until it has verified that Solr has stabilized. If not, then maybe we do need to suggest a change to Solr so that the developer can control exactly when Solr becomes live and responsive to incoming traffic. At a minimum, we should document when that moment is today in terms of an explicit contract. It sounds like the problem is that the contract is either nonexistent, vague, ambiguous, non-deterministic, or whatever. -- Jack Krupansky -Original Message- From: Amit Nithian Sent: Saturday, November 10, 2012 4:24 PM To: solr-user@lucene.apache.org Subject: Re: Preventing accepting queries while custom QueryComponent starts up? Yeah, that's what I was suggesting in my response too. I don't think your load balancer should be doing this, but whatever script does the release (restarting the container) should, so that when the ping is enabled the warming has finished. On Sat, Nov 10, 2012 at 3:33 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, rather than hit the ping query, why not just send in a real query and only let the queued ones through after the response? 
Just a random thought Erick On Sat, Nov 10, 2012 at 2:53 PM, Amit Nithian anith...@gmail.com wrote: Yes, but the problem is that if user-facing queries are hitting a server that is warming up and aren't being serviced quickly, then you could potentially bring down your site if all the front-end threads are blocked on Solr queries b/c those queries are waiting (presumably at the container level, since the filter hasn't finished its init() sequence) for the warming to complete (this is especially notorious when your front end is Rails). This is why your ping to enable/disable a server from the load balancer has to be accurate with regards to whether or not a server is truly ready and warm. I think what I am gathering from this discussion is: the server is warming up, the ping goes through and tells the load balancer this server is ready, user queries hit this server and are queued waiting for the firstSearcher to finish (say these initial user queries are supposed to respond in 500-1000ms) - that's terrible for performance. Alternatively, if you have a bunch of servers behind a load balancer, you want this one server (or block of servers, depending on your deployment) to be reasonably sure that user queries will return in a decent time (whatever you define decent to be), hence why this matters. Let me know if I am missing anything. Thanks Amit On Sat, Nov 10, 2012 at 10:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why does it matter? The whole idea of firstSearcher queries is to warm up your system as fast as possible. The theory is that upon restarting the server, let's get this stuff going immediately... They were never intended (as far as I know) to complete before any queries were handled. As an aside, I'm not quite sure I understand why pings during the warmup are a problem. But anyway. 
firstSearcher is particularly relevant because the autowarmCount settings on your caches are irrelevant when starting the server - there's no history to autowarm. But there's no good reason _not_ to let queries through while firstSearcher is doing its tricks; they just get into the queue and are served as quickly as they may be. That might be some time since, as you say, they may not get serviced until the expensive parts get filled. But I don't think having them be serviced is doing any harm. Now, newSearcher and autowarming of the caches is a completely different beast, since having the old searchers continue serving requests until the warmups complete _does_ directly impact the user: they don't see random slowness because a searcher is being opened. So I guess my real question is whether you're seeing a measurable problem or if this is a red herring FWIW, Erick On Thu, Nov 8, 2012 at 2:54 PM, Aaron Daubman daub...@gmail.com wrote: Greetings, I have several custom QueryComponents that have high one-time startup costs (hashing things in the index, caching things from a RDBMS, etc...) Is there a way to prevent Solr from accepting connections before all QueryComponents are ready? Especially since many of our instances are load-balanced (and added in/removed automatically based on admin/ping responses), preventing ping from answering prior to all custom QueryComponents being ready would be ideal... Thanks, Aaron
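One concrete way to line the ping up with warming, for anyone hitting this later: Solr's PingRequestHandler can be gated on a healthcheck file, so the deploy script enables the node only after warmup finishes. A sketch along the lines of the stock solrconfig.xml (the file name is arbitrary, and check that your Solr version's ping handler supports this option):

```xml
<!-- solrconfig.xml: ping answers OK only while the healthcheck file exists -->
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <str name="q">solrpingquery</str>
  </lst>
  <!-- create this file to enable the node; remove it to drain traffic -->
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
```

The restart script then deletes the file before bouncing the container, waits for firstSearcher to finish, and recreates it - so the load balancer never sees a "ready" ping from a cold node.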
Re: Internal Vs. External ZooKeeper
Thanks, Jack, this is a great explanation! And since a greater number of ZK nodes tends to degrade write performance, that would be a factor in deciding whether to make every Solr node a ZK node as well. Much obliged! Nick On 11/11/2012 10:45 AM, Jack Krupansky wrote: [...]
zkcli issues
OK, so this is my ZooKeeper week, sorry. :) So I'm trying to use ZkCLI without success. I DID start and stop Solr in non-cloud mode, so everything is extracted and it IS finding zookeeper*.jar. However, now it's NOT finding SolrJ. I even tried to run it from the provided script (in cloud-scripts) with no success. Here's what I've got:

cd my-solr-install
.\example\cloud-scripts\zkcli.bat -cmd upconfig -zkhost localhost:9983 -confdir example/solr/collection/conf -confname conf1 -solrhome example/solr

which echoes:

set JVM=java
set SDIR=C:\sw\apache-solr-4.0.0\example\cloud-scripts\
if \ == \ set SDIR=C:\sw\apache-solr-4.0.0\example\cloud-scripts
java -classpath C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\* org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost localhost:9983 -confdir example/solr/collection/conf -confname conf1 -solrhome example/solr
Error: Could not find or load main class C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\apache-solr-solrj-4.0.0.jar

I've verified that C:\sw\apache-solr-4.0.0\example\cloud-scripts\..\solr-webapp\webapp\WEB-INF\lib\apache-solr-solrj-4.0.0.jar exists, so I'm really at a loss here. Thanks... Nick
Re: zkcli issues
On Sun, Nov 11, 2012 at 10:39 PM, Nick Chase nch...@earthlink.net wrote: So I'm trying to use ZkCLI without success. I DID start and stop Solr in non-cloud mode, so everything is extracted and it IS finding zookeeper*.jar. However, now it's NOT finding SolrJ. Not sure about your specific problem in this case, but I chatted with Mark about this while at ApacheCon... it seems like we should be able to explode the WAR ourselves if necessary, eliminating the need to start Solr first. Just throwing it out there before I forgot about it ;-) -Yonik http://lucidworks.com
Re: Apache Nutch 1.5.1 + Apache Solr 4.0
Hi, On 8 Nov 2012, at 15:00, Markus Jelsma markus.jel...@openindex.io wrote: Hm, i copied the schema from Nutch' trunk verbatim and only had to change the stemmer. It seems like you have, for some reason, a float with an extra point dangling around somewhere. Can you check? Just building a Nutch 1.5.1 environment and found this too. It is actually the version number in the schema.xml[1] and schema-solr4.xml[2] files for the 1.5.1 branch that is the problem. In these files the version number reads: schema name=nutch version=1.5.1 Whereas in trunk[3] it is: schema name=nutch version=1.5 Obviously, as the field is read as a float in the IndexSchema class, 1.5.1 will fail due to the extra point. A quick change back to 1.5 in the file should solve things. Cheers, Dave [1] http://svn.apache.org/repos/asf/nutch/branches/branch-1.5.1/conf/schema.xml [2] http://svn.apache.org/repos/asf/nutch/branches/branch-1.5.1/conf/schema-solr4.xml [3] http://svn.apache.org/repos/asf/nutch/trunk/conf/schema.xml
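For anyone applying Dave's fix, the change in conf/schema.xml (and conf/schema-solr4.xml) is just the version attribute on the opening schema element:

```xml
<!-- before: version="1.5.1" cannot be parsed as a float,
     hence "Schema Parsing Failed: multiple points" -->
<!-- after: matches trunk -->
<schema name="nutch" version="1.5">
  <!-- field and type definitions unchanged -->
</schema>
```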
Re: Internal Vs. External ZooKeeper
let me see if I get this correctly: the greater the number of zookeeper nodes, the more time it takes to come to a consensus. During an indexing operation, how many times does a solr client need to contact zookeeper for consensus? Per doc? Per commit? thanks, Ani On Sun, Nov 11, 2012 at 11:17 AM, Nick Chase nch...@earthlink.net wrote: [...] -- Anirudha P. Jadhav
Re: Internal Vs. External ZooKeeper
When SolrCloud is in a steady state (e.g. the number of nodes in the cluster is not changing and config is not changing), Solr does not really talk to ZooKeeper other than really light stuff like a heartbeat and maintaining a connection. So performance is not likely a large concern here. Mostly it's just a hassle because ZooKeeper does not currently support dynamically changing the nodes in an ensemble without doing a rolling restart. There are JIRA issues being worked on that will help with this, though. Until then, it's just kind of a pain that some nodes have to be special, or you have to do rolling restarts to make additional nodes part of the zk quorum. It's really up to you though - having the services separate just seems nicer to me. Easier to maintain. Often, once you start running ZooKeeper for one thing, you may end up running other things that use ZooKeeper as well - many people like to colocate this stuff on a single dedicated ZooKeeper ensemble. Embedded will run just fine - we simply recommend the other way to save headaches. If you know what you are getting into, it's certainly a valid choice. - Mark On 11/11/2012 05:11 PM, Anirudha Jadhav wrote: [...]
Re: zkcli issues
On 11/11/2012 04:47 PM, Yonik Seeley wrote: On Sun, Nov 11, 2012 at 10:39 PM, Nick Chase nch...@earthlink.net wrote: So I'm trying to use ZkCLI without success. I DID start and stop Solr in non-cloud mode, so everything is extracted and it IS finding zookeeper*.jar. However, now it's NOT finding SolrJ. Not sure about your specific problem in this case, but I chatted with Mark about this while at ApacheCon... it seems like we should be able to explode the WAR ourselves if necessary, eliminating the need to start Solr first. Just throwing it out there before I forgot about it ;-) -Yonik http://lucidworks.com I guess the tricky part might be knowing where to extract it. We know how to do it for the default jetty setup, but that could be reconfigured or you could be using another web container. Kind of annoying. - Mark
Solr 4.0 - distributed updates without zookeeper?
Looking at how we could upgrade some of our infrastructure to Solr 4.0 - I would really like to take advantage of distributed updates to get NRT, but we want to keep our fixed master and slave server roles since we use different hardware appropriate to the different roles. Looking at the Solr 4.0 distributed update code, it seems really hard-coded and bound to ZooKeeper. Is there a way to have a Solr master distribute updates without using ZK, or a way to mock the ZK interface to provide a fixed cluster topography that will work when sending updates just to the master? To be clear, if the master goes down I don't want a slave promoted, nor do I want most of the other SolrCloud features - we have already built out a system for managing groups of servers. Thanks, Peter
Re: Apache Nutch 1.5.1 + Apache Solr 4.0
Hi Steiner, I found a video tutorial on Nutch 1.4 + Solr 3.4.0 (on Windows). It solved my error; hope it does for yours too. Here are the links: Running Nutch and Solr on Windows Tutorial: Part 1 http://www.youtube.com/watch?v=baxhI6Wkov8 Running Nutch and Solr on Windows Tutorial: Part 2 http://www.youtube.com/watch?v=Qs-18hRRpNU Running Nutch and Solr on Windows Tutorial: Part 3 http://www.youtube.com/watch?v=GtbDHiYrlNE Published on Mar 15, 2012 by Dutedute2 Kind regards, Hanjoyo On Thu, Nov 8, 2012 at 4:52 PM, Antony Steiner ant.stei...@gmail.com wrote: Hello, my name is Antony and I'm new to Apache Nutch and Solr. I want to crawl my website and therefore I downloaded Nutch to do this. This works fine. But now I would like to integrate Nutch with Solr. I'm running this on my Unix system. I'm trying to follow this tutorial: http://wiki.apache.org/nutch/NutchTutorial But it won't work for me. Running Solr without Nutch is no problem. I can post documents to Solr with post.jar. But what I want to do is post my Nutch crawl to Solr. Now if I copy the schema.xml from Nutch to the apache-solr-4.0.0/example/solr/collection1/conf directory and restart Solr (java -jar start.jar), I get compile errors but Solr will start. (Is this the correct directory to copy my schema to?) Nov 8, 2012 9:40:33 AM org.apache.solr.schema.IndexSchema readSchema INFO: Schema name=nutch Nov 8, 2012 9:40:33 AM org.apache.solr.core.CoreContainer create SEVERE: Unable to create core: collection1 org.apache.solr.common.SolrException: Schema Parsing Failed: multiple points at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:571) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:113) ... 
Nov 8, 2012 9:40:33 AM org.apache.solr.common.SolrException log SEVERE: null:org.apache.solr.common.SolrException: Schema Parsing Failed: multiple points at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:571) at org.apache.solr.schema.IndexSchema.init(IndexSchema.java:113) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:846) ... Now if I don't copy the schema and push my nutch crawl to solr I get following error: SolrIndexer: starting at 2012-11-08 10:49:02 Indexing 5 documents java.io.IOException: Job failed! SolrDeleteDuplicates: starting at 2012-11-08 10:49:47 SolrDeleteDuplicates: Solr url: http://photon:8983/solr/ And this is taken from the logging: org.apache.solr.common.SolrException: ERROR: [doc= http://e-docs/infrastructure/cpuload_monitor.html] unknown field 'host' What should I do or what am I missing? I hope you can help me Best Regards Antony
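A note on the second error: "unknown field 'host'" means the core is still running Solr's stock example schema rather than Nutch's, so copying Nutch's schema into the core's conf directory is the real fix. Purely for illustration, the missing declaration looks something like this (the actual Nutch schema defines this and many other fields, possibly with different options):

```xml
<!-- schema.xml: one of the fields Nutch writes for each crawled document -->
<field name="host" type="string" stored="false" indexed="true"/>
```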
Re: custom request handler
Hi Lee, So the query component would be a subclass of SearchComponent and you can define the list of components executed during a search handler. http://wiki.apache.org/solr/SearchComponent I *think* you can have a custom component do what you want as long as it's the first component in the list so you can inspect and re-set the parameters before it goes downstream to the other components. However, it's still not clear how you are going to prevent users from POSTing bad queries or looking at things they probably shouldn't be like the schema.xml or solrconfig.xml or the admin console. Maybe there are ways in Solr to prevent this but then you'd have to allow it for internal admins but exclude it for the public. If you are exposing your slaves to the actual world wide public then I'd strongly suggest an app layer between solr and the public. I treat Solr like my database meaning that I don't expose access to my database publicly but rather through some app layer (say some CMS tools or what not). HTH! Amit On Sun, Nov 11, 2012 at 5:23 AM, Lee Carroll lee.a.carr...@googlemail.comwrote: Only slaves are public facing and they are read only, with limited query request handlers defined. The above approach is to prevent abusive / in appropriate queries by clients. A query component sounds interesting would this be implemented through an interface so could be separate from solr or would it be sub classing a base component ? cheers lee c On 9 November 2012 17:24, Amit Nithian anith...@gmail.com wrote: Lee, I guess my question was if you are trying to prevent the big bad world from doing stuff they aren't supposed to in Solr, how are you going to prevent the big bad world from POSTing a delete all query? Or restrict them from hitting the admin console, looking at the schema.xml, solrconfig.xml. I guess the question here is who is the big bad world? The internet at large or employees/colleagues in your organization? 
If it's the internet at large then I'd totally decouple this from Solr b/c I want to be 100% sure that the *only* thing that the internet has access to is a GET on /select with some restrictions and this could be done in many places but it's not clear that coupling this to Solr is the place to do it. If the big bad world is just within your organization and you want some basic protections around what they can and can't see then what you are saying is reasonable to me. Also perhaps another option is to consider a query component rather than creating a subclass of the request handler as a query component promotes more re-use and flexibility. You could make the necessary parameter changes in the prepare() method and just make sure that this safe parameter component comes before the query component in the list of components for a handler and you should be fine. Cheers! Amit On Fri, Nov 9, 2012 at 5:39 AM, Lee Carroll lee.a.carr...@googlemail.com wrote: Hi Amit I did not do this via a servlet filter as I wanted the solr devs to be concerned with solr config and keep them out of any concerns of the container. By specifying declarative data in a request handler that would be enough to produce a service uri for an application. Or have I missed a point ? We have several cores with several apps all with different data query needs. Maybe 20 request handlers needed to support this with active development on going. Basically I want it easy for devs to create a specific request handler suited to their needs. I thought a servlet filter developed and maintained every time would be overkill. Again though I may have missed a point / over emphasised a difficulty? Are you saying my custom request handler is too tightly bound to solr? so the parameters my apps speak are not de-coupled enough from solr? Lee C On 7 November 2012 19:49, Amit Nithian anith...@gmail.com wrote: Why not do this in a ServletFilter? 
Alternatively, I'd just write a front end application servlet to do this so that you don't firewall your internal admins off from accessing the core Solr admin pages. I guess you could solve this using some form of security but I don't know this well enough. If I were to restrict access to certain parts of Solr, I'd do this outside of Solr itself and do this in a servlet or a filter, inspecting the parameters. It's easy to create a modifiable parameters class and populate that with acceptable parameters before the Solr filter operates on it. HTH Amit
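The component wiring Amit describes can be sketched in solrconfig.xml. This is a hedged example: `com.example.SafeParamsComponent` is a hypothetical class name for the parameter-sanitizing component, and the handler name is made up. The point is only that the custom component is listed first, so its prepare() runs before the stock query component sees the parameters:

```xml
<!-- Hypothetical custom component; it would subclass SearchComponent
     and whitelist/rewrite request parameters in prepare(). -->
<searchComponent name="safeParams" class="com.example.SafeParamsComponent"/>

<requestHandler name="/app-select" class="solr.SearchHandler">
  <arr name="components">
    <str>safeParams</str>  <!-- must come first to sanitize params -->
    <str>query</str>
    <str>facet</str>
    <str>highlight</str>
    <str>debug</str>
  </arr>
</requestHandler>
```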
Re: Preventing accepting queries while custom QueryComponent starts up?
Jack, I think the issue is that the ping which is used to determine whether or not the server is live returns a seemingly false positive back to the load balancer (and indirectly the client) that this server is ready to go when in fact it's not. Reading this page ( http://wiki.apache.org/solr/SolrConfigXml), it does seem to be documented to do this, but it may not be fully stressed that you should hide your Solr behind a load balancer. I am more than happy to write up a post that, in my opinion at least, stresses some best practices on the use of Solr based on my experience if others find this useful. What seems odd here is that the ping is a query, so maybe the ping query in the solrconfig (for Aaron and others having this issue) should be configured to hit the handler that is used by the front end app, so that while that handler is warming up the ping query will be blocked. Of course using the load balancer means that the app layer knows nothing about servers in and out of rotation. Cheers! Amit On Sun, Nov 11, 2012 at 8:05 AM, Jack Krupansky j...@basetechnology.com wrote: Is the issue here that the Solr node is continuously live with the load balancer so that the moment during startup that Solr can respond to anything, the load balancer will be sending it traffic and that this can occur while Solr is still warming up? First, shouldn't we be encouraging people to have an app layer between Solr and the outside world? If so, the app layer should simply not respond to traffic until the app layer can verify that Solr has stabilized. If not, then maybe we do need to suggest a change to Solr so that the developer can control exactly when Solr becomes live and responsive to incoming traffic. At a minimum, we should document when that moment is today in terms of an explicit contract. It sounds like the problem is that the contract is either nonexistent, vague, ambiguous, non-deterministic, or whatever. 
-- Jack Krupansky -Original Message- From: Amit Nithian Sent: Saturday, November 10, 2012 4:24 PM To: solr-user@lucene.apache.org Subject: Re: Preventing accepting queries while custom QueryComponent starts up? Yeah that's what I was suggesting in my response too. I don't think your load balancer should be doing this but whatever script does the release (restarting the container) should do this so that when the ping is enabled the warming has finished. On Sat, Nov 10, 2012 at 3:33 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, rather than hit the ping query, why not just send in a real query and only let the queued ones through after the response? Just a random thought Erick On Sat, Nov 10, 2012 at 2:53 PM, Amit Nithian anith...@gmail.com wrote: Yes but the problem is that if user facing queries are hitting a server that is warming up and isn't being serviced quickly, then you could potentially bring down your site if all the front end threads are blocked on Solr queries b/c those queries are waiting (presumably at the container level since the filter hasn't finished its init() sequence) for the warming to complete (this is especially notorious when your front end is rails). This is why your ping to enable/disable a server from the load balancer has to be accurate with regards to whether or not a server is truly ready and warm. I think what I am gathering from this discussion is that the server is warming up, the ping is going through and tells the load balancer this server is ready, user queries are hitting this server and are queued waiting for the firstSearcher to finish (say these initial user queries are to respond in 500-1000ms) that's terrible for performance. Alternatively, if you have a bunch of servers behind a load balancer, you want this one server (or block of servers depending on your deployment) to be reasonably sure that user queries will return in a decent time (whatever you define decent to be) hence why this matters. 
Let me know if I am missing anything. Thanks Amit On Sat, Nov 10, 2012 at 10:03 AM, Erick Erickson erickerick...@gmail.com wrote: Why does it matter? The whole idea of firstSearcher queries is to warm up your system as fast as possible. The theory is that upon restarting the server, let's get this stuff going immediately... They were never intended (as far as I know) to complete before any queries were handled. As an aside, I'm not quite sure I understand why pings during the warmup are a problem. But anyway. firstSearcher is particularly relevant because the autowarmCount settings on your caches are irrelevant when starting the server; there's no history to autowarm. But there's no good reason _not_ to let queries through while firstSearcher is doing its tricks, they just get into the queue and are served as quickly as they may. That might be some time since, as you say, they may not get serviced
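One way to make the ping reflect real readiness, as discussed above, is to route the ping query through the same handler the front end uses and gate it on a health-check file that the release script creates only after warming finishes. A sketch for solrconfig.xml; the handler name and file name here are assumptions:

```xml
<requestHandler name="/admin/ping" class="solr.PingRequestHandler">
  <lst name="invariants">
    <!-- route the ping through the app-facing handler so it queues
         behind firstSearcher warming just like a real query would -->
    <str name="qt">/app-select</str>
    <str name="q">solrpingquery</str>
  </lst>
  <!-- ping returns SERVICE_UNAVAILABLE until this file exists; the
       deploy script touches it once warming is done and removes it
       to drain traffic before a restart -->
  <str name="healthcheckFile">server-enabled.txt</str>
</requestHandler>
```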
Re: 4.0 query question
Why not group by cid using the grouping component, within the group sort by version descending and return 1 result per group. http://wiki.apache.org/solr/FieldCollapsing Cheers Amit On Fri, Nov 9, 2012 at 2:56 PM, dm_tim dm_...@yahoo.com wrote: I think I may have found my answer but I'd like additional validation: I believe that I can add a function to my query to get only the highest values of 'file_version' like this - _val_:max(file_version, 1) I seem to be getting the results I want. Does this look correct? Regards, Tim -- View this message in context: http://lucene.472066.n3.nabble.com/4-0-query-question-tp4019397p4019426.html Sent from the Solr - User mailing list archive at Nabble.com.
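For reference, Amit's suggestion would look roughly like the following request parameters. The field names `cid` and `version` are taken from the thread and would need to match the actual schema:

```
/solr/select?q=*:*
    &group=true
    &group.field=cid
    &group.sort=version desc
    &group.limit=1
```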
Re: how to sort the solr suggester's result
Can anyone help to tell me where my mistake is?

eyun

From: eyun Date: 2012-11-12 11:24 To: solr-user-subscribe Subject: how to sort the solr suggester's result

Following is my config; it suggests words well. I want to get a sorted result when it suggests, so I added a transformer that appends a tab (\t) separated float weight string to the end of the Suggestion field, but the suggestion result still isn't sorted correctly. (My suggest result was shown in a screenshot; the weight is the number at the end of each suggestion.)

schema.xml:

  <field name="Suggestion" type="string" indexed="true" stored="true"/>

solrconfig.xml:

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="field">Suggestion</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <!-- <float name="threshold">0.0001</float> -->
      <str name="spellcheckIndexDir">spellchecker</str>
      <str name="comparatorClass">freq</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

eyun
how to sort the solr suggester's result
Following is my config; it suggests words well. I want to get a sorted result when it suggests, so I added a transformer that appends a tab (\t) separated float weight string to the end of the Suggestion field, but the suggestion result still isn't sorted correctly. My suggest result (note that the float number at the end is the weight):

  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="我">
        <int name="numFound">10</int>
        <int name="startOffset">1</int>
        <int name="endOffset">2</int>
        <arr name="suggestion">
          <str>我脑中的橡皮擦 2.12</str>
          <str>我老婆是大佬3 2.07</str>
          <str>我老婆是大佬2 2.12</str>
          ...

schema.xml:

  <field name="Suggestion" type="string" indexed="true" stored="true"/>

solrconfig.xml:

  <searchComponent class="solr.SpellCheckComponent" name="suggest">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="field">Suggestion</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
      <!-- <float name="threshold">0.0001</float> -->
      <str name="spellcheckIndexDir">spellchecker</str>
      <str name="comparatorClass">freq</str>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>

  <requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
    <lst name="defaults">
      <str name="spellcheck">true</str>
      <str name="spellcheck.dictionary">suggest</str>
      <str name="spellcheck.count">10</str>
      <str name="spellcheck.onlyMorePopular">true</str>
      <str name="spellcheck.collate">true</str>
    </lst>
    <arr name="components">
      <str>suggest</str>
    </arr>
  </requestHandler>

-- eyun The truth, whether or not Q:276770341 G+:eyun...@gmail.com
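As an aside on the setup above: one likely reason the ordering doesn't change is that appending a weight string to the indexed field just makes every entry unique, so the frequency-based comparator sees nothing useful. The Suggester can instead read explicit weights from a file-based dictionary, one "phrase<TAB>weight" entry per line. A hedged sketch (the file name is an assumption, and the file would live in the core's conf directory):

```xml
<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <!-- suggestions.txt: one "phrase<TAB>weight" entry per line,
         e.g. 我脑中的橡皮擦<TAB>2.12 -->
    <str name="sourceLocation">suggestions.txt</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```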
RE: sort by function error
More information: the problem only happens when I have both sort-by-function and grouping in the query. From: Kuai, Ben [ben.k...@sensis.com.au] Sent: Monday, November 12, 2012 2:12 PM To: solr-user@lucene.apache.org Subject: sort by function error Hi, I am trying to use sort by function, something like sort=sum(field1, field2) asc, but it is not working and I get the error "SortField needs to be rewritten through Sort.rewrite(..) and SortField.rewrite(..)". Please shed some light on this. Thanks Ben

Full exception stack trace:

SEVERE: java.lang.IllegalStateException: SortField needs to be rewritten through Sort.rewrite(..) and SortField.rewrite(..)
  at org.apache.lucene.search.SortField.getComparator(SortField.java:484)
  at org.apache.lucene.search.grouping.AbstractFirstPassGroupingCollector.init(AbstractFirstPassGroupingCollector.java:82)
  at org.apache.lucene.search.grouping.TermFirstPassGroupingCollector.init(TermFirstPassGroupingCollector.java:58)
  at org.apache.solr.search.Grouping$TermFirstPassGroupingCollectorJava6.init(Grouping.java:1009)
  at org.apache.solr.search.Grouping$CommandField.createFirstPassCollector(Grouping.java:632)
  at org.apache.solr.search.Grouping.execute(Grouping.java:301)
  at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:373)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:201)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248)
  at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
  at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
  at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:225)
  at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
  at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:472)
  at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
  at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
  at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
  at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
  at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
  at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1001)
  at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:585)
  at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:310)
  at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:662)
Re: customize solr search/scoring for performance
Yes, we only need term overlap information to choose top candidates (we may incorporate boost factors for different terms later, but that's another story). We are quite new to Solr, so we haven't really profiled the process. Is there any rough guess as to what the expected latency would be in such cases? Our throughput is only around 100 qps, so that might not be a significant factor here. Thanks, Jeremy Otis Gospodnetic-5 wrote Fuzzy answer: Can you verify the bottleneck, especially in slow cases, is indeed scoring? Profiler? Not sure if the coord method in Similarity is still around... are you saying you need just term overlap for scoring/ordering? 20m small docs and 2s queries on good hardware sounds suspicious ... do slow queries correspond to GC or something else? Otis -- Performance Monitoring - http://sematext.com/spm -- View this message in context: http://lucene.472066.n3.nabble.com/customize-solr-search-scoring-for-performance-tp4019444p4019675.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: zkcli issues
Nick - I believe you're experiencing difficulties with the SolrCloud CLI commands for interacting with ZooKeeper. Please have a look at the links below; they should point you in the right direction. Handy SolrCloud ZkCLI Commands Uploading Solr Configuration into ZooKeeper ensemble Cheers, Jeeva On Nov 12, 2012, at 4:45 AM, Mark Miller markrmil...@gmail.com wrote: On 11/11/2012 04:47 PM, Yonik Seeley wrote: On Sun, Nov 11, 2012 at 10:39 PM, Nick Chase nch...@earthlink.net wrote: So I'm trying to use ZkCLI without success. I DID start and stop Solr in non-cloud mode, so everything is extracted and it IS finding zookeeper*.jar. However, now it's NOT finding SolrJ. Not sure about your specific problem in this case, but I chatted with Mark about this while at ApacheCon... it seems like we should be able to explode the WAR ourselves if necessary, eliminating the need to start Solr first. Just throwing it out there before I forgot about it ;-) -Yonik http://lucidworks.com I guess the tricky part might be knowing where to extract it. We know how to do it for the default jetty setup, but that could be reconfigured or you could be using another web container. Kind of annoying. - Mark
Re: zkcli issues
Nick - Sorry, the embedded links were not shown in the previous email. I'm including them below. Handy SolrCloud ZkCLI Commands (http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/#handy-solrcloud-cli-commands) Uploading Solr Configuration into ZooKeeper ensemble (http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/#uploading-solrconfig-to-zookeeper) Cheers, Jeeva On Nov 12, 2012, at 12:48 PM, Jeevanandam Madanagopal je...@myjeeva.com wrote: Nick - I believe you're experiencing difficulties with the SolrCloud CLI commands for interacting with ZooKeeper. Please have a look at the links below; they should point you in the right direction. Handy SolrCloud ZkCLI Commands Uploading Solr Configuration into ZooKeeper ensemble Cheers, Jeeva On Nov 12, 2012, at 4:45 AM, Mark Miller markrmil...@gmail.com wrote: On 11/11/2012 04:47 PM, Yonik Seeley wrote: On Sun, Nov 11, 2012 at 10:39 PM, Nick Chase nch...@earthlink.net wrote: So I'm trying to use ZkCLI without success. I DID start and stop Solr in non-cloud mode, so everything is extracted and it IS finding zookeeper*.jar. However, now it's NOT finding SolrJ. Not sure about your specific problem in this case, but I chatted with Mark about this while at ApacheCon... it seems like we should be able to explode the WAR ourselves if necessary, eliminating the need to start Solr first. Just throwing it out there before I forgot about it ;-) -Yonik http://lucidworks.com I guess the tricky part might be knowing where to extract it. We know how to do it for the default jetty setup, but that could be reconfigured or you could be using another web container. Kind of annoying. - Mark
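For completeness, the upconfig invocation from those links looks roughly like this for the stock Solr 4.x Jetty example. The paths assume Solr has been started once so the war is exploded (which is exactly the prerequisite discussed above) and that the embedded ZooKeeper is on port 9983; adjust classpath and zkhost for your layout:

```
java -classpath "example/solr-webapp/webapp/WEB-INF/lib/*" \
     org.apache.solr.cloud.ZkCLI -cmd upconfig \
     -zkhost localhost:9983 \
     -confdir example/solr/collection1/conf \
     -confname myconf
```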
Integrating Solr with Database
Hello, I'm currently working on a file management system based on Solr. What I have accomplished so far is that I have a Solr server and a Windows client application that run on different computers. When the client indexes a rich document to the Solr server remotely, it also uploads the file itself via FTP, so that when anyone searches for the document, he/she can download the raw file from the server. What I want to do right now is that whenever the client indexes a document and uploads the raw file, the database gets updated with the pairs of (Document ID in Solr, path of the raw file inside the server). So on the search result page, instead of giving the direct link to the raw file, I'd like to make the server look up the database based on the Document ID in Solr and return the linked file path. As I'm new to databases, Apache, RESTful APIs, and stuff, I'm not sure how to begin implementing this feature. Any help or starting point would be appreciated. Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Integrating-Solr-with-Database-tp4019692.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Integrating Solr with Database
On 12 November 2012 13:00, 122jxgcn ywpar...@gmail.com wrote: [...] What I want to do right now is that whenever the client indexes document and uploads the raw file, database gets update with the pairs of (Document ID in Solr, path of the raw file inside server). So on search result page, instead of giving the direct link of the raw file, I'd like to make server to look up the database based on the Document ID in Solr and return the linked file path. This might make sense if you were using Solr to search for the ID of an object in the database with relations to other objects. However, if all you are doing is retrieving the file path/URL, why not index that into Solr, and get it directly from Solr? If you still want to do what you had in mind, you should handle that as part of your indexing process, i.e., update both Solr and the database at the same time, or update the database and index to Solr from there. Regards, Gora
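Gora's first suggestion, storing the path in Solr itself, only needs a stored, non-indexed field in schema.xml, along these lines (the field name here is an assumption):

```xml
<!-- stored so it comes back in search results; not indexed
     since nobody searches on the raw path -->
<field name="file_path" type="string" indexed="false" stored="true"/>
```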
Re: Integrating Solr with Database
This might make sense if you were using Solr to search for the ID of an object in the database with relations to other objects. However, if all you are doing is retrieving the file path/URL, why not index that into Solr, and get it directly from Solr? That's what I'm doing right now, but since there are some naming and security issues, I'd like to integrate Solr with a database eventually. If you still want to do what you had in mind, you should handle that as part of your indexing process, i.e., update both Solr and the database at the same time I have thought about that, but I could not figure out how to update the database when I'm updating Solr. I'm pretty sure the database has to be connected with Solr somehow (first difficulty) and the database has to be updated remotely from a Windows Forms application written in C# (second difficulty). Thank you. -- View this message in context: http://lucene.472066.n3.nabble.com/Integrating-Solr-with-Database-tp4019692p4019695.html Sent from the Solr - User mailing list archive at Nabble.com.
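On the "update both at the same time" option: the client (or a small indexing service sitting in front of Solr) simply performs both writes in one routine, keyed by the same document ID. A minimal, hypothetical Python sketch; the Solr POST is elided as a comment and the SQLite table is purely illustrative:

```python
import json
import sqlite3

def index_document(doc_id, title, file_path, conn):
    """Index a document and record its raw-file path in one step,
    so Solr and the database cannot drift apart."""
    # Payload for Solr's JSON update handler; real code would POST it
    # to http://<host>:8983/solr/<core>/update/json and then commit.
    payload = json.dumps([{"id": doc_id, "title": title}])

    # Record the (Solr document id -> raw file path) pair.
    conn.execute(
        "INSERT INTO files (doc_id, path) VALUES (?, ?)",
        (doc_id, file_path),
    )
    conn.commit()
    return payload

# Example: an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (doc_id TEXT PRIMARY KEY, path TEXT)")
payload = index_document("doc-1", "CPU load monitor", "/files/doc-1.html", conn)
```

On the search-result page, the server then resolves the Solr document ID to a path with a single lookup in the `files` table.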
Re: More references for configuring Solr
LucidFind collects several sources of information in one searchable archive: http://find.searchhub.org/?q=sort=#%2Fp%3Asolr - Original Message - | From: Dmitry Kan dmitry@gmail.com | To: solr-user@lucene.apache.org | Sent: Sunday, November 11, 2012 2:24:21 AM | Subject: Re: More references for configuring Solr | | Hi, | | here are some resources: | http://wiki.apache.org/solr/ (Solr wiki) | http://lucene.apache.org/solr/books.html (books published on Solr) | | then go googling on a specific topic; but reading a book first might | not be a bad idea. | | -- Dmitry | | On Sat, Nov 10, 2012 at 1:15 PM, FARAHZADI, EMAD | emad.farahz...@netapp.com wrote: | | Dear Sir or Madam, | | I want to use Solr for my final project in university, in the part on | searching and indexing. | | I'd appreciate it if you could send me more resources or documentation | about Solr. | | Regards, | | Emad Farahzadi | Professional Services Consultant | NetApp Middle-East | Office: +971 4 4466203 | Cell: +971 50 9197237 | | NetApp MEA (Middle East Africa) | Office No. 214 | Building 2, 2nd Floor | Dubai Internet City | P.O. Box 500199 | Dubai, U.A.E. | | | -- | Regards, | | Dmitry Kan |