Query across multiple shards - key fields have different names
Hi,

Sorry for the basic question - I can't get to the wiki to find the answer. Version: Solr 3.3.0.

I have two separate indexes (currently in two cores, but they can be moved to shards). One core holds metadata about educational resources, the other usage statistics. They have a common value, named id in one core and search.resourceid in the other.

How can I construct a shard query (once I have moved one of the cores to a different node) so that I can effectively get the statistics for each educational resource, grouped by resource? This is an offline reporting job that needs to list the usage events for educational resources over a time period (the usage events have a date/time field).

Regards,
Ben

--
Dr Ben Ryan
Jorum Technical Manager
5.12 Roscoe Building
The University of Manchester
Oxford Road
Manchester
M13 9PL
Tel: 0160 275 6039
E-mail: benjamin.r...@manchester.ac.uk
--
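For what it's worth, Solr's distributed (shards) search requires the same uniqueKey field on every shard, so a single shard query cannot join on two differently named key fields. For an offline report like this, a common workaround is a client-side join: query each core separately and merge the results in the reporting job. A minimal sketch, with hypothetical in-memory result sets standing in for the responses from the two cores:

```python
from collections import defaultdict

# Hypothetical result sets standing in for /select responses from the two
# cores: the metadata core keys resources by "id", the statistics core by
# "search.resourceid". A real report would page through actual Solr results.
metadata_docs = [
    {"id": "res-1", "title": "Algebra basics"},
    {"id": "res-2", "title": "Cell biology"},
]
usage_docs = [
    {"search.resourceid": "res-1", "event": "view"},
    {"search.resourceid": "res-1", "event": "download"},
    {"search.resourceid": "res-2", "event": "view"},
]

def usage_by_resource(metadata, usage):
    """Group usage events under each resource, joining across the two key names."""
    events = defaultdict(list)
    for doc in usage:
        events[doc["search.resourceid"]].append(doc["event"])
    return {m["id"]: {"title": m["title"], "events": events.get(m["id"], [])}
            for m in metadata}

report = usage_by_resource(metadata_docs, usage_docs)
```

The date/time filter on the usage events would simply become an fq on the statistics core's query before the merge.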
Re: commit in solr4 takes a longer time
Hi Sandeep,

I made the changes you mentioned and tested again for the same set of docs, but unfortunately the commit time increased.

--
View this message in context: http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060622.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: commit in solr4 takes a longer time
Hi Gopal,

I added the openSearcher parameter as you mentioned, but on checking the logs I found that openSearcher was still true on commit. Only when I removed the autoSoftCommit parameter did the openSearcher setting take effect and provide faster updates as well. However, I require soft commit in my application. Any suggestions?
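For reference, the usual way to keep soft-commit visibility while making hard commits cheap is to set openSearcher only inside the autoCommit block, alongside autoSoftCommit. A sketch of the relevant solrconfig.xml section, with illustrative values (not taken from the poster's config):

```xml
<!-- Illustrative values only: hard commits stay cheap because they do not
     open a searcher; visibility comes from the soft commits instead. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- flush to disk every 15s -->
    <openSearcher>false</openSearcher>  <!-- applies to hard commits only -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- new searcher (visibility) every 1s -->
  </autoSoftCommit>
</updateHandler>
```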
Re: socket write error
Digging in further, I found this in the HttpCommComponent class:

[code]
static {
  MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
  mgr.getParams().setDefaultMaxConnectionsPerHost(20);
  mgr.getParams().setMaxTotalConnections(1);
  mgr.getParams().setConnectionTimeout(SearchHandler.connectionTimeout);
  mgr.getParams().setSoTimeout(SearchHandler.soTimeout);
  // mgr.getParams().setStaleCheckingEnabled(false);
  client = new HttpClient(mgr);
}
[/code]

Could the value set by setDefaultMaxConnectionsPerHost(20) be too small for 80+ shards returning results to the router?

Dmitry

On Fri, May 3, 2013 at 6:50 AM, Dmitry Kan solrexp...@gmail.com wrote:

Hi, thanks. Solr 3.4. There are POST requests everywhere: between client and router, and between router and shards. Do you do faceting across all shards? How many documents, approximately, do you have?

On 2 May 2013 22:02, Patanachai Tangchaisin patanachai.tangchai...@wizecommerce.com wrote:

Hi,

First, which version of Solr are you using? I also have 60+ shards on Solr 4.2.1 and it doesn't seem to be a problem for me.
- Make sure you use POST to send a query to Solr.
- 'Connection reset by peer' from the client can indicate that there is something wrong with the server, e.g. the server closes a connection.

--
Patanachai

On 05/02/2013 05:05 AM, Dmitry Kan wrote:

After some searching around, I see this: http://search-lucene.com/m/ErEZUl7P5f2/%2522socket+write+error%2522subj=Long+list+of+shards+breaks+solrj+query

Seems like this has happened in the past with a large number of shards. To make it clear: the distributed search works with 20 shards.

On Thu, May 2, 2013 at 1:57 PM, Dmitry Kan solrexp...@gmail.com wrote:

Hi guys! We have a Solr router and shards.
I see this in the jetty log on the router:

May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: I/O exception (java.net.SocketException) caught when processing request: Connection reset by peer: socket write error

and then:

May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
INFO: Retrying request

followed by an exception about Internal Server Error. Any ideas why this happens? We run 80+ shards distributed across several servers. The router runs on its own node. Is there anything in particular I should be looking into wrt Ubuntu socket settings? Is this a known issue for Solr's distributed search from the past?

Thanks,
Dmitry

CONFIDENTIALITY NOTICE
==
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
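As an editorial aside: in Solr 3.x those HttpClient limits are hard-coded as shown above, but in Solr 4.x the shard handler became configurable in solrconfig.xml. A sketch, assuming Solr 4.x, with illustrative values for a large shard count (parameter names from HttpShardHandlerFactory):

```xml
<!-- Solr 4.x only; not available in 3.4, where the limits are hard-coded.
     Values are illustrative for a deployment with 80+ shards. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <shardHandlerFactory class="HttpShardHandlerFactory">
    <int name="maxConnectionsPerHost">100</int>
    <int name="connTimeout">15000</int>
    <int name="socketTimeout">60000</int>
  </shardHandlerFactory>
</requestHandler>
```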
Re: commit in solr4 takes a longer time
That's not ideal. Can you post your solrconfig.xml?

On 3 May 2013 07:41, vicky desai vicky.de...@germinait.com wrote:

Hi Sandeep,

I made the changes you mentioned and tested again for the same set of docs, but unfortunately the commit time increased.
Re: commit in solr4 takes a longer time
My solrconfig.xml is as follows:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
  <luceneMatchVersion>LUCENE_40</luceneMatchVersion>
  <indexConfig>
    <maxFieldLength>2147483647</maxFieldLength>
    <lockType>simple</lockType>
    <unlockOnStartup>true</unlockOnStartup>
  </indexConfig>
  <updateHandler class="solr.DirectUpdateHandler2">
    <autoSoftCommit>
      <maxDocs>500</maxDocs>
      <maxTime>1000</maxTime>
    </autoSoftCommit>
    <autoCommit>
      <maxDocs>5</maxDocs>
      <maxTime>30</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>
  <requestDispatcher handleSelect="true">
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="204800" />
  </requestDispatcher>
  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
  <requestHandler name="/update" class="solr.UpdateRequestHandler" />
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  <requestHandler name="/replication" class="solr.ReplicationHandler" />
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}" />
  <enableLazyFieldLoading>true</enableLazyFieldLoading>
  <admin>
    <defaultQuery>*:*</defaultQuery>
  </admin>
</config>
Re: What Happens to Consistency if I kill a Leader and Startup it again?
Shawn, thanks for the detailed answer, it explains everything. I think that there is no problem. I will use 4.3 when it is available, and if I see a situation like that I will report it.

2013/5/3 Shawn Heisey s...@elyograg.org

On 5/2/2013 2:19 PM, Furkan KAMACI wrote:

I see this at my admin page:

Replication (Slave)   Version          Gen   Size
Master:               1367307652512    82    778.04 MB
Slave:                1367307658862    82    781.05 MB

and I started to wonder about it, so that's why I asked this question.

As we've been trying to tell you, the sizes can (and will) be different between replicas on SolrCloud. Also, if you're not running a recent release candidate of 4.3, then the version numbers on the replication screen are misleading. See SOLR-4661 for more details.

Your example of version numbers like 100, 90, and 95 wouldn't actually happen, because the version number is based on the current time in milliseconds since 1970-01-01 00:00:00 UTC. If you index after killing the leader, the new leader's version number will be higher than the offline replica's.

If you can find actual proof of a problem with index updates related to killing the leader, then we can take the bug report and work on fixing it. Here's how you would go about finding proof. It would be easiest with one shard, but if you want to make sure it's OK with multiple shards, you would have to kill all the leaders.

* Start with a functional collection with two replicas.
* Index a document with a recognizable ID, like A.
* Make sure you can find document A.
* Kill the leader replica; let's say it was replica1.
* Make sure replica2 becomes leader.
* Make sure you can find document A.
* Index document B.
* Start replica1, wait for it to turn green.
* Make sure you can still find document B.
* Kill the leader again, this time replica2.
* Make sure you can still find document B.

To my knowledge, nobody has reported a real problem with proof. I would imagine that more than one person has done testing like this to make sure that SolrCloud is reliable.

Thanks,
Shawn
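Shawn's point about version numbers can be pictured with a simplified model: the index version is essentially wall-clock milliseconds at commit time, so a commit on the newly elected leader always compares higher than the old leader's last commit. A sketch of just the ordering idea (the real value comes from Lucene's index commit, not this function):

```python
import time

def commit_version():
    """Simplified model of a Solr index version: epoch milliseconds
    at commit time. Real Solr derives this from the index commit."""
    return int(time.time() * 1000)

v_old_leader = commit_version()   # last commit on the killed leader
time.sleep(0.01)                  # time passes before the next commit
v_new_leader = commit_version()   # first commit on the newly elected leader

# Later commits always win the comparison, so "100 vs 90 vs 95" can't occur.
assert v_new_leader > v_old_leader
```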
Re: Does Near Real Time get not supported at SolrCloud?
Do soft commits get distributed to the nodes of SolrCloud?

2013/5/3 Otis Gospodnetic otis.gospodne...@gmail.com

NRT works with SolrCloud.

Otis
Solr & ElasticSearch Support
http://sematext.com/

On May 2, 2013 5:34 AM, Furkan KAMACI furkankam...@gmail.com wrote:

Is Near Real Time search not supported in SolrCloud? I mean, when a soft commit occurs at a leader, I think it doesn't get distributed to the replicas (because it is not on storage - do indexes in RAM get distributed to replicas too?), so what happens when a search query comes in?
Re: Rearranging Search Results of a Search?
I think this looks like what I'm searching for: https://issues.apache.org/jira/browse/SOLR-4465

How about a post filter for Lucene - can it help me for my purpose?

2013/5/3 Otis Gospodnetic otis.gospodne...@gmail.com

Hi,

You should use search more often :) http://search-lucene.com/?q=scriptable+collector&sort=newestOnTop&fc_project=Solr&fc_type=issue

Coincidentally, what you see there happens to be a good example of a Solr component that does something behind the scenes to deliver those search results even though my original query was bad. Kind of similar to what you are after.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/

On Thu, May 2, 2013 at 4:47 PM, Furkan KAMACI furkankam...@gmail.com wrote:

I know that I can use boosting at query time for a field or a search term, at solrconfig.xml, and the query elevator, so I can arrange the results of a search. However, after I get the top documents, how can I change the order of the results? Is Lucene's post filter meant for that?
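Until something like the scriptable collector in SOLR-4465 is available, reordering can also be done entirely client-side: fetch a somewhat larger window of top documents from Solr, then re-sort locally. A hedged sketch with made-up documents and a hypothetical blending rule (the "clicks" signal is purely illustrative):

```python
# Hypothetical re-ranking of already-retrieved top documents: request a
# slightly larger rows= window from Solr, then reorder locally by a
# custom rule before displaying the final page.
top_docs = [
    {"id": "a", "score": 3.2, "clicks": 10},
    {"id": "b", "score": 2.9, "clicks": 500},
    {"id": "c", "score": 2.5, "clicks": 40},
]

def rerank(docs, weight=0.01):
    """Blend Solr's relevance score with a business signal (here: clicks)."""
    return sorted(docs, key=lambda d: d["score"] + weight * d["clicks"],
                  reverse=True)

reranked = [d["id"] for d in rerank(top_docs)]  # "b" jumps to the top
```

The trade-off versus a true post filter is that only the retrieved window can be reordered; documents outside it never get a chance.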
Re: Delete from Solr Cloud 4.0 index..
Thanks Shawn.

I have played around with soft commits before and didn't seem to get any improvement, but with the current load testing I am doing I will give it another go.

I have researched docValues and came across the fact that it would increase the index size. With the upgrade to 4.2.1 the index size has reduced by approx 33%, which is pleasing, and I don't really want to lose that saving.

We do use the facet.enum method - which works really well - but I will verify that we are using it in every instance; we have numerous developers working on the product and maybe one or two have slipped through.

Right from the first I upped the zkClientTimeout to 30 as I wanted to give extra time for any network blips that we experience on AWS. We only seem to drop communication on a full garbage collection, though.

I am coming to the conclusion that we need more shards to cope with the writes, so I will play around with adding more shards and see how I go.

I appreciate you having a look over our setup and the advice. Thanks again.

Netty.

On 2 May 2013 23:17, Shawn Heisey s...@elyograg.org wrote:

On 5/2/2013 4:24 AM, Annette Newton wrote:

Hi Shawn,

Thanks so much for your response. We basically are very write intensive and write throughput is pretty essential to our product. Reads are sporadic and actually functioning really well. We write on average (at the moment) 8-12 batches of 35 documents per minute, but we really will be looking to write more in the future, so we need to work out scaling of Solr and how to cope with more volume.

Schema (I have changed the names): http://pastebin.com/x1ry7ieW
Config: http://pastebin.com/pqjTCa7L

This is very clean. There's probably more you could remove/comment, but generally speaking I couldn't find any glaring issues. In particular, you have disabled autowarming, which is a major contributor to commit speed problems.

The first thing I think I'd try is increasing zkClientTimeout to 30 or 60 seconds. You can use the startup command line or solr.xml; I would probably use the latter. Here's a solr.xml fragment that uses a system property or a 15 second default:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores"
         zkClientTimeout="${zkClientTimeout:15000}"
         hostPort="${jetty.port:}"
         hostContext="solr">

General thoughts; these changes might not help this particular issue:

You've got autoCommit with openSearcher=true. This is a hard commit. If it were me, I would set that up with openSearcher=false and either do explicit soft commits from my application or set up autoSoftCommit with a shorter timeframe than autoCommit.

This might simply be a scaling issue, where you'll need to spread the load wider than four shards. I know that there are financial considerations with that, and they might not be small, so let's leave that alone for now. The memory problems might be a symptom/cause of the scaling issue I just mentioned.

You said you're using facets, which can be a real memory hog even with only a few of them. Have you tried facet.method=enum to see how it performs? You'd need to switch to it exclusively, never going with the default of fc. You could put that in the defaults or invariants section of your request handler(s). Another way to reduce memory usage for facets is to use disk-based docValues on version 4.2 or later for the facet fields, but this will increase your index size, and your index is already quite large. Depending on your index contents, the increase may be small or large.

Something to just mention: it looks like your solrconfig.xml has hard-coded absolute paths for dataDir and updateLog. This is fine if you'll only ever have one core/collection on each server, but it'll be a disaster if you have multiples. I could be wrong about how these get interpreted in SolrCloud -- they might actually be relative despite starting with a slash.

Thanks,
Shawn

--
Annette Newton
Database Administrator
ServiceTick Ltd
T: +44(0)1603 618326
Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ
www.servicetick.com
*www.sessioncam.com*

--
*This message is confidential and is intended to be read solely by the addressee. The contents should not be disclosed to any other person or copies taken unless authorised to do so. If you are not the intended recipient, please notify the sender and permanently delete this message. As Internet communications are not secure ServiceTick accepts neither legal responsibility for the contents of this message nor responsibility for any change made to this message after it was forwarded by the original author.*
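As an editorial aside, Shawn's suggestion to pin facet.method=enum in the request handler (so that no developer can slip through with the default fc) might look like the following sketch - handler name and placement illustrative:

```xml
<!-- Forcing facet.method=enum for every request through this handler.
     "invariants" cannot be overridden per-request; use "defaults" instead
     if you only want it as a fallback. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="facet.method">enum</str>
  </lst>
</requestHandler>
```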
Re: Delete from Solr Cloud 4.0 index..
One question Shawn - did you ever get any costings around Zing? Did you trial it?

Thanks.
Performance considerations when using distributed indexing + loadbalancing with Solr cloud
Hi all,

I have been playing with SolrCloud recently and am enjoying the distributed indexing capability. At the moment my SolrCloud consists of 2 leaders and 2 replicas, which are fronted by an HAProxy instance.

I want to maximise performance for indexing, and it occurred to me that the model I use for load balancing my indexing requests may impact performance, i.e. am I likely to see better indexing performance if I stick certain groups of requests to certain nodes vs simply using a round-robin approach?

I'll be doing some empirical testing to try and figure this out, but I was wondering if there's any general guidance here? Or if anyone has any experience of particularly good/bad configurations?

Many thanks,

Edd
--
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543
Good Desktop Search?
Hi everybody, just a simple question: is there any Solr/Lucene-based desktop search project around that someone might recommend? I am looking for something for personal use that is reasonably mature, at least stable, runs on Java and does not require admin rights to install. Nothing too fancy. Thanks /S.
Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud
Do you use CloudSolrServer when you push documents into SolrCloud to be indexed?
Re: Good Desktop Search?
Savia,

Maybe not very mature yet, but someone on java-us...@lucene.apache.org announced such a tool the other day. I'm copying it below. I do not know of many otherwise.

paul

Begin forwarded message:

From: Mirko Sertic mirko.ser...@web.de
Date: 29 avril 2013 21:20:19 HAEC
To: java-u...@lucene.apache.org
Subject: Lucene Desktop Search Engine with JavaFX/Tika/Filesystem Crawler/HTML5
Reply-To: java-u...@lucene.apache.org

Hi@all

Lucene rocks, and based on some JavaFX/HTML5 hybrids I built a small Java search engine for your desktop! The prototype and the result can be seen here: http://www.mirkosertic.de/doku.php/javastuff/fxdesktopsearch

I am using a multithreaded pipes-and-filters architecture with Tika as the content extraction framework and of course Lucene as the fulltext engine. It really rocks; I can search thousands of documents with syntax highlighting within a few milliseconds. It also supports MoreLikeThis queries showing document similarities.

Thanks @all working on Lucene! I am planning future releases of the desktop search engine with faceted search based on Tika-extracted document metadata. Also NLP with named entity extraction might be a use case, so everyone who is willing to contribute is very welcome. Source code is OSS and hosted on Google Code here: http://code.google.com/p/freedesktopsearch/

Regards
Mirko

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Good Desktop Search?
Thanks Paul, I missed that one.

On May 3, 2013, at 2:27 PM, Paul Libbrecht p...@hoplahup.net wrote:

Savia, maybe not very mature yet, but someone on java-us...@lucene.apache.org announced such a tool the other day. I'm copying it below. I do not know of many otherwise.
Solr 4 reload failed core
Hi I have a multi-core installation, with 2 cores. Sometimes, when Solr starts up, one of the cores fails (due to an extension to Solr I have, which is waiting on an external service which has yet to initialise). In previous versions of Solr, I could subsequently issue a RELOAD to this core, even though it was in a fail state, and it would reload and start up. Now it seems with Solr 4, I cannot issue a RELOAD to a core which has failed. Is this the case? How can I get Solr to start a core which failed on initial start up? Thanks, Peter
Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud
Hi,

No, we're actually POSTing them over plain old HTTP. Our feeder process simply points at the HAProxy box and posts merrily away.

Cheers,

Edd

On 3 May 2013 13:17, Furkan KAMACI furkankam...@gmail.com wrote:

Do you use CloudSolrServer when you push documents into SolrCloud to be indexed?

--
Web: http://www.eddgrant.com
Email: e...@eddgrant.com
Mobile: +44 (0) 7861 394 543
Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud
If you index with CloudSolrServer, your client will learn from ZooKeeper where each document should go and send it directly to the right shard leader. However, if you use some other random process, documents will land on an arbitrary node and then be routed to the right place within the cluster. This extra routing step within the cluster can cause unnecessary network traffic and add latency to indexing.

2013/5/3 Edd Grant e...@eddgrant.com

Hi,

No, we're actually POSTing them over plain old HTTP. Our feeder process simply points at the HAProxy box and posts merrily away.

Cheers,

Edd
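The extra hop described above can be pictured with a simplified router: each document's key hashes to exactly one shard, so a smart client that knows the mapping pays one network hop per document, while a plain load balancer may add a forwarding hop. (Real SolrCloud routing uses a MurmurHash over the uniqueKey; plain `hash()` stands in here as an illustrative simplification.)

```python
NUM_SHARDS = 2

def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Simplified stand-in for Solr's hash-based document router.
    Real SolrCloud uses MurmurHash over the uniqueKey."""
    return hash(doc_id) % num_shards  # always in [0, num_shards)

docs = ["doc-%d" % i for i in range(100)]

# A routing-aware client (CloudSolrServer) sends each doc straight to its
# shard leader: exactly one hop. A dumb balancer picks some node first;
# here we assume it always lands on a shard-0 node, so any doc belonging
# to another shard costs a second, forwarding hop.
direct_hops = len(docs)
via_lb_hops = sum(1 + (1 if shard_for(d) != 0 else 0) for d in docs)
```

Under this toy model `via_lb_hops` is always at least `direct_hops` and at most double it, which is the intuition behind indexing directly to leaders.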
Re: commit in solr4 takes a longer time
Hi all,

The openSearcher flag solution worked and gave me a visible improvement in commit time. One thing to note is that while using the SolrJ client we have to call server.commit(false, false), which I was doing incorrectly and hence was not able to see the improvement earlier.

Thanks everyone
Re: Delete from Solr Cloud 4.0 index..
On 5/3/2013 3:22 AM, Annette Newton wrote:

One question Shawn - did you ever get any costings around Zing? Did you trial it?

I never did do a trial. I asked them for a cost and they didn't have an immediate answer; they wanted to do a phone call and get a lot of information about my setup. The price apparently has a lot of variance based on the specific environment, so I didn't pursue it, figuring that the cost would be higher than my superiors are willing to pay.

The only information I could find about the cost of Zing was a very recent Register article that had this to say:

"Azul is similarly cagey about what a supported version of the Zing JVM costs, and only says that Zing costs around what a supported version of an Oracle, IBM, or Red Hat JVM will run enterprises and that it has an annual subscription model for Zing pricing. You can't easily get pricing for Oracle, IBM, or Red Hat JVMs, of course, so the comparison is accurate but perfectly useless."

http://www.theregister.co.uk/2013/04/08/azul_systems_zing_lmax_exchange/

Thanks,
Shawn
Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud
Thanks, that's exactly what I was worried about. If I take your suggested approach of using SolrCloudServer and the feeder learns which shard leader to target, then if the shard leader goes down midway through indexing then I've lost my ability to index. Whereas if I take the route of making all updates via the HAProxy instance then I've got HA but at the cost of performance. This has me wondering if it might be feasable to address each shard with a VIP? Then if the leader of the shard goes down and a replica is elected as the leader it could also take the VIP, so in essence we'd always be sending messages to the leader. Anyone tried anything like this? Cheers, Edd On 3 May 2013 15:22, Furkan KAMACI furkankam...@gmail.com wrote: If you index them with SolrCloudServer, your server will learn where data will go from Zookeeper and send data to that shard leader. However if you use another random processes or something like data will go any of nodes and after that will be routed into the right place within cluster. This extra routing process within cluster may cause unnecessary network traffic and latency for indexing time as well. 2013/5/3 Edd Grant e...@eddgrant.com Hi, No we're actually POSTing them over plain old http. Our feeder process simply points at the HAProxy box and posts merrily away. Cheers, Edd On 3 May 2013 13:17, Furkan KAMACI furkankam...@gmail.com wrote: Do you use CloudSolrServer when you push documnts into SolrCloud to be indexed? 2013/5/3 Edd Grant e...@eddgrant.com Hi all, I have been playing with Solr Cloud recently and am enjoying the distributed indexing capability. At the moment my SolrCloud consists of 2 leaders and 2 replicas which are fronted by an HAProxy instance. I want to maximise performance for indexing and it occurred to me that the model I use for loadbalancing my indexing requests may impact performance. i.e. 
am I likely to see better indexing performance if I stick certain groups of requests to certain nodes vs simply using a round robin approach? I'll be doing some empirical testing to try and figure this out but was wondering if there's any general guidance here? Or if anyone has any experience of particularly good/bad configurations? Many thanks, Edd -- Web: http://www.eddgrant.com Email: e...@eddgrant.com Mobile: +44 (0) 7861 394 543
Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud
On 5/3/2013 8:35 AM, Edd Grant wrote: Thanks, that's exactly what I was worried about. If I take your suggested approach of using SolrCloudServer and the feeder learns which shard leader to target, then if the shard leader goes down midway through indexing then I've lost my ability to index. Whereas if I take the route of making all updates via the HAProxy instance then I've got HA but at the cost of performance. This has me wondering if it might be feasible to address each shard with a VIP? Then if the leader of the shard goes down and a replica is elected as the leader it could also take the VIP, so in essence we'd always be sending messages to the leader. Anyone tried anything like this? CloudSolrServer is part of the SolrJ (Java) API. It incorporates a zookeeper client. To initialize it, you don't tell it about your Solr servers, you give it the same zookeeper host information that you give to Solr when starting in cloud mode. It always knows the current state of the cluster, so if you have a failure, it adjusts so that your queries and updates don't fail. That also means that it will know when servers are added to or removed from the cloud. http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html Thanks, Shawn
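For reference, a minimal SolrJ sketch of the setup Shawn describes — this assumes solr-solrj 4.x on the classpath, and the ZooKeeper ensemble string, collection name, and field values are all placeholders:

```java
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
    public static void main(String[] args) throws Exception {
        // Point the client at ZooKeeper, not at any particular Solr node.
        CloudSolrServer server =
            new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        server.setDefaultCollection("collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        server.add(doc);   // routed via the current cluster state
        server.commit();
        server.shutdown();
    }
}
```

Because the client watches the cluster state in ZooKeeper, a leader failover mid-indexing is handled by re-routing rather than by the feeder losing its target, which addresses Edd's concern without a VIP.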
Re: Performance considerations when using distributed indexing + loadbalancing with Solr cloud
Aah I see - very useful. Thanks! On 3 May 2013 15:49, Shawn Heisey s...@elyograg.org wrote: On 5/3/2013 8:35 AM, Edd Grant wrote: Thanks, that's exactly what I was worried about. If I take your suggested approach of using SolrCloudServer and the feeder learns which shard leader to target, then if the shard leader goes down midway through indexing then I've lost my ability to index. Whereas if I take the route of making all updates via the HAProxy instance then I've got HA but at the cost of performance. This has me wondering if it might be feasible to address each shard with a VIP? Then if the leader of the shard goes down and a replica is elected as the leader it could also take the VIP, so in essence we'd always be sending messages to the leader. Anyone tried anything like this? CloudSolrServer is part of the SolrJ (Java) API. It incorporates a zookeeper client. To initialize it, you don't tell it about your Solr servers, you give it the same zookeeper host information that you give to Solr when starting in cloud mode. It always knows the current state of the cluster, so if you have a failure, it adjusts so that your queries and updates don't fail. That also means that it will know when servers are added to or removed from the cloud. http://lucene.apache.org/solr/4_2_1/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html Thanks, Shawn -- Web: http://www.eddgrant.com Email: e...@eddgrant.com Mobile: +44 (0) 7861 394 543
Re: commit in solr4 takes a longer time
Hi, After using the following config:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoSoftCommit>
    <maxDocs>500</maxDocs>
    <maxTime>1000</maxTime>
  </autoSoftCommit>
  <autoCommit>
    <maxDocs>5000</maxDocs>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

When a commit operation is fired I am getting the following logs: INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Even though openSearcher is false, waitSearcher is true. Can that be set to false too? Will that give a performance improvement, and what is the config for that? -- View this message in context: http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060706.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Does Near Real Time get not supported at SolrCloud?
Yes, absolutely - NRT was a big driver for the leader-to-replica distribution approach in Solr Cloud. On Fri, May 3, 2013 at 1:14 AM, Furkan KAMACI furkankam...@gmail.com wrote: Do soft commits get distributed to the nodes of SolrCloud? 2013/5/3 Otis Gospodnetic otis.gospodne...@gmail.com NRT works with SolrCloud. Otis Solr ElasticSearch Support http://sematext.com/ On May 2, 2013 5:34 AM, Furkan KAMACI furkankam...@gmail.com wrote: Is Near Real Time not supported in SolrCloud? I mean, when a soft commit occurs at a leader, I think it doesn't distribute it to the replicas (because it is not in storage; do indexes in RAM get distributed to replicas too?), so what happens when a search query comes in?
Re: Solr metrics in Codahale metrics and Graphite?
Has anybody tested Ganglia with JMXTrans in a production environment for SolrCloud? 2013/4/26 Dmitry Kan solrexp...@gmail.com Alan, Shawn, If backporting to 3.x is hard, no worries, we don't necessarily require the patch as we are heading to 4.x eventually. It is just much easier within our organization to test on the existing solr 3.4 as there are a few internal dependencies and custom code on top of solr. Also solr upgrades on production systems usually follow a month or so after the upgrade on development systems (requires lots of testing and verification). Nevertheless, it is a good effort to make #solr #graphite friendly, so keep it up! :) Dmitry On Thu, Apr 25, 2013 at 9:29 PM, Shawn Heisey s...@elyograg.org wrote: On 4/25/2013 6:30 AM, Dmitry Kan wrote: We are very much interested in 3.4. On Thu, Apr 25, 2013 at 12:55 PM, Alan Woodward a...@flax.co.uk wrote: This is on top of trunk at the moment, but would be back ported to 4.4 if there was interest. This will be bad news, I'm sorry: All remaining work on 3.x versions happens in the 3.6 branch. This branch is in maintenance mode. It will only get fixes for serious bugs with no workaround. Improvements and new features won't be considered at all. You're welcome to try backporting patches from newer issues. Due to the major differences in the 3x and 4x codebases, the best case scenario is that you'll be facing a very manual task. Some changes can't be backported because they rely on other features only found in 4.x code. Thanks, Shawn
Re: commit in solr4 takes a longer time
Since you have defined the commit options as auto commit for both hard and soft commits, you don't have to explicitly call commit from the SolrJ client. And openSearcher=false for the hard commit will make the hard commit faster, since it only makes sure that recent changes are flushed to disk (for durability) and does not open any searcher. Can you post your log from when the soft commit and hard commit happen? You can read about waitFlush=false and waitSearcher=false, which default to true; see below from the JavaDoc: *waitFlush* block until index changes are flushed to disk *waitSearcher* block until a new searcher is opened and registered as the main query searcher, making the changes visible On Fri, May 3, 2013 at 7:19 AM, vicky desai vicky.de...@germinait.com wrote: Hi All, setting the opensearcher flag worked and it gave me a visible improvement in commit time. One thing to make note of is that while using the solrj client we have to call server.commit(false,false), which i was doing incorrectly and hence was not able to see the improvement earlier. Thanks everyone -- View this message in context: http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060688.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: commit in solr4 takes a longer time
Hi, When an auto commit operation is fired I am getting the following logs: INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} Setting openSearcher to false definitely gave me a lot of performance improvement, but I was wondering if waitSearcher can also be set to false and whether that would give me a performance gain too. -- View this message in context: http://lucene.472066.n3.nabble.com/commit-in-solr4-takes-a-longer-time-tp4060396p4060715.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: any plans to remove int32 limitation on the number of the documents in the index?
My off the cuff thought is that there are significant costs trying to do this that would be paid by 99.999% of setups out there. Also, usually you'll run into other issues (RAM etc) long before you come anywhere close to 2^31 docs. Lucene/Solr often allocates int[maxDoc] for various operations. When maxDoc approaches 2^31, well, memory goes through the roof. Now consider allocating longs instead... which is a long way of saying that I don't really think anyone's going to be working on this any time soon, especially when SolrCloud removes a LOT of the pain/complexity (from a user perspective anyway) from going to a sharded setup... FWIW, Erick On Thu, May 2, 2013 at 1:17 PM, Valery Giner valgi...@research.att.com wrote: Otis, The documents themselves are relatively small, tens of fields, only a few of them could be up to a hundred bytes. Linux servers with relatively large RAM (256), Minutes on the searches are fine for our purposes, adding a few tens of millions of records in tens of minutes is also fine. We had to do some simple tricks to keep indexing up to speed but nothing too fancy. Moving to sharding adds a layer of complexity which we don't really need because of the above, ... and adding complexity may result in lower reliability :) Thanks, Val On 05/02/2013 03:41 PM, Otis Gospodnetic wrote: Val, Haven't seen this mentioned in a while... I'm curious...what sort of index, queries, hardware, and latency requirements do you have? Otis Solr ElasticSearch Support http://sematext.com/ On May 1, 2013 4:36 PM, Valery Giner valgi...@research.att.com wrote: Dear Solr Developers, I've been unable to find an answer to the question in the subject line of this e-mail, except a vague one. We need to be able to index over 2bln+ documents. We were doing well without sharding until the number of docs hit the limit (2bln+). The performance was satisfactory for the queries, updates and indexing of new documents. 
That is, except for the need to go around the int32 limit, we don't really have a need for setting up distributed solr. I wonder whether someone on the solr team could tell us when, and in what version of solr, we could expect the limit to be removed. I hope this question may be of interest to someone else :) -- Thanks, Val
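To put rough numbers on Erick's point about int[maxDoc] allocations — a back-of-the-envelope sketch, not a statement about any specific Lucene data structure:

```java
public class MaxDocMemory {
    public static void main(String[] args) {
        long maxDoc = 1L << 31;           // ~2.1 billion documents
        long intArrayBytes = maxDoc * 4;  // a single int[maxDoc]
        long longArrayBytes = maxDoc * 8; // the same array widened to long

        // One such array already costs 8 GiB; widening to long doubles it,
        // and Lucene/Solr may hold several per-document arrays at once.
        System.out.println(intArrayBytes / (1L << 30) + " GiB for int[maxDoc]");
        System.out.println(longArrayBytes / (1L << 30) + " GiB for long[maxDoc]");
    }
}
```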
Re: transientCacheSize doesn't seem to have any effect, except on startup
The cores aren't loaded (or at least shouldn't be) for getting the status. The _names_ of the cores should be returned, but those are (supposed) to be retrieved from a list rather than loaded cores. So are you sure that's not what you are seeing? How are you determining whether the cores are actually loaded or not? That said, it's perfectly possible that the status command is doing something we didn't anticipate, but I took a quick look at the code (got to rush to a plane) and CoreAdminHandler _appears_ to be just returning whatever info it can about an unloaded core for status. I _think_ you'll get more info if the core has ever been loaded though, even if it's been removed from the transient cache. Ditto for the create action. So let's figure out whether you're really seeing loaded cores or not, and then raise a JIRA if so... Thanks for reporting! Erick On Thu, May 2, 2013 at 1:27 PM, didier deshommes dfdes...@gmail.com wrote: Hi, I've been very interested in the transient core feature of solr to manage a large number of cores. I'm especially interested in this use case, that the wiki lists at http://wiki.apache.org/solr/LotsOfCores (looks to be down now): loadOnStartup=false transient=true: This is really the use-case. There are a large number of cores in your system that are short-duration use. You want Solr to load them as necessary, but unload them when the cache gets full on an LRU basis. I'm creating 10 transient cores via core admin like so $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=CREATE&name=new_core2&instanceDir=collection1/&dataDir=new_core2&transient=true&loadOnStartup=false" and have transientCacheSize=2 in my solr.xml file, which I take means I should have at most 2 transient cores loaded at any time. 
The problem is that these cores are still loaded when I ask solr to list cores: $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=status" From the explanation in the wiki, it looks like solr would manage loading and unloading transient cores for me without my having to worry about them, but this is not what's happening. The situation is different when I restart solr; it does the right thing by loading at most the number of cores set by transientCacheSize. When I add more cores, the old behavior happens again, where all created transient cores are loaded in solr. I'm using the development branch lucene_solr_4_3 to run my example. I can open a jira if need be.
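For context, the transientCacheSize setting didier mentions lives on the cores element of the old-style solr.xml; a minimal sketch (attribute values are examples):

```xml
<solr persistent="true">
  <!-- Cores created with transient="true" are kept in an LRU cache of
       at most transientCacheSize loaded cores; older ones are unloaded
       as new transient cores are opened. -->
  <cores adminPath="/admin/cores" transientCacheSize="2">
  </cores>
</solr>
```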
Re: socket write error
After some more debugging I have found out that one of the requests had a size of 4.4MB. The default maxPostSize in tomcat6 is 2MB (http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html). Changing that to 10MB has greatly improved the situation on the solr side. Dmitry On Fri, May 3, 2013 at 9:55 AM, Dmitry Kan solrexp...@gmail.com wrote: Digging in further, found this in the HttpCommComponent class:

[code]
static {
  MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
  mgr.getParams().setDefaultMaxConnectionsPerHost(20);
  mgr.getParams().setMaxTotalConnections(1);
  mgr.getParams().setConnectionTimeout(SearchHandler.connectionTimeout);
  mgr.getParams().setSoTimeout(SearchHandler.soTimeout);
  // mgr.getParams().setStaleCheckingEnabled(false);
  client = new HttpClient(mgr);
}
[/code]

Could the value set by setDefaultMaxConnectionsPerHost(20) be too small for 80+ shards returning results to the router? Dmitry On Fri, May 3, 2013 at 6:50 AM, Dmitry Kan solrexp...@gmail.com wrote: Hi, thanks. Solr 3.4. There is a POST request everywhere, between client and router, router and shards. Do you do faceting across all shards? How many documents approx do you have? On 2 May 2013 22:02, Patanachai Tangchaisin patanachai.tangchai...@wizecommerce.com wrote: Hi, First, which version of Solr are you using? I also have 60+ shards on Solr 4.2.1 and it doesn't seem to be a problem for me. - Make sure you use POST to send a query to Solr. - 'connection reset by peer' from the client can indicate that there is something wrong with the server, e.g. the server closes a connection etc. -- Patanachai On 05/02/2013 05:05 AM, Dmitry Kan wrote: After some searching around, I see this: http://search-lucene.com/m/ErEZUl7P5f2/%2522socket+write+error%2522subj=Long+list+of+shards+breaks+solrj+query Seems like this has happened in the past with large numbers of shards. 
To make it clear: the distributed search works with 20 shards. On Thu, May 2, 2013 at 1:57 PM, Dmitry Kan solrexp...@gmail.com wrote: Hi guys! We have a solr router and shards. I see this in the jetty log on the router: May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.SocketException) caught when processing request: Connection reset by peer: socket write error and then: May 02, 2013 1:30:22 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request followed by an exception about Internal Server Error any ideas why this happens? We run 80+ shards distributed across several servers. The router runs on its own node. Is there anything in particular I should be looking into wrt ubuntu socket settings? Is this a known issue for solr's distributed search from the past? Thanks, Dmitry CONFIDENTIALITY NOTICE == This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.
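The maxPostSize change Dmitry describes would look roughly like this in Tomcat's conf/server.xml — the port and protocol are examples, and maxPostSize is given in bytes:

```xml
<!-- Raise the POST body limit on the connector Solr listens on.
     10485760 bytes = 10MB; the Tomcat 6 default is 2MB. -->
<Connector port="8080" protocol="HTTP/1.1" maxPostSize="10485760" />
```

The same attribute exists on the AJP connector if Tomcat is fronted by a web server over AJP.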
Re: commit in solr4 takes a longer time
On 5/3/2013 9:28 AM, vicky desai wrote: Hi, When a auto commit operation is fired I am getting the following logs INFO: start commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false} setting the openSearcher to false definetly gave me a lot of performance improvement but was wondering if waitSearcher can also be set to false and will that give me a performance raise too. The openSearcher parameter changes what actually happens when you do a hard commit, so using it can change your performance. The wait parameters are for client software that does commits. The idea is that if you don't want your client to wait for the commit to finish, you use these options so that the commit API call will return quickly and the server will finish the commit in the background. It doesn't change what the commit does, it just allows the client to start doing other things. With auto commits, the client and the server are both Solr, and everything is multi-threaded. The wait parameters have no meaning, because there's no user software that has to wait. There would be no performance gain from turning them off. Side note: The waitFlush parameter was completely removed in Solr 4.0. Thanks, Shawn
Re: The HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs) method is not atomic.
bq: Is there a way to commit multiple documents/beans in a transaction/together in a way that it succeeds completely or fails completely? Not that I know of. I've seen various divide and conquer strategies to identify _which_ document failed, but the general process is usually to re-index the docs in smaller chunks until you isolate the offending one, and trust that re-indexing documents will be OK since it overwrites the earlier copy. Best Erick On Thu, May 2, 2013 at 7:53 PM, mark12345 marks1900-pos...@yahoo.com.au wrote: One thing I noticed is that while the HttpSolrServer add(SolrInputDocument doc) method is atomic (either a bean is added or an exception is thrown), the HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs) method is not atomic. Question: Is there a way to commit multiple documents/beans in a transaction/together in a way that it succeeds completely or fails completely? Quick outline of what I did to highlight that a call to the HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs) method is not atomic. 1. Create 5 documents, comprising 4 valid documents (documents 1, 2, 4, 5) and 1 document with an issue, document 3. 2. Call HttpSolrServer add(Collection&lt;SolrInputDocument&gt; docs), which threw a SolrException. 3. Call HttpSolrServer commit(). 4. Discovered that 2 out of 5 documents (documents 1 and 2) were still committed. -- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-Solr-Two-Phase-Commit-tp4060399p4060590.html Sent from the Solr - User mailing list archive at Nabble.com.
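Erick's divide-and-conquer strategy can be sketched generically. Here the hypothetical addBatch callback stands in for HttpSolrServer.add plus its exception handling — this is not SolrJ API, just the bisection logic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class BatchBisect {
    /**
     * Retries ever-smaller sub-batches until the failing documents are
     * isolated. addBatch returns false when a batch is rejected.
     * Returns the individual items that fail on their own.
     */
    static <T> List<T> isolateFailures(List<T> docs, Predicate<List<T>> addBatch) {
        List<T> bad = new ArrayList<>();
        if (docs.isEmpty() || addBatch.test(docs)) {
            return bad;                 // whole batch succeeded
        }
        if (docs.size() == 1) {
            bad.add(docs.get(0));       // isolated an offending document
            return bad;
        }
        int mid = docs.size() / 2;      // split and recurse on both halves
        bad.addAll(isolateFailures(docs.subList(0, mid), addBatch));
        bad.addAll(isolateFailures(docs.subList(mid, docs.size()), addBatch));
        return bad;
    }

    public static void main(String[] args) {
        // Simulated indexer: any batch containing "doc3" fails.
        Predicate<List<String>> indexer = batch -> !batch.contains("doc3");
        List<String> docs = List.of("doc1", "doc2", "doc3", "doc4", "doc5");
        System.out.println(isolateFailures(docs, indexer)); // prints [doc3]
    }
}
```

Note that with a real Solr backend, documents in a rejected batch may have been partially indexed (as mark12345 observed), which is why this approach leans on re-indexing being idempotent per uniqueKey.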
Re: Query across multiple shards - key fields have different names
I don't think you can. Problem is that the pseudo join capability can work cross core, which means with two separate cores, but last I knew distributed joins aren't supported, which is what you're asking for. Really think about flattening your data if at all possible. Best Erick On Thu, May 2, 2013 at 11:03 PM, Benjamin Ryan benjamin.r...@manchester.ac.uk wrote: Hi, Sorry for the basic question - I can't get to the WiKi to find the answer. Version Solr 3.3.0 I have two separate indexes (currently in two cores but can be moved to shards) One core holds metadata about educational resources, the other usage statistics They have a common value named id in one core and search.resourceid in the other core. How can I construct a shard query (once I have moved one of the cores to a different node) so that I can effectively get the statistics for each educational resource grouped by each resource? This is an offline reporting job that needs to list the usage events for educational resources over a time period (the usage events have a date/time field). Regards, Ben -- Dr Ben Ryan Jorum Technical Manager 5.12 Roscoe Building The University of Manchester Oxford Road Manchester M13 9PL Tel: 0160 275 6039 E-mail: benjamin.r...@manchester.ac.ukhttps://outlook.manchester.ac.uk/owa/redir.aspx?C=b28b5bdd1a91425abf8e32748c93f487URL=mailto%3abenjamin.ryan%40manchester.ac.uk --
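For anyone reading this on Solr 4.0 or later (the join query parser does not exist in 3.3), the cross-core pseudo join Erick mentions would look roughly like the request below, run against the metadata core. The core name "usage" and the date field "eventdate" are invented for illustration; id and search.resourceid come from Ben's message:

```
http://localhost:8983/solr/metadata/select
  ?q={!join fromIndex=usage from=search.resourceid to=id}eventdate:[2013-01-01T00:00:00Z TO 2013-03-31T23:59:59Z]
```

The inner query selects usage events in the period on the "usage" core, and the join maps their search.resourceid values onto the id field of the metadata core. It runs on a single node only — both cores must live in the same Solr instance, which is exactly the distributed limitation Erick describes.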
Re: Duplicated Documents Across shards
What version of Solr? The custom routing stuff is quite new so I'm guessing 4x? But this shouldn't be happening. The actual index data for the shards should be in separate directories, they just happen to be on the same physical machine. Try querying each one with distrib=false to see the counts from single shards, that may shed some light on this. It vaguely sounds like you have indexed the same document to both shards somehow... Best Erick On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz mitxin...@gmail.com wrote: Hi, We currently have a solrCloud implementation running 5 shards on 3 physical machines, so the first machine has shard number 1, the second machine shards 2 & 4, and the third shards 3 & 5. We noticed that while querying, numFoundDocs decreased when we increased the start param. After some investigation we found that the documents in shards 2 to 5 were being counted twice. Querying shard 2 will give you back the results for shards 2 & 4, and the same thing for shards 3 & 5. Our guess is that the physical index for both shards 2 & 4 is shared, so the shards don't know which part of it is for each one. The uniqueKey is correctly defined, and we have tried using a shard prefix (shard1!docID). Is there any way to solve this problem when a unique physical machine shares shards? Is it a real problem or does it just affect facet numResults? Thanks Iker -- /** @author imartinez*/ Person me = *new* Developer(); me.setName(*Iker Mtz de Apellaniz Anzuola*); me.setTwit(@mitxino77 https://twitter.com/mitxino77); me.setLocations({St Cugat, Barcelona, Kanpezu, Euskadi, *, World]}); me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*}); me.setWebs({*urbasaabentura.com, ikertxef.com*}); *return* me;
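Erick's distrib=false check can be run directly against each core; host and core names below are placeholders:

```
# Count the docs one shard sees on its own, without fanning out
# to the rest of the cluster (rows=0 returns only numFound).
curl "http://machine2:8983/solr/shard2_core/select?q=*:*&rows=0&distrib=false"
```

If the per-shard counts sum to more than the collection-wide count, the same documents are indexed in more than one shard.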
Re: Solr 4 reload failed core
It seems odd, but consider create rather than reload. Create will load up an existing core, think of it as create in memory rather than create on disk for the case where there's already an index. Best Erick On Fri, May 3, 2013 at 6:27 AM, Peter Kirk p...@alpha-solutions.dk wrote: Hi I have a multi-core installation, with 2 cores. Sometimes, when Solr starts up, one of the cores fails (due to an extension to Solr I have, which is waiting on an external service which has yet to initialise). In previous versions of Solr, I could subsequently issue a RELOAD to this core, even though it was in a fail state, and it would reload and start up. Now it seems with Solr 4, I cannot issue a RELOAD to a core which has failed. Is this the case? How can I get Solr to start a core which failed on initial start up? Thanks, Peter
Re: Delete from Solr Cloud 4.0 index..
Annette: Be a little careful with the index size savings, they really don't mean much for _searching_. The stored field compression significantly reduces the size on disk, but only for the stored data, which is only accessed when returning the top N docs. In terms of how many docs you can fit on your hardware, it's pretty irrelevant. The *.fdt and *.fdx files in your index directory contain the stored data, so when looking at the effects of various options (including compression), you can pretty much ignore these files. FWIW, Erick On Fri, May 3, 2013 at 2:03 AM, Annette Newton annette.new...@servicetick.com wrote: Thanks Shawn. I have played around with Soft Commits before and didn't seem to have any improvement, but with the current load testing I am doing I will give it another go. I have researched docValues and came across the fact that it would increase the index size. With the upgrade to 4.2.1 the index size has reduced by approx 33% which is pleasing and I don't really want to lose that saving. We do use the facet.enum method - which works really well, but I will verify that we are using that in every instance, we have numerous developers working on the product and maybe one or two have slipped through. Right from the first I upped the zkClientTimeout to 30 as I wanted to give extra time for any network blips that we experience on AWS. We only seem to drop communication on a full garbage collection though. I am coming to the conclusion that we need to have more shards to cope with the writes, so I will play around with adding more shards and see how I go. I appreciate you having a look over our setup and the advice. Thanks again. Netty. On 2 May 2013 23:17, Shawn Heisey s...@elyograg.org wrote: On 5/2/2013 4:24 AM, Annette Newton wrote: Hi Shawn, Thanks so much for your response. We basically are very write intensive and write throughput is pretty essential to our product. Reads are sporadic and actually function really well. 
We write on average (at the moment) 8-12 batches of 35 documents per minute. But we really will be looking to write more in the future, so need to work out scaling of solr and how to cope with more volume. Schema (I have changed the names): http://pastebin.com/x1ry7ieW Config: http://pastebin.com/pqjTCa7L This is very clean. There's probably more you could remove/comment, but generally speaking I couldn't find any glaring issues. In particular, you have disabled autowarming, which is a major contributor to commit speed problems. The first thing I think I'd try is increasing zkClientTimeout to 30 or 60 seconds. You can use the startup commandline or solr.xml, I would probably use the latter. Here's a solr.xml fragment that uses a system property or a 15 second default:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores"
         zkClientTimeout="${zkClientTimeout:15000}"
         hostPort="${jetty.port:}" hostContext="solr">

General thoughts, these changes might not help this particular issue: You've got autoCommit with openSearcher=true. This is a hard commit. If it were me, I would set that up with openSearcher=false and either do explicit soft commits from my application or set up autoSoftCommit with a shorter timeframe than autoCommit. This might simply be a scaling issue, where you'll need to spread the load wider than four shards. I know that there are financial considerations with that, and they might not be small, so let's leave that alone for now. The memory problems might be a symptom/cause of the scaling issue I just mentioned. You said you're using facets, which can be a real memory hog even with only a few of them. Have you tried facet.method=enum to see how it performs? You'd need to switch to it exclusively, never go with the default of fc. You could put that in the defaults or invariants section of your request handler(s). 
Another way to reduce memory usage for facets is to use disk-based docValues on version 4.2 or later for the facet fields, but this will increase your index size, and your index is already quite large. Depending on your index contents, the increase may be small or large. Something to just mention: It looks like your solrconfig.xml has hard-coded absolute paths for dataDir and updateLog. This is fine if you'll only ever have one core/collection on each server, but it'll be a disaster if you have multiples. I could be wrong about how these get interpreted in SolrCloud -- they might actually be relative despite starting with a slash. Thanks, Shawn -- Annette Newton Database Administrator ServiceTick Ltd T:+44(0)1603 618326 Seebohm House, 2-4 Queen Street, Norwich, England NR2 4SQ www.servicetick.com *www.sessioncam.com* -- *This message is confidential and is intended to be read solely by the addressee. The contents should not be disclosed to any other
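Shawn's suggestion of forcing facet.method=enum through the request handler would look roughly like this in solrconfig.xml — the handler name is an example, and using invariants (rather than defaults) means clients cannot override it back to fc:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="invariants">
    <!-- Force the enum facet method for every request through this
         handler, trading fieldcache memory for filter-based counting. -->
    <str name="facet.method">enum</str>
  </lst>
</requestHandler>
```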
Re: Configure Shingle Filter to ignore ngrams made of tokens with same start and end
In short, no. I don't think you want to use the shingle filter on a token stream that has multiple tokens at the same position, otherwise, you will get confused suggestions, as you've encountered. -- Jack Krupansky -Original Message- From: Rounak Jain Sent: Friday, May 03, 2013 7:34 AM To: solr-user@lucene.apache.org Subject: Configure Shingle Filter to ignore ngrams made of tokens with same start and end Hello, I was using the Shingle Filter with the Suggester to implement an autosuggest dropdown. The field I'm using with the shingle filter has a WordDelimiterFilter with preserveOriginal=1 to tokenize women's as women's and womens. Because of this, when the shingle filter is generating word ngrams, apart from the expected tokens, there's also a women's womens token. I wanted to know if there's any way to configure ShingleFilter so that it ignores tokens with same start and end values. Thanks, Rounak
SV: Solr 4 reload failed core
Thanks - I had just found the CREATE command, and I think that's the easiest path for us to take. It will actually function basically the way our reload workaround works now. From: Erick Erickson [erickerick...@gmail.com] Sent: 3 May 2013 19:22 To: solr-user@lucene.apache.org Subject: Re: Solr 4 reload failed core It seems odd, but consider create rather than reload. Create will load up an existing core, think of it as create in memory rather than create on disk for the case where there's already an index. Best Erick On Fri, May 3, 2013 at 6:27 AM, Peter Kirk p...@alpha-solutions.dk wrote: Hi I have a multi-core installation, with 2 cores. Sometimes, when Solr starts up, one of the cores fails (due to an extension to Solr I have, which is waiting on an external service which has yet to initialise). In previous versions of Solr, I could subsequently issue a RELOAD to this core, even though it was in a fail state, and it would reload and start up. Now it seems with Solr 4, I cannot issue a RELOAD to a core which has failed. Is this the case? How can I get Solr to start a core which failed on initial start up? Thanks, Peter
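A sketch of the workaround being discussed, assuming a core named core1 whose instanceDir already contains config and a built index (names and host are placeholders):

```
# CREATE against an existing instanceDir loads the core into memory;
# it does not wipe the index, so it serves as a "reload" for a core
# that failed during startup.
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=core1&instanceDir=core1"
```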
Re: Configure Shingle Filter to ignore ngrams made of tokens with same start and end
The shingle filter should respect positions. If it doesn't, that is worth filing a bug so we know about it. wunder On May 3, 2013, at 10:50 AM, Jack Krupansky wrote: In short, no. I don't think you want to use the shingle filter on a token stream that has multiple tokens at the same position, otherwise, you will get confused suggestions, as you've encountered. -- Jack Krupansky -Original Message- From: Rounak Jain Sent: Friday, May 03, 2013 7:34 AM To: solr-user@lucene.apache.org Subject: Configure Shingle Filter to ignore ngrams made of tokens with same start and end Hello, I was using the Shingle Filter with the Suggester to implement an autosuggest dropdown. The field I'm using with the shingle filter has a WordDelimiterFilter with preserveOriginal=1 to tokenize women's as women's and womens. Because of this, when the shingle filter is generating word ngrams, apart from the expected tokens, there's also a women's womens token. I wanted to know if there's any way to configure ShingleFilter so that it ignores tokens with same start and end values. Thanks, Rounak
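An illustrative analyzer chain showing the interaction under discussion — the tokenizer choice and shingle settings are invented, not taken from Rounak's schema. With preserveOriginal="1", WordDelimiterFilter can emit "women's" and "womens" at the same position, and the shingle stage then has two same-position variants to combine:

```xml
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- preserveOriginal="1" keeps "women's" alongside the split/joined
       forms such as "womens", stacked at the same position -->
  <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
  <filter class="solr.ShingleFilterFactory" maxShingleSize="3"/>
</analyzer>
```

Checking this field type in the Analysis screen of the admin UI shows the token positions each stage produces, which is the evidence Walter's bug report would need.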
custom tokenizer error
I am using a custom Tokenizer, as part of the analysis chain, for a Solr (4.2.1) field. On trying to index, Solr throws a NullPointerException. The unit tests for the custom tokenizer work fine. Any ideas as to what it is that I am missing/doing incorrectly will be appreciated. Here is the relevant schema.xml excerpt:

  <fieldType name="negated" class="solr.TextField" omitNorms="true">
    <analyzer type="index">
      <tokenizer class="some.other.solr.analysis.EmbeddedPunctuationTokenizer$Factory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
    </analyzer>
  </fieldType>

Here are the relevant pieces of the Tokenizer:

/**
 * Intercepts each token produced by {@link StandardTokenizer#incrementToken()}
 * and checks for the presence of a colon or period. If found, splits the token
 * on the punctuation mark and adjusts the term and offset attributes of the
 * underlying {@link TokenStream} to create additional tokens.
 */
public class EmbeddedPunctuationTokenizer extends Tokenizer {

  private static final Pattern PUNCTUATION_SYMBOLS = Pattern.compile("[:.]");

  private StandardTokenizer baseTokenizer;
  private CharTermAttribute termAttr;
  private OffsetAttribute offsetAttr;
  private /*@Nullable*/ String tokenAfterPunctuation = null;
  private int currentOffset = 0;

  public EmbeddedPunctuationTokenizer(final Reader reader) {
    super(reader);
    baseTokenizer = new StandardTokenizer(Version.MINIMUM_LUCENE_VERSION, reader);
    // Two TokenStreams are in play here: the one underlying the current
    // instance and the one underlying the StandardTokenizer. The attribute
    // instances must be associated with both.
    termAttr = baseTokenizer.addAttribute(CharTermAttribute.class);
    offsetAttr = baseTokenizer.addAttribute(OffsetAttribute.class);
    this.addAttributeImpl((CharTermAttributeImpl) termAttr);
    this.addAttributeImpl((OffsetAttributeImpl) offsetAttr);
  }

  @Override
  public void end() throws IOException {
    baseTokenizer.end();
    super.end();
  }

  @Override
  public void close() throws IOException {
    baseTokenizer.close();
    super.close();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    baseTokenizer.reset();
    currentOffset = 0;
    tokenAfterPunctuation = null;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    clearAttributes();
    if (tokenAfterPunctuation != null) {
      // Do not advance the underlying TokenStream if the previous call
      // found an embedded punctuation mark and set aside the substring
      // that follows it. Set the attributes instead from the substring,
      // bearing in mind that the substring could contain more embedded
      // punctuation marks.
      adjustAttributes(tokenAfterPunctuation);
    } else if (baseTokenizer.incrementToken()) {
      // No remaining substring from a token with embedded punctuation: save
      // the starting offset reported by the base tokenizer as the current
      // offset, then proceed with the analysis of the token it returned.
      currentOffset = offsetAttr.startOffset();
      adjustAttributes(termAttr.toString());
    } else {
      // No more tokens in the underlying token stream: return false.
      return false;
    }
    return true;
  }

  private void adjustAttributes(final String token) {
    Matcher m = PUNCTUATION_SYMBOLS.matcher(token);
    if (m.find()) {
      int index = m.start();
      offsetAttr.setOffset(currentOffset, currentOffset + index);
      termAttr.copyBuffer(token.toCharArray(), 0, index);
      tokenAfterPunctuation = token.substring(index + 1);
      // Given that the incoming token had an embedded punctuation mark,
      // the starting offset for the substring following the punctuation
      // mark will be 1 beyond the end of the current token, which is the
      // substring preceding the embedded punctuation mark.
      currentOffset = offsetAttr.endOffset() + 1;
    } else if (tokenAfterPunctuation != null) {
      // Last remaining substring following a previously detected embedded
      // punctuation mark: adjust attributes based on its values.
      int length = tokenAfterPunctuation.length();
      termAttr.copyBuffer(tokenAfterPunctuation.toCharArray(), 0, length);
      offsetAttr.setOffset(currentOffset, currentOffset + length);
      tokenAfterPunctuation = null;
    }
    // Implied else: neither is true, so the attributes from the base
    // tokenizer need no adjustments.
  }
}

Solr throws the following error, in the 'else if' block of #incrementToken:

2013-04-29 14:19:48,920 [http-thread-pool-8080(3)] ERROR org.apache.solr.core.SolrCore - java.lang.NullPointerException
  at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:923)
  at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:1133)
  at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:180)
  at some.other.solr.analysis.EmbeddedPunctuationTokenizer.incrementToken(EmbeddedPunctuationTokenizer.java:83)
  at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
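An NPE inside zzRefill is consistent with the wrapped StandardTokenizer reading from a stale Reader: Solr reuses Tokenizer instances and swaps a new input in via setReader()/reset(), but baseTokenizer above is bound once, in the constructor, to the original Reader — which also explains why standalone unit tests that build a fresh tokenizer per input pass. A hedged, untested sketch of one possible fix against Lucene 4.2.x, assuming Tokenizer's protected `input` field and `setReader()` behave as described in its javadoc:

```
@Override
public void reset() throws IOException {
  super.reset();
  // Re-bind the wrapped tokenizer to this tokenizer's *current* input,
  // which Solr replaces between uses. Without this, baseTokenizer keeps
  // the Reader captured at construction time, which may be exhausted or
  // closed by the time the analysis chain is reused.
  baseTokenizer.setReader(input);
  baseTokenizer.reset();
  currentOffset = 0;
  tokenAfterPunctuation = null;
}
```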
Re: transientCacheSize doesn't seem to have any effect, except on startup
On Fri, May 3, 2013 at 11:18 AM, Erick Erickson erickerick...@gmail.com wrote: The cores aren't loaded (or at least shouldn't be) for getting the status. The _names_ of the cores should be returned, but those are (supposed) to be retrieved from a list rather than loaded cores. So are you sure that's not what you are seeing? How are you determining whether the cores are actually loaded or not?
I'm looking at the output of: $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=status" Cores that are loaded have a startTime and upTime value. Cores that are unloaded don't appear in the output at all. For example, I created 3 transient cores with transientCacheSize=2. When I asked for a list of all cores, all 3 cores were returned. I explicitly unloaded 1 core and got back 2 cores when I asked for the list again. It would be nice if cores had an isTransient and an isCurrentlyLoaded value so that one could see exactly which cores are loaded.
That said, it's perfectly possible that the status command is doing something we didn't anticipate, but I took a quick look at the code (got to rush to a plane) and CoreAdminHandler _appears_ to be just returning whatever info it can about an unloaded core for status. I _think_ you'll get more info if the core has ever been loaded, though, even if it's been removed from the transient cache. Ditto for the create action. So let's figure out whether you're really seeing loaded cores or not, and then raise a JIRA if so... Thanks for reporting! Erick
On Thu, May 2, 2013 at 1:27 PM, didier deshommes dfdes...@gmail.com wrote: Hi, I've been very interested in the transient core feature of solr to manage a large number of cores. I'm especially interested in this use case, which the wiki lists at http://wiki.apache.org/solr/LotsOfCores (looks to be down now): loadOnStartup=false transient=true: This is really the use-case. There are a large number of cores in your system that are short-duration use.
You want Solr to load them as necessary, but unload them when the cache gets full, on an LRU basis. I'm creating 10 transient cores via core admin like so: $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=CREATE&name=new_core2&instanceDir=collection1/&dataDir=new_core2&transient=true&loadOnStartup=false" and have transientCacheSize=2 in my solr.xml file, which I take to mean I should have at most 2 transient cores loaded at any time. The problem is that these cores are still loaded when I ask solr to list cores: $ curl "http://localhost:8983/solr/admin/cores?wt=json&action=status" From the explanation in the wiki, it looks like solr would manage loading and unloading transient cores for me without my having to worry about them, but this is not what's happening. The situation is different when I restart solr; it does the right thing by loading at most the number of cores set by transientCacheSize. When I add more cores, the old behavior happens again, where all created transient cores are loaded in solr. I'm using the development branch lucene_solr_4_3 to run my example. I can open a jira if need be.
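For comparison, a minimal sketch of the legacy-style solr.xml that the LotsOfCores setup expects — attribute names as documented on that wiki page, core names here purely illustrative:

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores" transientCacheSize="2">
    <!-- transient + loadOnStartup=false: loaded on demand, evicted LRU -->
    <core name="core1" instanceDir="core1" transient="true" loadOnStartup="false"/>
    <core name="core2" instanceDir="core2" transient="true" loadOnStartup="false"/>
    <core name="core3" instanceDir="core3" transient="true" loadOnStartup="false"/>
  </cores>
</solr>
```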
disaster recovery scenarios for solr cloud and zookeeper
Hi, Solr 4.x is architected with a dependency on ZooKeeper, and ZooKeeper is expected to have very high (perfect?) availability. With 3 or 5 ZooKeeper nodes, it is possible to manage ZooKeeper maintenance and keep online availability close to 100%. But what is the worst case for Solr if, for some unanticipated reason, all ZooKeeper nodes go offline? Could someone comment on a couple of possible scenarios in which all ZK nodes are offline? What would happen to Solr, and what would be needed to recover in each case? 1) brief interruption, say 2 minutes; 2) longer downtime, say 60 min. Thanks, Dennis
Re: Duplicated Documents Across shards
We are currently using version 4.2. We have made tests with a single document, and it gives us a 2-document count. But if we force it to shard onto the first machine, the one with a single shard, the count gives us 1 document. I've tried using the distrib=false parameter; it gives us no duplicate documents, but the same document appears to be in two different shards. Finally, about the separate directories: we have only one directory for the data in each physical machine and collection, and I don't see any subfolder for the different shards. Is it possible that we have something wrong in the dataDir configuration for using multiple shards on one machine?

  <dataDir>${solr.data.dir:}</dataDir>
  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>

2013/5/3 Erick Erickson erickerick...@gmail.com: What version of Solr? The custom routing stuff is quite new so I'm guessing 4x? But this shouldn't be happening. The actual index data for the shards should be in separate directories; they just happen to be on the same physical machine. Try querying each one with distrib=false to see the counts from single shards, that may shed some light on this. It vaguely sounds like you have indexed the same document to both shards somehow... Best Erick
On Fri, May 3, 2013 at 5:28 AM, Iker Mtnz. Apellaniz mitxin...@gmail.com wrote: Hi, We currently have a solrCloud implementation running 5 shards in 3 physical machines, so the first machine holds shard number 1, the second machine shards 2 & 4, and the third shards 3 & 5. We noticed while querying that numFound decreased when we increased the start param. After some investigation we found that the documents in shards 2 to 5 were being counted twice. Querying shard 2 will give you back the results for shards 2 & 4, and the same thing happens for shards 3 & 5. Our guess is that the physical index for shards 2 & 4 is shared, so the shards don't know which part of it belongs to each one.
The uniqueKey is correctly defined, and we have tried using a shard prefix (shard1!docID). Is there any way to solve this problem when a single physical machine hosts several shards? Is it a real problem, or does it just affect facet numResults? Thanks, Iker
-- /** @author imartinez*/ Person me = *new* Developer(); me.setName(*Iker Mtz de Apellaniz Anzuola*); me.setTwit(@mitxino77 https://twitter.com/mitxino77); me.setLocations({St Cugat, Barcelona, Kanpezu, Euskadi, *, World]}); me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*}); me.setWebs({*urbasaabentura.com, ikertxef.com*}); *return* me;
-- /** @author imartinez*/ Person me = *new* Developer(); me.setName(*Iker Mtz de Apellaniz Anzuola*); me.setTwit(@mitxino77 https://twitter.com/mitxino77); me.setLocations({St Cugat, Barcelona, Kanpezu, Euskadi, *, World]}); me.setSkills({*SoftwareDeveloper, Curious, AmateurCook*}); *return* me;
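If ${solr.data.dir} resolves to the same absolute path for every core on a host, all shards on that host will share one physical index, which would produce exactly this kind of double counting. A hedged sketch of per-core dataDir entries in a legacy-style solr.xml — paths and core names here are illustrative, not taken from this setup:

```xml
<cores adminPath="/admin/cores">
  <!-- each shard hosted on this machine gets its own index directory -->
  <core name="shard2" instanceDir="shard2" dataDir="/var/solr/shard2/data"/>
  <core name="shard4" instanceDir="shard4" dataDir="/var/solr/shard4/data"/>
</cores>
```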
Re: Configure Shingle Filter to ignore ngrams made of tokens with same start and end
An issue exists for this problem: https://issues.apache.org/jira/browse/LUCENE-3475 On May 3, 2013, at 11:00 AM, Walter Underwood wun...@wunderwood.org wrote: The shingle filter should respect positions. If it doesn't, that is worth filing a bug so we know about it. wunder On May 3, 2013, at 10:50 AM, Jack Krupansky wrote: In short, no. I don't think you want to use the shingle filter on a token stream that has multiple tokens at the same position, otherwise, you will get confused suggestions, as you've encountered. -- Jack Krupansky -Original Message- From: Rounak Jain Sent: Friday, May 03, 2013 7:34 AM To: solr-user@lucene.apache.org Subject: Configure Shingle Filter to ignore ngrams made of tokens with same start and end Hello, I was using Shingle Fitler with Suggester to implement an autosuggest dropdown. The field I'm using with shingle filter has a worddelimiter with preserveoriginal=1 to tokenize women's as women's and womens. Because of this, when shingle filter is generating word ngrams, apart from the expected tokens, there's also a women's womens tokens. I wanted to know if there's any way to configure ShingleFilter so that it ignores tokens with same start and end values. Thanks, Rounak
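Until something like LUCENE-3475 lands, one workaround is to feed the suggester from a field whose analysis chain emits only a single token per position, e.g. by dropping preserveOriginal so WordDelimiterFilter never stacks tokens. A hedged sketch, with the field type name illustrative and parameters to be tuned per application:

```xml
<fieldType name="suggest_shingle" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- one token per position: no preserveOriginal=1, no catenation -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" preserveOriginal="0" catenateWords="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            maxShingleSize="3" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```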
Re: disaster recovery scenarios for solr cloud and zookeeper
I *think* at this point SolrCloud without ZooKeeper is like a ... body without a head? Otis -- Solr & ElasticSearch Support http://sematext.com/ On Fri, May 3, 2013 at 3:21 PM, Dennis Haller dhal...@talenttech.com wrote: Hi, Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is expected to have a very high (perfect?) availability. With 3 or 5 zookeeper nodes, it is possible to manage zookeeper maintenance and online availability to be close to %100. But what is the worst case for Solr if for some unanticipated reason all Zookeeper nodes go offline? Could someone comment on a couple of possible scenarios for which all ZK nodes are offline. What would happen to Solr and what would be needed to recover in each case? 1) brief interruption, say 2 minutes, 2) longer downtime, say 60 min Thanks Dennis
Re: disaster recovery scenarios for solr cloud and zookeeper
Ideally, the Solr nodes should be able to continue as long as no node fails. Failure of a leader would be bad, failure of non-leader replicas might cause some timeouts, but could be survivable. Of course, nodes could not be added. wunder On May 3, 2013, at 5:05 PM, Otis Gospodnetic wrote: I *think* at this point SolrCloud without ZooKeeper is like a . body without a head? Otis -- Solr ElasticSearch Support http://sematext.com/ On Fri, May 3, 2013 at 3:21 PM, Dennis Haller dhal...@talenttech.com wrote: Hi, Solr 4.x is architected with a dependency on Zookeeper, and Zookeeper is expected to have a very high (perfect?) availability. With 3 or 5 zookeeper nodes, it is possible to manage zookeeper maintenance and online availability to be close to %100. But what is the worst case for Solr if for some unanticipated reason all Zookeeper nodes go offline? Could someone comment on a couple of possible scenarios for which all ZK nodes are offline. What would happen to Solr and what would be needed to recover in each case? 1) brief interruption, say 2 minutes, 2) longer downtime, say 60 min Thanks Dennis -- Walter Underwood wun...@wunderwood.org
Re: disaster recovery scenarios for solr cloud and zookeeper
On 5/3/2013 6:07 PM, Walter Underwood wrote: Ideally, the Solr nodes should be able to continue as long as no node fails. Failure of a leader would be bad, failure of non-leader replicas might cause some timeouts, but could be survivable. Of course, nodes could not be added. I have read a few things that say things go read only when the zookeeper ensemble loses quorum. I'm not sure whether that means that Solr goes read only or zookeeper goes read only. I would be interested in knowing exactly what happens when zookeeper loses quorum as well as what happens if all three (or more) zookeeper nodes in the ensemble go away entirely. I have a SolrCloud I can experiment with, but I need to find a maintenance window for testing, so I can't check right now. Thanks, Shawn
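For reference when sizing or testing the ensemble: ZooKeeper holds quorum only while a strict majority of nodes is up, so an ensemble of n nodes tolerates floor((n-1)/2) failures — 1 of 3, 2 of 5, and a lone node tolerates none. A small sketch of that arithmetic:

```java
public class ZkQuorum {
    // Smallest node count that is a strict majority of the ensemble.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Node failures the ensemble can absorb and still hold quorum.
    static int tolerableFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 5}) {
            System.out.println(n + " ZK nodes: quorum=" + quorumSize(n)
                + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```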
Re: disaster recovery scenarios for solr cloud and zookeeper
In case all your Zk nodes go down, the querying would continue to work fine (as far as no nodes fail) but you'd not be able to add docs. Sent from my iPhone On 03-May-2013, at 17:52, Shawn Heisey s...@elyograg.org wrote: On 5/3/2013 6:07 PM, Walter Underwood wrote: Ideally, the Solr nodes should be able to continue as long as no node fails. Failure of a leader would be bad, failure of non-leader replicas might cause some timeouts, but could be survivable. Of course, nodes could not be added. I have read a few things that say things go read only when the zookeeper ensemble loses quorum. I'm not sure whether that means that Solr goes read only or zookeeper goes read only. I would be interested in knowing exactly what happens when zookeeper loses quorum as well as what happens if all three (or more) zookeeper nodes in the ensemble go away entirely. I have a SolrCloud I can experiment with, but I need to find a maintenance window for testing, so I can't check right now. Thanks, Shawn
Re: disaster recovery scenarios for solr cloud and zookeeper
Agree with Anshum. And Netflix has a very nice supervisor system for ZooKeeper: if the nodes go down, it will restart them automatically. http://techblog.netflix.com/2012/04/introducing-exhibitor-supervisor-system.html https://github.com/Netflix/exhibitor On Fri, May 3, 2013 at 6:53 PM, Anshum Gupta ans...@anshumgupta.net wrote: In case all your Zk nodes go down, the querying would continue to work fine (as far as no nodes fail) but you'd not be able to add docs. Sent from my iPhone On 03-May-2013, at 17:52, Shawn Heisey s...@elyograg.org wrote: On 5/3/2013 6:07 PM, Walter Underwood wrote: Ideally, the Solr nodes should be able to continue as long as no node fails. Failure of a leader would be bad, failure of non-leader replicas might cause some timeouts, but could be survivable. Of course, nodes could not be added. I have read a few things that say things go read only when the zookeeper ensemble loses quorum. I'm not sure whether that means that Solr goes read only or zookeeper goes read only. I would be interested in knowing exactly what happens when zookeeper loses quorum as well as what happens if all three (or more) zookeeper nodes in the ensemble go away entirely. I have a SolrCloud I can experiment with, but I need to find a maintenance window for testing, so I can't check right now. Thanks, Shawn