negative array size exception
After migrating from standalone Solr to a load-balanced SolrCloud with 3 ZooKeepers on the same machines, and 3 shards (one per node), we see this logged in the UI on one of our Solr nodes. Does anyone know what this is symptomatic of?

java.lang.NegativeArraySizeException
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:63)
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:44)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.<init>(ShardFieldSortedHitQueue.java:45)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:979)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:763)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:742)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:428)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
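For what it's worth, the trace shows the exception thrown while sizing the merge queue in ShardFieldSortedHitQueue, meaning something handed Lucene's PriorityQueue a negative size. One commonly reported trigger (an assumption here, not confirmed from these logs) is a distributed query with huge paging parameters, where start + rows overflows a signed 32-bit int. A sketch of the arithmetic in Python:

```python
def to_int32(n):
    """Wrap a Python int to Java's signed 32-bit overflow semantics."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# Hypothetical deep-paging parameters from a client:
start, rows = 2_147_483_000, 1_000

# In Java, sizing the queue as (start + rows) wraps negative, and
# allocating new Object[negative] throws NegativeArraySizeException.
queue_size = to_int32(start + rows)
print(queue_size < 0)  # True
```

If that's the case, capping rows and using cursor-based paging (cursorMark) instead of very large start offsets would avoid it.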
solr to solrcloud
Our out-of-the-box Solr 5.4.1 installation cannot handle the 50 GB analytics index anymore. We are using Sitecore 8.1 and planning to go to 8.2, but when we tried 8.2 we rebuilt the indexes and the site was very unresponsive, was missing items, and was too slow. We ended up giving that Solr server over 92 GB of RAM and saw that java.exe needed about 60 GB to process our massive index. Even then we couldn't get performance back into the site and decided to roll back to 8.1. We looked at options for scaling out horizontally because we cannot keep adding RAM to one Solr server. To go to SolrCloud we built 3 Ubuntu 14.04.5 servers, each with a 50 GB VM disk for the indexes and another VM disk for the OS, ZooKeeper, Java, Tomcat, and Solr applications. Each server has 32 GB of RAM. When we move to SolrCloud on these servers, what is the best way to set up the SolrCloud environment so it can take the data that already exists in our current Solr? We have about 16 indexes for Sitecore, with the biggest one being analytics (around 45-50 GB). Thanks, Darren Walker
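Not a Sitecore user, but a rough sketch of the SolrCloud side for Solr 5.x, with hypothetical names and paths (the zkcli.sh location and collection/config names will differ in your install):

```shell
# 1. Upload each core's config set to ZooKeeper:
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
  -cmd upconfig -confdir /path/to/sitecore_analytics/conf -confname sitecore_analytics

# 2. Create the collection, one shard per node:
bin/solr create_collection -c sitecore_analytics -shards 3 -replicationFactor 1
```

Note that the existing standalone index generally can't just be copied into the new collection: in a sharded collection, documents are routed to shards by hash range, so the usual route is to create the collections first and then reindex from Sitecore.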
Re: Search opening hours
Sounds odd that the indexing times would change. Hopefully something else was going on - I've not experienced this. On Tue, Sep 8, 2015 at 4:31 AM, O. Klein <kl...@octoweb.nl> wrote: > BTW any idea how index speed is influenced? > > I used worldbounds with -1 and 1 y-axes. But figured this could also be 0. > > After changing to 0 indexing became a lot slower though (no exceptions in > log). > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4227531.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Darren
Re: Search opening hours
I think the client code has to normalize the input. There are methods in the spatial libraries that will do this - or maybe I wrote them in my code, can't remember. How are you handling parsing the hours? - Darren > On Sep 6, 2015, at 4:56 PM, O. Klein <kl...@octoweb.nl> wrote: > > Saw that, but not a lot of info about it. > > From my understanding, the way it's supposed to work is that a value bigger > than the boundary gets normalized. > > I just get an exception "bad x not in boundary rect" > > Any pointers? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4227384.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Search opening hours
So thanks to the tireless efforts of David Smiley and the devs at Vivid Solutions (not to mention the various contributors that help power Solr and Lucene), spatial search is awesome, efficient, and easy. The biggest roadblock I've run into is not having the JTS (Java Topology Suite) JAR where Solr can find it. It doesn't ship with Solr OOB, so you have to either add it to one of the dynamic directories or bundle it with the WAR (I think pre-5.0). The link above has most of what you need to index data and issue queries. I'd also suggest the sections on spatial search in Solr in Action (Grainger, Potter) - they add a few more use cases that I've found interesting. Finally, the aging wiki has some good info too: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

Basically, indexing spatial data is as easy as anything else: define the field in the schema, create the data, and push it in. Now the data in this case are boxes or polygons (effectively the same here) and come in a specific format known as WKT, or Well-Known Text: https://en.wikipedia.org/wiki/Well-known_text. I'd say unless you're aiming at an advanced use case, set the max dist error on the field config a little higher than normal - precision isn't really a requirement here, and good unit tests would alert you to any unforeseen issues.

Then for the query side of the world you just ask for point inclusion like: q=+polygon:Contains(POINT(my_long my_lat)) Please note that WKT reverses the usual lat/lng order because it uses Euclidean geometry conventions (so X=longitude and Y=latitude). Can't tell you how many times my brain hurt thanks to this idiom combined with janky client logic :) Anyway, that's about it - let me know if you have any other questions.

On Wed, Aug 26, 2015 at 1:56 PM, O. Klein kl...@octoweb.nl wrote: Darren, This sounds like the solution I'm looking for. Especially nice fix for the Sunday-Monday problem. Never worked with spatial search before, so any pointers are welcome.
Will start working on this solution. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225443.html Sent from the Solr - User mailing list archive at Nabble.com. -- Darren
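A tiny sketch of the query construction above in Python (the field name `polygon` is from the example; the quoting around the predicate is how I'd write it - check the Spatial Search docs for your Solr version):

```python
def wkt_point(lon, lat):
    # WKT is X Y order: longitude first, latitude second.
    return f"POINT({lon} {lat})"

def contains_query(field, lon, lat):
    # Solr spatial predicate asking which indexed shapes contain the point.
    return f'{field}:"Contains({wkt_point(lon, lat)})"'

q = contains_query("polygon", -73.97, 40.78)
# q == 'polygon:"Contains(POINT(-73.97 40.78))"'
```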
Re: Search opening hours
Sorry - didn't finish my thought. I need to address querying :) So using the above to define what's in the index your queries for a day/time become a CONTAINS operation against the field. Let's say that the field is defined as a location_rpt using JTS and its Spatial Factory (which supports polygons) - oh, and it would need to be multi-valued. Querying the field would require first translating now or in an hour or Monday at 9am to a geocode, then hitting the index with a CONTAINS request per the docs: https://cwiki.apache.org/confluence/display/solr/Spatial+Search On Wed, Aug 26, 2015 at 11:23 AM, Darren Spehr darre...@gmail.com wrote: Sure - and sorry for its density. I reread it and thought the same ;) So imagine a polygon of say 1/2 mile width (I made that up) that stretches around the equator. Let's call this a week's timeline and subdivide it into 7 blocks, one for each day. For the sake of simplicity assume it's a line (which I forget but is supported in Solr as an infinitely small polygon) starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for Sunday at 11:59 PM. By subdivide you can think of it either radially or by longitude, but you have 360 degrees to divide into 7, which means that every hour is represented by a range of roughly 2.143 degrees (360/7/24). These regions represent each day and hour (or less), and the region boundaries represent midnight for the day before. Now for indexing - your open hours then become a combination of these subdivisions. If you're open 24x7 then the whole polygon is indexed. If you're only open on Monday from 9-5 then only the polygon between (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you can index any combination of times this way. So now the varsity question is how to do this with a fluctuating calendar? I think this example can be extended to include searching against any given day of the week in a year, or years. 
Just imagine a translation layer that adjusts the latitude N or S by some amount to represent which day in which year you're looking for. Make sense? On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote: delightfully dense = really intriguing, but I couldn't quite understand it - really hoping for more info On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote: Darren, That was delightfully dense. Do you think you could unpack it a bit more? Possibly some sample (pseudo) queries? Upayavira On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote: If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days) and then within this you can defined, like a timeline, what open and closed means. The problem of 3AM is taken care of because of it's continuous nature - ie one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380 Tuesday 9am-Wednesday 1am would be 1980:2940 You convert your NOW time into a minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night to Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren -- Darren -- Darren
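The degree arithmetic above is easy to get wrong by hand; here's a small sketch of the mapping (my own helper, day 0 = Monday) that reproduces the figures in the post:

```python
DEG_PER_HOUR = 360 / (7 * 24)  # ~2.143 degrees per hour, as above

def hour_to_lon(day, hour):
    """Map (day 0=Monday..6=Sunday, fractional hour 0-24) to a longitude,
    with Monday 12:00 AM at -180 and the end of Sunday at +180."""
    return -180 + (day * 24 + hour) * DEG_PER_HOUR

# Monday 9-5 indexes the segment between roughly -160.71 and -143.57:
open_lon = hour_to_lon(0, 9)    # ~ -160.71
close_lon = hour_to_lon(0, 17)  # ~ -143.57
```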
Re: Search opening hours
If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days), and then within each of these you can define, like a timeline, what open and closed mean. The problem of 3AM is taken care of by its continuous nature - i.e. one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380 Tuesday 9am-Wednesday 1am would be 1980:2940 You convert your NOW time into minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night to Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren
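Upayavira's minutes-in-a-week scale is easy to sketch; a hypothetical helper (day 0 = Monday) that reproduces his numbers:

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

def minutes_since_monday(day, hour, minute=0):
    """day 0 = Monday ... 6 = Sunday; returns minutes since Monday 00:00."""
    return day * 24 * 60 + hour * 60 + minute

# The examples from the post:
monday_hours = (minutes_since_monday(0, 9), minutes_since_monday(0, 23))  # 540:1380
tue_to_wed = (minutes_since_monday(1, 9), minutes_since_monday(2, 1))     # 1980:2940
now = minutes_since_monday(0, 11, 23)                                     # 683
```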
Re: Search opening hours
Sure - and sorry for its density. I reread it and thought the same ;) So imagine a polygon of say 1/2 mile width (I made that up) that stretches around the equator. Let's call this a week's timeline and subdivide it into 7 blocks, one for each day. For the sake of simplicity assume it's a line (which I forget but is supported in Solr as an infinitely small polygon) starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for Sunday at 11:59 PM. By subdivide you can think of it either radially or by longitude, but you have 360 degrees to divide into 7, which means that every hour is represented by a range of roughly 2.143 degrees (360/7/24). These regions represent each day and hour (or less), and the region boundaries represent midnight for the day before. Now for indexing - your open hours then become a combination of these subdivisions. If you're open 24x7 then the whole polygon is indexed. If you're only open on Monday from 9-5 then only the polygon between (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you can index any combination of times this way. So now the varsity question is how to do this with a fluctuating calendar? I think this example can be extended to include searching against any given day of the week in a year, or years. Just imagine a translation layer that adjusts the latitude N or S by some amount to represent which day in which year you're looking for. Make sense? On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote: delightfully dense = really intriguing, but I couldn't quite understand it - really hoping for more info On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote: Darren, That was delightfully dense. Do you think you could unpack it a bit more? Possibly some sample (pseudo) queries? 
Upayavira On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote: If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days) and then within this you can defined, like a timeline, what open and closed means. The problem of 3AM is taken care of because of it's continuous nature - ie one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380 Tuesday 9am-Wednesday 1am would be 1980:2940 You convert your NOW time into a minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night to Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren -- Darren
Solr 4.10.3 start up issue
Hi everyone - I posted a question on stackoverflow but in hindsight this would have been a better place to start. Below is the link. Basically I can't get the example working when using an external ZK cluster and auto-core discovery. Solr 4.10.1 works fine, but the newest release never gets new nodes into the active state. There are no errors or warnings, and compared to the log output of 4.10.1, the difference is that nodes never make it to leader election. Here is the stackoverflow question, along with the full log output: http://stackoverflow.com/questions/28004832/solr-4-10-3-is-not-proceeding-to-leader-election-on-new-cluster-startup-hangs Any help and guidance would be appreciated. Thanks! -- Darren
Re: Solr 4.10.3 start up issue
Thanks Hoss, this is exactly what I needed. I had previously run the example using nothing more than an external ZK hosting my own configuration. This of course means one of two things - my conf was bad, or Solr was at fault. The conf has been working for ages so I didn't test a replacement (it's amazing how a little frustration can fuel such hubris). I had thought to do this before - and should have; I uploaded the full example collection configuration to ZK just now and tried again. Magic, it worked, which left me feeling a bit glum. Well, happy that it wasn't Solr. Now if you'll excuse me, I have a conf review to perform. Darren On Wed, Jan 21, 2015 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I posted a question on stackoverflow but in hindsight this would have been : a better place to start. Below is the link. : : Basically I can't get the example working when using an external ZK cluster : and auto-core discovery. Solr 4.10.1 works fine, but the newest release your SO URL shows the output of using your custom configs, but not what you got with the example configs -- so it's not clear to me if there is really just one problem, or perhaps 2? you also mentioned a lot of details about how you are using solr with zk, and what doesn't work, but it's not clear if you tried other simpler steps using your configs -- or the example configs -- and if those simpler steps *did* work (ie: single node solr startup?) my best guess, based on the logs you did post and the mention of lib/mq/solr-search-ahead-2.0.0.jar in those logs, is that the entire question of zk and cluster state and leaders is a red herring, and what you are running into is: SOLR-6643... https://issues.apache.org/jira/browse/SOLR-6643 ...if i'm right, then simple core discovery with your configs on a single node solr instance w/o any knowledge of ZK will also fail to init the core -- and if you try to use the CoreAdmin API to CREATE a core, you'll get some kind of LinkageError.
: Here is the stackoverflow question, along with the full log output: : http://stackoverflow.com/questions/28004832/solr-4-10-3-is-not-proceeding-to-leader-election-on-new-cluster-startup-hangs -Hoss http://www.lucidworks.com/ -- Darren
RE: SolrCloud replica dies under high throughput
Thanks that helped. I no longer see the constant replica recovery. It also increased my throughput to 1.6/1.7 million per hour reliably. I actually then tried using SSDs instead and it flew up to 6.5 million updates per hour. Setup: 4 node cluster using m3.2xl AWS servers using general purpose SSDs. Thanks again, Darren -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: 22 July 2014 00:25 To: solr-user@lucene.apache.org Subject: Re: SolrCloud replica dies under high throughput Looks like you probably have to raise the http client connection pool limits to handle that kind of load currently. They are specified as top level config in solr.xml: maxUpdateConnections maxUpdateConnectionsPerHost -- Mark Miller about.me/markrmiller On July 21, 2014 at 7:14:59 PM, Darren Lee (d...@amplience.com) wrote: Hi, I'm doing some benchmarking with Solr Cloud 4.9.0. I am trying to work out exactly how much throughput my cluster can handle. Consistently in my test I see a replica go into recovering state forever caused by what looks like a timeout during replication. I can understand the timeout and failure (I am hitting it fairly hard) but what seems odd to me is that when I stop the heavy load it still does not recover the next time it tries, it seems broken forever until I manually go in, clear the index and let it do a full resync. Is this normal? Am I misunderstanding something? My cluster has 4 nodes (2 shards, 2 replicas) (AWS m3.2xlarge). I am indexing with ~800 concurrent connections and a 10 sec soft commit. I consistently get this problem with a throughput of around 1.5 million documents per hour. 
Thanks all, Darren Stack Traces Messages: [qtp779330563-627] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Error while trying to recover.
core=assets_shard2_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:188) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235) Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:245) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:241) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead
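For reference, in the newer-style solr.xml the settings Mark mentions live in the update shard handler section. A sketch of what that might look like (element name and the values here are my best guess for this version - verify against the solr.xml reference for your release):

```xml
<solr>
  <updateshardhandler>
    <int name="maxUpdateConnections">100000</int>
    <int name="maxUpdateConnectionsPerHost">100</int>
  </updateshardhandler>
</solr>
```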
SolrCloud replica dies under high throughput
Hi, I'm doing some benchmarking with Solr Cloud 4.9.0. I am trying to work out exactly how much throughput my cluster can handle. Consistently in my test I see a replica go into recovering state forever caused by what looks like a timeout during replication. I can understand the timeout and failure (I am hitting it fairly hard) but what seems odd to me is that when I stop the heavy load it still does not recover the next time it tries, it seems broken forever until I manually go in, clear the index and let it do a full resync. Is this normal? Am I misunderstanding something? My cluster has 4 nodes (2 shards, 2 replicas) (AWS m3.2xlarge). I am indexing with ~800 concurrent connections and a 10 sec soft commit. I consistently get this problem with a throughput of around 1.5 million documents per hour. Thanks all, Darren Stack Traces Messages: [qtp779330563-627] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:724) Error while trying to recover. core=assets_shard2_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:188) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235) Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:245) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:241) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251) at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123
SolrCloud - Highly Reliable / Scalable Resources?
Hi everyone, We have been using Solr Cloud (4.4) for ~6 months now. Functionally it's excellent, but we have suffered several issues which always seem quite problematic to resolve. I was wondering if anyone in the community can recommend good resources/reading for setting up a highly scalable, highly reliable cluster. A lot of what I see in the Solr documentation is aimed at small setups or is quite sparse. Dealing with topics like:
* Capacity planning
* Losing nodes
* Voting panic
* Recovery failure
* Replication factors
* Elasticity / auto scaling / scaling recipes
* Exhibitor
* Container configuration, concurrency limits, packet drop tuning
* Increasing capacity without downtime
* Scalable approaches to full indexing hundreds of millions of documents
* External health check vs CloudSolrServer
* Separate vs local zookeeper
* Benchmarks
Sorry, I know that's a lot to ask, heh. We are going to run a project for a month or so soon where we re-write all our run books and do deeper testing on various failure scenarios and the above, but any starting point would be much appreciated. Thanks all, Darren
MLT in SolrJ vs. URL?
Hi, I compose an MLT query in a URL and get back the queried result plus a list of documents in the moreLikeThis section in my browser. When I try to execute the same query in SolrJ, setting the same params, I only get the queried result document back and no MLT docs. What's the trick here? thanks, Darren
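Hard to say without seeing the SolrJ code, but a common culprit is one of the URL parameters not making it across: in SolrJ each MLT param has to be set explicitly on the query, under the same names the URL uses. As a checklist, a sketch of the URL form with the standard MLT component params (the query and field names here are hypothetical):

```python
from urllib.parse import urlencode

# Each of these must also be set on the SolrJ query object
# (e.g. query.set("mlt", "true"), query.set("mlt.fl", ...)).
params = {
    "q": "id:1234",          # hypothetical seed-document query
    "mlt": "true",           # enables the MoreLikeThis component on the handler
    "mlt.fl": "title,body",  # hypothetical fields; they need stored values or term vectors
    "mlt.mintf": 1,
    "mlt.mindf": 1,
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
```

If the params really are identical, the other thing to check is that the SolrJ side hits the same request handler as the URL does.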
Re: zk Config URL?
(AbstractInhabitantImpl.java:78) at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:253) at com.sun.enterprise.v3.server.AppServerStartup.doStart(AppServerStartup.java:145) at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:136) at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79) at com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63) at com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69) at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97) at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55) Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 55 more On 02/24/2013 08:32 PM, Mark Miller wrote: You either have to specifically upload a config set or use one of the bootstrap sys props. Are you doing either? - Mark On Feb 24, 2013, at 8:15 PM, Darren Govoni dar...@ontrenet.com wrote: Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it. 
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null ... [#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097) at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016) at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031) ... 10 more On 02/24/2013 07:21 PM, Michael Della Bitta wrote: Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways. Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format. 
If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I'm trying the latest solrcloud 4.1. Is there a button(or url) I can't find that shows me the zookeeper config XML, so I can check what other nodes are connected? Can't seem to find it. I deploy my solrcloud war into glassfish and set jetty.port (among other properties) to the GF domain port (e.g. 8181).' It starts successfully. I want zookeeper to run automatically within (as needed). How can I verify this or refer to the first/master server using zkHost from another node? (e.g. {host}:{port}) to form a cluster. I did this before a while ago, before solr 4.x was released, but things have changed. tips appreciated. thank you. Darren
Re: zk Config URL?
Ok. But it's way more complicated than it should be. It should work smarter. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Anirudha Jadhav aniru...@nyu.edu Date: To: solr-user@lucene.apache.org Subject: Re: zk Config URL? Solr cloud reads solr cfg files from zookeeper. You need to push the cfg to zookeeper and link the collection to the cfg. This is exactly what Mark suggested earlier in the thread. This is also explained in the solr cloud wiki. On Monday, February 25, 2013, Darren Govoni wrote: Hi Mark, I downloaded the latest zk and ran it. In my glassfish server, I set these system-wide properties: numShards = 1 zkHost = 10.x.x.x:2181 jetty.port = 8080 (port of my domain) bootstrap_config = true I copied all the solr 4.1 dist/*.jar into my glassfish domain lib/ext directory. Then I deployed the solr 4.1 war. It always throws this exception. [#|2013-02-25T13:31:32.304+|INFO|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WEB0171: Created virtual server [__asadmin]|#] [#|2013-02-25T13:31:32.768+|INFO|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WEB0172: Virtual server [server] loaded default web module []|#] [#|2013-02-25T13:31:34.222+|WARNING|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=10;_ThreadName=Thread-2;|DPL8007: Unsupported deployment descriptors element schemaLocation value http://www.bea.com/ns/weblogic/90 http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd|#] [#|2013-02-25T13:31:34.223+|SEVERE|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=10;_ThreadName=Thread-2;|DPL8006: get/add descriptor failure : filter-dispatched-requests-enabled TO false|#]
[#|2013-02-25T13:31:34.831+|SEVERE|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WebModule[/solr1]PWC1270: Exception starting filter SolrRequestFilter java.lang.NoClassDefFoundError: javax/servlet/Filter at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) at java.lang.ClassLoader.defineClass(ClassLoader.java:615) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:295) at com.sun.enterprise.v3.server.APIClassLoaderServiceImpl$APIClassLoader.loadClass(APIClassLoaderServiceImpl.java:206) at java.lang.ClassLoader.loadClass(ClassLoader.java:295) at java.lang.ClassLoader.loadClass(ClassLoader.java:295) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.glassfish.web.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1456) at org.glassfish.web.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1359) at org.apache.catalina.core.ApplicationFilterConfig.loadFilterClass(ApplicationFilterConfig.java:280) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:250) at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:120) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4685) at
org.apache.catalina.core.StandardContext.start(StandardContext.java:5377) at com.sun.enterprise.web.WebModule.start(WebModule.java:498) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:917) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:901) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:733) at com.sun.enterprise.web.WebContainer.loadWebModule(WebContainer.java:2019) at com.sun.enterprise.web.WebContainer.loadWebModule(WebContainer.java:1669) at com.sun.enterprise.web.WebApplication.start(WebApplication.java:109) at org.glassfish.internal.data.EngineRef.start(EngineRef.java:130) at org.glassfish.internal.data.ModuleInfo.start(ModuleInfo.java:269
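For reference, the advice in this thread (push the config to ZooKeeper, then link the collection to it) can be done with the zkcli tool that ships with Solr 4.x. The paths, ZK address, and config name below are assumptions for a stock install, not values confirmed by the thread:

```shell
# Upload a config directory to ZooKeeper under the name "myconf".
# zkcli.sh ships under example/cloud-scripts/ in the Solr 4.x distribution.
./zkcli.sh -zkhost 10.x.x.x:2181 -cmd upconfig \
    -confdir /path/to/solr/collection1/conf -confname myconf

# Link the collection to that uploaded config set.
./zkcli.sh -zkhost 10.x.x.x:2181 -cmd linkconfig \
    -collection collection1 -confname myconf
```

After the linkconfig step, the "Could not find configName for collection collection1" error above should no longer occur, since ZooKeeper now knows which config set backs the collection.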
zk Config URL?
Hi, I'm trying the latest solrcloud 4.1. Is there a button (or URL) I can't find that shows me the zookeeper config XML, so I can check what other nodes are connected? Can't seem to find it. I deploy my solrcloud war into glassfish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want zookeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this a while ago, before solr 4.x was released, but things have changed. Tips appreciated. thank you. Darren
Re: zk Config URL?
Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it. Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null ... [#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097) at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016) at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031) ... 10 more On 02/24/2013 07:21 PM, Michael Della Bitta wrote: Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways. 
Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format. If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I'm trying the latest solrcloud 4.1. Is there a button(or url) I can't find that shows me the zookeeper config XML, so I can check what other nodes are connected? Can't seem to find it. I deploy my solrcloud war into glassfish and set jetty.port (among other properties) to the GF domain port (e.g. 8181).' It starts successfully. I want zookeeper to run automatically within (as needed). How can I verify this or refer to the first/master server using zkHost from another node? (e.g. {host}:{port}) to form a cluster. I did this before a while ago, before solr 4.x was released, but things have changed. tips appreciated. thank you. Darren
RE: SolrJ and Solr 4.0 | doc.getFieldValue() returns String instead of Date
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'");
Date dateObj = df.parse("2009-10-29T00:00:009Z");

--- Original Message --- On 1/8/2013 09:34 AM uwe72 wrote:
A Lucene 4.0 document returns for a Date field now a string value, instead of a Date object.

field name=ModuleImpl.versionAsDate view=Datenstand type=date

Solr4.0 -- 2009-10-29T00:00:009Z
Solr3.6 -- Date instance

Can this be set somewhere in the config? I prefer to receive a date instance.

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-and-Solr-4-0-doc-getFieldValue-returns-String-instead-of-Date-tp4031588.html
Sent from the Solr - User mailing list archive at Nabble.com.
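Two details matter for the SimpleDateFormat approach suggested above: Solr's canonical timestamps use a 24-hour clock (`HH`, not `hh`), and they are always UTC, so the formatter needs an explicit time zone or it will parse in the JVM's default zone. A minimal sketch (the class name is mine, and I've used a well-formed timestamp rather than the one quoted above):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateParse {
    // Parse a Solr/ISO-8601 UTC timestamp such as "2009-10-29T00:00:00Z".
    public static Date parseSolrDate(String s) throws ParseException {
        // HH = 24-hour clock; the trailing 'Z' is matched as a literal,
        // so the zone must be forced to UTC explicitly.
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        return df.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        Date d = parseSolrDate("2009-10-29T00:00:00Z");
        System.out.println(d.getTime()); // epoch millis: 1256774400000
    }
}
```

Note that SimpleDateFormat is not thread-safe, so in a multithreaded indexer each thread should get its own instance.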
RE: RE: Max number of core in Solr multi-core
This should be clarified some. In the client API, SolrServer represents a connection to a single server backend/endpoint and should be re-used where possible. The approach being discussed is to have one client connection (represented by the SolrServer class) per solr core, all residing in a single solr server (as is the case below, but not required).

--- Original Message --- On 1/7/2013 08:06 AM Jay Parashar wrote:
This is the exact approach we use in our multithreaded env. One server per core. I think this is the recommended approach.

-----Original Message----- From: Parvin Gasimzade [mailto:parvin.gasimz...@gmail.com] Sent: Monday, January 07, 2013 7:00 AM To: solr-user@lucene.apache.org Subject: Re: Max number of core in Solr multi-core

I know that, but my question is different. Let me ask it in this way. I have a solr with base url localhost:8998/solr and two solr cores as localhost:8998/solr/core1 and localhost:8998/solr/core2. I have one base SolrServer instance initialized as: SolrServer server = new HttpSolrServer( url ); I have also created SolrServers for each core as: SolrServer core1 = new HttpSolrServer( url + "/core1" ); SolrServer core2 = new HttpSolrServer( url + "/core2" ); Since there are many cores, I have to initialize a SolrServer as shown above. Is there a way to create only one SolrServer with the base url and access each core using it? If it is possible, then I don't need to create a new SolrServer for each core.

On Mon, Jan 7, 2013 at 2:39 PM, Erick Erickson erickerick...@gmail.com wrote: This might help: https://wiki.apache.org/solr/Solrj#HttpSolrServer Note that the associated SolrRequest takes the path, I presume relative to the base URL you initialized the HttpSolrServer with. Best Erick

On Mon, Jan 7, 2013 at 7:02 AM, Parvin Gasimzade parvin.gasimz...@gmail.com wrote: Thank you for your responses. I have one more question related to Solr multi-core. By using SolrJ I create a new core for each application. When a user wants to add data or make a query on his application, I create a new HttpSolrServer for this core. In this scenario there will be many running HttpSolrServer instances. Is there a better solution? Does it cause a problem to run many instances at the same time?

On Wed, Jan 2, 2013 at 5:35 PM, Per Steffensen st...@designware.dk wrote: g a collection per application instead of a core
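The "re-use where possible" advice above can be made concrete by caching one client per core rather than constructing one per request. A hedged sketch of that pattern: `CoreClient` below is a stand-in for SolrJ's `HttpSolrServer` (which is thread-safe and meant to be shared), so in real code the constructor call would be `new HttpSolrServer(url)` instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CoreClientCache {
    // Stand-in for HttpSolrServer, so this sketch stays self-contained.
    public static class CoreClient {
        public final String url;
        public CoreClient(String url) { this.url = url; }
    }

    private final ConcurrentMap<String, CoreClient> clients =
            new ConcurrentHashMap<String, CoreClient>();
    private final String baseUrl;

    public CoreClientCache(String baseUrl) { this.baseUrl = baseUrl; }

    // Returns the shared client for a core, creating it at most once.
    public CoreClient clientFor(String core) {
        String url = baseUrl + "/" + core;
        CoreClient client = clients.get(url);
        if (client == null) {
            // putIfAbsent keeps exactly one instance under concurrent access.
            clients.putIfAbsent(url, new CoreClient(url));
            client = clients.get(url);
        }
        return client;
    }
}
```

With this in place, "many cores" costs one long-lived client per core instead of one per request, which is the scenario the thread is worried about.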
Re: Terminology question: Core vs. Collection vs...
This is a good explanation and makes sense. The one inconsistency is referring to a replica of a shard that has no replication. But it's not that big a problem. If you wove the term 'core' into your writeup below it would be complete and should be posted on the wiki. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Jack Krupansky j...@basetechnology.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Replication makes perfect sense even if our explanations so far do not. A shard is an abstraction of a subset of the data for a collection. A replica is an instance of the data of the shard and instances of Solr servers that have indicated a readiness to service queries and updates for the data. Alternatively, a replica is a node which has indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment. Let's describe it operationally for SolrCloud: If data comes in to any replica of a shard, it will automatically and quickly be replicated to all other replicas of the shard. If a new replica of a shard comes up, it will be streamed all of the data from another replica of the shard. If an existing replica of a shard restarts or reconnects to the cluster, it will be streamed updates of any new data since it was last updated from another replica of the shard. Replication is simply the process of assuring that all replicas are kept up to date. That's the same abstract meaning as for Master/Slave, even though the operational details are somewhat different. The goal remains the same. Replication factor is the number of instances of the data of the shard and instances of Solr servers that can service queries and updates for the data. Alternatively, the replication factor is the number of nodes of the SolrCloud cluster which have indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment.
A node is an instance of Solr running in a Java JVM that has indicated to the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a replica for a shard of a collection. [The latter part of that is a bit too fuzzy - I'm not sure what the node tells Zookeeper and who does shard assignment. I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node registers with Zookeeper and then somehow becomes a replica for a shard.] A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection. My only interest here on this user list is to understand and explain the terms we have today and that SEEM to be working for the most part, even though we may not have defined them carefully enough and used them consistently enough. If somebody want to propose an alternative terminology - fine, discuss that on the dev list and/or file a Jira. I won't claim that my definitions are perfect (yet), but perfecting the definitions (for users) should be separated from changing the terms themselves. -- Jack Krupansky -Original Message- From: Per Steffensen Sent: Friday, January 04, 2013 2:49 AM To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... On 1/3/13 5:58 PM, Walter Underwood wrote: A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard. I think that recycling the term replication within Solr was confusing, but it is a bit late to change that. wunder Yes, the term factor is not misleading, but the term replication is. 
If we keep calling shard-instances for Replica I guess replicaFactor will be ok - at least much better than replicationFactor. But it would still be better with e.g. ShardInstance and InstancesPerShard
Re: Terminology question: Core vs. Collection vs...
Yes, that's it. It's clear if we separate logical terms from physical terms. A simple cake diagram on the wiki, along with perhaps a UML diagram, will solidify these concepts. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Jack Krupansky j...@basetechnology.com Date: To: solr-user@lucene.apache.org,darren dar...@ontrenet.com Subject: Re: Terminology question: Core vs. Collection vs... I thought about adding Solr core, but it only muddies the water. Yes, it needs to be added, but carefully. In the context of SolrCloud, a Solr core is the underlying representation of a replica. Alternatively, a replica of a shard of a collection is implemented as a Solr core. [Need to factor in the potential for multiple shards on a single node.] Or, a Solr core is capable of serving as a replica of a shard. A Solr core has a collection name but can exist without being registered with Zookeeper, so it may not be a replica of a zookeeper-registered collection. Something like that. Not quite there yet. The main point, I think, is that when we talk about SolrCloud or a Solr cluster, it would be better for people to speak of replicas and shards and collections than cores, since core is the implementation rather than the abstraction. I mean, at the level of cores, they know of only documents and fields, not shards, replicas, and the overall structure of collections and the cluster. Sure, the core has the name of the collection, but cores on other nodes can use that same name. -- Jack Krupansky -Original Message- From: darren Sent: Friday, January 04, 2013 9:00 AM To: j...@basetechnology.com ; solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... This is a good explanation and makes sense. The one inconsistency is referring to a replica of a shard that has no replication. But it's not that big a problem. If you wove the term 'core' into your writeup below it would be complete and should be posted on the wiki.
Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Jack Krupansky j...@basetechnology.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Replication makes perfect sense even if our explanations so far do not. A shard is an abstraction of a subset of the data for a collection. A replica is an instance of the data of the shard and instances of Solr servers that have indicated a readiness to service queries and updates for the data. Alternatively, a replica is a node which has indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment. Lets describe it operationally for SolrCloud: If data comes in to any replica of a shard it will automatically and quickly be replicated to all other replicas of the shard. If a new replica of a shard comes up it will be streamed all of the data from the another replica of the shard. If an existing replica of a shard restarts or reconnects to the cluster, it will be streamed updates of any new data since it was last updated from another replica of the shard. Replication is simply the process of assuring that all replicas are kept up to date. That's the same abstract meaning as for Master/Slave even though the operational details are somewhat different. The goal remains the same. Replication factor is the number of instances of the data of the shard and instances of Solr servers that can service queries and updates for the data. Alternatively, the replication factor is the number of nodes of the SolrCloud cluster which have indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment. A node is an instance of Solr running in a Java JVM that has indicated to the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a replica for a shard of a collection. [The latter part of that is a bit too fuzzy - I'm not sure what the node tells Zookeeper and who does shard assignment. 
I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node registers with Zookeeper and then somehow becomes a replica for a shard.] A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection. My only interest here on this user list is to understand and explain the terms we have today and that SEEM to be working for the most part, even though we may not have defined them carefully enough and used them consistently enough. If somebody want to propose an alternative terminology - fine, discuss that on the dev list and/or file a Jira. I won't claim that my definitions are perfect (yet
Re: Terminology question: Core vs. Collection vs...
Agreed. But for completeness can it be node/collection/shard/replica/core? Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Yonik Seeley yo...@lucidworks.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote: Our biggest problem is that we really haven't decided once and for all and made sure to reflect the decision consistently across code and documentation. As long as we haven't, I believe it is still ok to change our minds. IMO, I *think* it's settled: a collection consists of 1 or more shards, which each consist of one or more replicas. A *long* time ago (3 years actually), I tried to get slice used in place of shard, just because shard was already used ambiguously by people for both physical and logical shards, but it never caught on, and as I recall no one could really agree on a set of terms that satisfied everyone. Attempting to replace Replica with something like Shard Instance could actually end up being worse, since it's a mouthful and people would tend to shorten it to shard when talking about it. From a practical standpoint, I don't think people will be confused by the current terminology once we document it well (we should probably start with collection/shard/replica). It's mostly an issue of when one goes looking for inconsistencies or things that might not make sense. And as has been pointed out, others use the exact same terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication In the *code* I have been migrating away from shard as the physical kind. I've also used slice as a synonym for logical shard in the code because of this mixed history of shard, and since removing all remnants of the use of shard as physical all at once would be impractical. Anyone who works on the code should not be bothered by an extra synonym, and things will continue to be cleaned up over time.
-Yonik http://lucidworks.com
Re: Terminology question: Core vs. Collection vs...
Actually. Node/collection/shard/replica/core/index Sent from my Verizon Wireless 4G LTE Smartphone Original message From: darren dar...@ontrenet.com Date: To: yo...@lucidworks.com,solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Agreed. But for completeness can it be node/collection/shard/replica/core? Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Yonik Seeley yo...@lucidworks.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote: Our biggest problem is that we really havent decided once and for all and made sure to reflect the decision consistently across code and documentation. As long as we havnt I believe it is still ok to change our minds. IMO, I *think* it's settled: It's collection consists of 1 or more shards, which each consist of one or more replicas. A *long* time ago (3 years actually), I tried to get slice used in place of shard just because shard was already used ambiguously by people for both physical and logical shards, but it never caught on, and as I recall no one could really agree on a set of terms that satisfied everyone. Attempting to replace Replica with something like Shard Instance could actually end up being worse since it's a mouthful and people would tend to shorten it to shard when talking about it. From a practical standpoint, I don't think people will be confused by the current terminology once we document it well (we should probably start with collection/shard/replica). It's mostly an issue of when one goes looking for inconsistencies or things that might not make sense. And as has been pointed out, others use the exact same terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication In the *code* I have been migrating away from shard as the physical kind. 
I've also used slice as a synonym for logical shard in the code because of this mixed history of shard and since removing all remnants of the use of shard as physical all at once would be impractical. Anyone who works on the code should not be bothered by an extra synonym, and things will continue to be cleaned up over time. -Yonik http://lucidworks.com
Re: Terminology question: Core vs. Collection vs...
My understanding is core is a logical solr term. Index is a physical lucene term. A solr core is backed by a physical lucene index. One index per core. Solr team can correct me if it's not accurate. :) Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Alexandre Rafalovitch arafa...@gmail.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of detail. And I vote on the cake diagram for the WIKI as well. Perhaps two, with the first one showing the trivial collapsed state of a single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'. I think I followed everything, except for the just-added term 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote: This is the containment hierarchy i understand but includes both physical and logical. Original message From: darren dar...@ontrenet.com Date: To: dar...@ontrenet.com,yo...@lucidworks.com,solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Actually. Node/collection/shard/replica/core/index Original message From: darren dar...@ontrenet.com Date: To: yo...@lucidworks.com,solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Agreed. But for completeness can it be node/collection/shard/replica/core?
Re: Terminology question: Core vs. Collection vs...
I agree. In my opinion index is a low level lucene thing. I never say a collection has an index directly. That confuses levels and creates confusion. To me at least. I think the terminology discussed is good. Just some lingering usage inconsistencies. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Alexandre Rafalovitch arafa...@gmail.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Hmm. Doesn't that make (logical) index=collection? And (physical) index=core? Which creates duplication of terminology and at the same time can cause confusion between highest logical and lowest physical level. Regards, Alex. P.s. Hoping not to start a new terminology war. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jan 4, 2013 at 1:21 PM, Jack Krupansky j...@basetechnology.comwrote: The entire collection does have an index - a distributed index - which consists of a Lucene index on each core/replica for the subset of the data in that shard. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Friday, January 04, 2013 1:12 PM To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of details. And I vote on the cake diagram for WIKI as well. Perhaps, two with the first one showing the trivial collapsed state of single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'. I think I followed everything, except for just added term of 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core? Regards, Alex. 
On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:

This is the containment hierarchy I understand, but it includes both physical and logical.

-------- Original message --------
From: darren dar...@ontrenet.com
To: dar...@ontrenet.com, yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Actually: node/collection/shard/replica/core/index

-------- Original message --------
From: darren dar...@ontrenet.com
To: yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Agreed. But for completeness, can it be node/collection/shard/replica/core?
Re: Terminology question: Core vs. Collection vs...
Good point. Agree.

Sent from my Verizon Wireless 4G LTE Smartphone

-------- Original message --------
From: Upayavira u...@odoko.co.uk
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Using your terminology, I'd say core is a physical Solr term, and index is a physical Lucene term. A collection or a shard is a logical Solr term.

Upayavira

On Fri, Jan 4, 2013, at 06:28 PM, darren wrote:

My understanding is that core is a logical Solr term and index is a physical Lucene term. A Solr core is backed by a physical Lucene index, one index per core. The Solr team can correct me if that's not accurate. :)
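The node/collection/shard/replica/core/index containment hierarchy proposed above can be sketched in code. This is a hypothetical illustration of the terminology being discussed, not Solr's actual data model; the class names follow the hierarchy, and the collection/core names are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Core:        # physical Solr core, backed by one Lucene index
    name: str
    index_dir: str  # where the Lucene index lives on disk

@dataclass
class Replica:     # one copy of a shard's data, hosted on a core
    node: str       # the Solr instance (host:port) holding it
    core: Core

@dataclass
class Shard:       # logical slice of the collection's documents
    name: str
    replicas: list = field(default_factory=list)

@dataclass
class Collection:  # logical index, spanning all shards
    name: str
    shards: list = field(default_factory=list)

collection = Collection("collection1", shards=[
    Shard("shard1", replicas=[
        Replica("node1:8983", Core("collection1_shard1_replica1", "/var/solr/s1r1")),
        Replica("node2:8983", Core("collection1_shard1_replica2", "/var/solr/s1r2")),
    ]),
    Shard("shard2", replicas=[
        Replica("node2:8983", Core("collection1_shard2_replica1", "/var/solr/s2r1")),
    ]),
])

# Every replica maps 1:1 to a core, and (per the thread) a core currently
# maps 1:1 to a Lucene index.
total_cores = sum(len(s.replicas) for s in collection.shards)
print(total_cores)  # 3
```

Note how the physical terms (core, index) appear only at the leaves, while collection and shard are purely logical groupings above them.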
Re: Terminology question: Core vs. Collection vs...
Yes. In that case, core would best be described as a logical Solr entity with various managed attributes and qualities above the physical layer (sorry, not trying to perpetuate this thread too much).

On 01/04/2013 01:55 PM, Mark Miller wrote:

Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason that needs to always be that way. It's possible that we may at some point add built-in micro-sharding support that means a SolrCore could have multiple underlying Lucene indexes. Or we may not.

- Mark

On Jan 4, 2013, at 1:49 PM, darren dar...@ontrenet.com wrote:

Good point. Agree.
RE: Re: Terminology question: Core vs. Collection vs...
Good write-up. And what about node?

I think there needs to be an official glossary of terms sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

Instance is a general term, but is commonly used to refer to a running Solr server, each of which can service any number of cores. A sharded collection would typically require multiple instances of Solr, each with a shard of the collection.

Multiple collections can be supported on a single instance of Solr. They don't have to be sharded or replicated. But if they are, each Solr instance will have a copy or replica of the data (index) of one shard of each sharded collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and giving the collection name. Some operations will refer only to the portion of a multi-shard collection on that Solr instance, but typically Solr will distribute the operation, whether it be an update or a query, to all of the shards of the named collection. In the case of an update, the update will be distributed to all replicas as well, but in the case of a query, only one replica of each shard of the collection is needed.

Before SolrCloud, Solr had master and slave, and the slaves were replicas of the master. With SolrCloud there is no master, and all the replicas of a shard are peers, although at any moment in time one of them will be considered the leader for coordination purposes - but not in the sense that it is a master of the other replicas in that shard. A SolrCloud replica is a replica of the data, in an abstract sense, for a single shard of a collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr core will have a Lucene index, but collectively the Lucene indexes for the shards of a collection can be referred to as the index of the collection. Each replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the cores/replicas of a single shard, or sometimes to a single replica, as it contains only a slice of the full collection data.

-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for the correct rather than loose meaning, as I am trying to teach an example that starts from an easy scenario and may scale to a multi-core, multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over in my mind at the moment:

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a clarification.

Regards,
   Alex.
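The routing rule described above - an update is distributed to all replicas of its shard, while a query needs only one replica per shard - can be sketched as follows. The cluster topology and node names here are made up for illustration, not taken from any real deployment:

```python
import random

# Toy topology: two shards, each with a list of replica locations.
cluster = {
    "shard1": ["node1:8983", "node2:8983"],
    "shard2": ["node2:8983", "node3:8983"],
}

def update_targets(shard: str) -> list:
    """An update is forwarded to ALL replicas of the document's shard."""
    return list(cluster[shard])

def query_targets() -> list:
    """A query fans out to ONE replica of EACH shard."""
    return [random.choice(replicas) for replicas in cluster.values()]

print(update_targets("shard1"))  # both replicas of shard1
print(len(query_targets()))      # one target per shard -> 2
```

This is why replication helps query throughput (any peer replica can serve a shard's portion of a query) but not update throughput (every replica must apply every update for its shard).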
RE: Re: Terminology question: Core vs. Collection vs...
Thanks again. (And sorry to jump into this convo.) But I had a question about your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense; rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well - a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A Solr 'cloud' is really a cluster.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Thanks. I got that part. A group of shards (and therefore cores) represents a collection, yes. But a single shard exists only on a single core?

--- Original Message ---
On 1/3/2013 09:03 AM Jack Krupansky wrote:

No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
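The point that "sharding is a way of slicing the original data" can be made concrete with a toy routing function: each document id hashes into exactly one shard, and replication then copies that shard's data elsewhere. This mimics the idea behind hash-based routing but is not Solr's actual routing algorithm; the shard count and document ids are invented:

```python
import hashlib

NUM_SHARDS = 3

def shard_for(doc_id: str) -> str:
    """Deterministically assign a document id to exactly one shard."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return f"shard{h % NUM_SHARDS + 1}"

# Every document lands on exactly one shard; replication then copies that
# shard's data to other cores. We replicate shards - we never re-shard them.
docs = [f"doc{i}" for i in range(100)]
placement = {d: shard_for(d) for d in docs}
assert all(p in {"shard1", "shard2", "shard3"} for p in placement.values())
```

Because the assignment is a pure function of the id, every replica of a shard agrees on which documents belong to it, which is what makes "a replica of a shard" well defined.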
RE: Re: Terminology question: Core vs. Collection vs...
I think what's confusing about your explanation below is when you have a situation where there is no replication factor. That's possible too, yes? So in that case, is each core of a shard of a collection still referred to as a replica? To me, a replica is a duplicate/backup of a shard's core, not the sharded core itself. Or is there just no difference - even a non-replicated core is called a replica?

--- Original Message ---
On 1/3/2013 09:08 AM Jack Krupansky wrote:

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise node to note that in SolrCloud a node is simply an instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Ah, ok. Good. Makes sense. I think I will draw all this up in a UML diagram that includes the distinction between the logical terms and the physical terms (and their mapping), as they do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:

A single shard MAY exist on a single core, but only if it is not replicated. Generally, a single shard will exist on multiple cores, each a replica of the source data as it comes into the update handler.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Great point.

--- Original Message ---
On 1/3/2013 10:42 AM Per Steffensen wrote:
> On 1/3/13 4:33 PM, Mark Miller wrote:
>> This has pretty much become the standard across other distributed systems
>> and in the literat…err…books.
> Hmmm, I'm not sure you are right about that. Maybe more than one
> distributed system calls them Replica, but there are also a lot that don't.
> But if you are right, that's at least a good valid argument to do it this
> way, even though I generally prefer good logical naming over following bad
> naming from the industry :-) Just because there is a lot of crap out there
> doesn't mean that we also want to make crap. Maybe good logical naming
> could even be a small entry in the "Why Solr is better than its
> competitors" list :-)
RE: Re: Terminology question: Core vs. Collection vs...
And based on the previous explanation there is never a copy of a shard. A shard represents and contains only replicas of itself, replicas being copies of the cores within the shard.

--- Original Message ---
On 1/3/2013 11:58 AM Walter Underwood wrote:
> A factor is multiplied, so multiplying the leader by a replicationFactor of
> 1 means you have exactly one copy of that shard.
>
> I think that recycling the term "replication" within Solr was confusing,
> but it is a bit late to change that.
>
> wunder
>
> On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:
>> This has pretty much become the standard across other distributed systems
>> and in the literat…err…books.
>>
>> I first implemented it as you mention you'd like, but Yonik correctly
>> pointed out that we were going against the grain.
>>
>> - Mark
>>
>> On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:
>>> For the same reasons that Replica shouldn't be called Replica (it
>>> requires too long an explanation to agree that it is an ok name),
>>> replicationFactor shouldn't be called replicationFactor as long as it
>>> refers to the TOTAL number of cores you get for your shard.
>>> replicationFactor would be an ok name if replicationFactor=0 meant one
>>> core, replicationFactor=1 meant two cores, etc., but as long as
>>> replicationFactor=1 means one core and replicationFactor=2 means two
>>> cores, it is bad naming (you will not get any replication with
>>> replicationFactor=1 - WTF!?!?). If we want to insist that you specify the
>>> total number of cores, at least use replicaPerShard instead of
>>> replicationFactor, or even better rename Replica to Shard-instance and
>>> use instancesPerShard instead of replicationFactor.
>>>
>>> Regards, Per Steffensen
>>>
>>> On 1/3/13 3:52 PM, Per Steffensen wrote:
>>>> Hi
>>>>
>>>> Here is my version - I do not believe the explanations have been very
>>>> clear.
>>>>
>>>> We have the following concepts (here I will try to explain what each
>>>> concept covers without naming it - it's hard):
>>>> 1) Machines (virtual or physical) running Solr server JVMs (one machine
>>>> can run several Solr server JVMs if you like)
>>>> 2) Solr server JVMs
>>>> 3) Logical stores where you can add/update/delete data-instances
>>>> (closest to logical tables in an RDBMS)
>>>> 4) Logical slices of a store (closest to non-overlapping logical sets of
>>>> rows for the logical table in an RDBMS)
>>>> 5) Physical instances of slices (a physical (disk/memory) instance of
>>>> the logical slice). This is where data actually goes on disk - the
>>>> logical stores and slices above are just non-physical concepts.
>>>>
>>>> Terminology:
>>>> 1) Believe we have no name for this (except of course "machine" :-) ),
>>>> even though Jack claims that this is called a node. Maybe sometimes it
>>>> is called a node, but I believe node is more often used to refer to a
>>>> Solr server JVM.
>>>> 2) Node
>>>> 3) Collection
>>>> 4) Shard. Used to be called Slice, but I believe now it is officially
>>>> called Shard. I agree with that change, because I believe most of the
>>>> industry also uses the term Shard for this logical/non-physical concept
>>>> - it just needs to be reflected across documentation and code.
>>>> 5) Replica. Used to be called Shard, but I believe now it is officially
>>>> called Replica. I certainly do not agree with the name Replica, because
>>>> it suggests that it is a copy of an original, but it isn't. I would
>>>> prefer Shard-instance here, to avoid the confusion. I understand that
>>>> you can argue (if you argue long enough) that Replica is a fine name,
>>>> but you really need the explanation to understand why Replica can be
>>>> defended as the name for this. It is not immediately obvious what this
>>>> is as long as it is called Replica. A Replica is basically a
>>>> SolrCloud-managed Core, and behind every Replica/Core lives a physical
>>>> Lucene index. So a Replica (= Core) contains/maintains a Lucene index
>>>> behind the scenes. The term Replica also needs to be reflected across
>>>> documentation and code.
>>>>
>>>> Regards, Per Steffensen

--
Walter Underwood
wun...@wunderwood.org
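Per's five concepts and the disputed replicationFactor semantics are easy to mis-read in prose. Here is a toy Python model of the mapping (Collection -> Shards -> Replicas, one core per Replica). This is an illustration of the terminology only, not Solr code; the generated core names are invented.

```python
# Toy model of the SolrCloud terminology discussed above (a sketch, not Solr code).
# Collection -> Shards (logical slices) -> Replicas, each backed by one core.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Replica:          # physical instance of a shard; maps to one Solr core
    core_name: str

@dataclass
class Shard:            # logical slice of the collection
    name: str
    replicas: List[Replica] = field(default_factory=list)

@dataclass
class Collection:       # logical store, closest to a table in an RDBMS
    name: str
    shards: List[Shard] = field(default_factory=list)

def build_collection(name: str, num_shards: int, replication_factor: int) -> Collection:
    """replicationFactor here is the TOTAL copies per shard (Per's complaint:
    replicationFactor=1 means one copy, i.e. no actual replication)."""
    coll = Collection(name)
    for s in range(1, num_shards + 1):
        shard = Shard(f"shard{s}")
        for r in range(1, replication_factor + 1):
            # hypothetical naming scheme, just for illustration
            shard.replicas.append(Replica(f"{name}_shard{s}_replica{r}"))
        coll.shards.append(shard)
    return coll

coll = build_collection("collection1", num_shards=2, replication_factor=3)
total_cores = sum(len(s.replicas) for s in coll.shards)
print(total_cores)  # 2 shards x replicationFactor 3 = 6 cores
```

With replication_factor=1 the total core count equals num_shards, which is exactly the "no replication with replicationFactor=1" point being argued.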
Re: Terminology question: Core vs. Collection vs...
I see. So sharding and distributing/replicating can have separate and different advantages.

On 01/03/2013 01:06 PM, Lance Norskog wrote:
> Also, searching can be much faster if you put all of the shards on one
> machine, along with the search distributor. That way, you search with
> multiple simultaneous threads inside one machine. I've seen this make
> searches several times faster.
>
> On 01/03/2013 06:36 AM, Jack Krupansky wrote:
>> Ah... the multiple shards (of the same collection) in a single node is
>> about planning for future expansion of your cluster - create more shards
>> than you need today, put more of them on a single node, and then migrate
>> them to their own nodes as the data outgrows the smaller number of nodes.
>> In other words, add nodes incrementally without having to reindex all the
>> data.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Darren Govoni
>> Sent: Thursday, January 03, 2013 9:18 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Re: Terminology question: Core vs. Collection vs...
>>
>> Yes. And it's worth noting that when having multiple shards in a single
>> node (@deprecated), they are shards of different collections...
>>
>> --- Original Message ---
>> On 1/3/2013 09:16 AM Jack Krupansky wrote:
>>> And I would revise "node" to note that in SolrCloud a node is simply an
>>> instance of a Solr server.
>>>
>>> And, technically, you can have multiple shards in a single instance of
>>> Solr, separating the logical sharding of keys from the distribution of
>>> the data.
>>>
>>> -----Original Message-----
>>> From: Jack Krupansky
>>> Sent: Thursday, January 03, 2013 9:08 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Terminology question: Core vs. Collection vs...
>>>
>>> Oops... let me word that a little more carefully:
>>>
>>> "...we are replicating the data of each shard."
RE: Does SolrCloud supports MoreLikeThis?
There is a ticket for that with some recent activity (sorry, I don't have it handy right now), but I'm not sure if that work made it into the trunk, so probably SolrCloud does not support MLT... yet. Would love an update from the dev team though!

--- Original Message ---
On 11/5/2012 10:37 AM Luis Cappa Banda wrote:
> That's the question, :-)
>
> Regards,
>
> Luis Cappa.
Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download
It certainly seems to be a rogue project, but I can't understand the meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.

On 10/29/2012 10:30 AM, Jack Krupansky wrote:
> Could any of the committers here confirm whether this is a legitimate
> effort? I mean, how could anything labeled "Apache ABC with XYZ" be an
> external project and be sanctioned/licensed by Apache? In fact, the linked
> web page doesn't even acknowledge the ownership of the Apache trademarks or
> ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT
> makes available a near realtime view." Equally nonsensical. Who knows,
> maybe it is legit, but it sure comes across as a scam/spam.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Nagendra Nagarajayya
> Sent: Monday, October 29, 2012 10:06 AM
> To: solr-user@lucene.apache.org
> Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and
> Realtime NRT available for download
>
> Hi!
>
> I am very excited to announce the availability of Apache Solr 4.0 with
> RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high performance
> and more granular NRT implementation as compared to soft commit. The update
> performance is about 70,000 documents / sec* (almost 1.5-2x performance
> improvement over soft commit). You can also scale up to 2 billion
> documents* in a single core, and query a half-billion-document index in
> ms**.
>
> Realtime NRT is different from realtime-get. realtime-get does not have
> search capability and is a lookup by id. Realtime NRT allows full search;
> see here for more info: http://solr-ra.tgels.org/realtime-nrt.jsp
>
> Realtime NRT has been contributed back to Solr, see JIRA:
> https://issues.apache.org/jira/browse/SOLR-3816
>
> RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
> boolean/dismax/boost queries and is compatible with the new Lucene 4.0 api.
>
> You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4 and
> Realtime NRT performance from here:
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>
> You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Note: 1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> * performance is a real use case of Apache Solr with RankingAlgorithm as
> seen at a user installation
> ** performance seen when using the age feature
Re: Cloud terminology clarification
I agree it needs updating, and I've always gotten confused at some point by the use (misuse) of terms. For example, the term 'node' is thrown around a lot too. What is it??! Hehe.

On Sat, 2012-09-08 at 22:26 -0700, JesseBuesking wrote:
> It's been a while since the terminology at
> http://wiki.apache.org/solr/SolrTerminology has been updated, so I'm
> wondering how these terms apply to solr cloud setups.
>
> My take on what the terms mean:
> Collection: Basically the highest-level container that bundles together the
> other pieces for servicing a particular search setup
> Core: An individual solr instance (represents entire indexes)
> Shard: A portion of a core (represents a subset of an index)
>
> Therefore:
> - increasing the number of shards allows for indexing more documents (aka
> scaling the amount of data that can be indexed)
> - increasing the number of cores increases the potential throughput of
> requests (aka cores mirror each other, allowing you to distribute requests
> to multiple servers)
>
> Does this sound right? If so, then my follow-up question would be: does the
> following directory structure look right/standard?
>
> .../solr # = solr home
> .../solr/collection-01
> .../solr/collection-01/core-01
> .../solr/collection-01/core-02
>
> And if this is right, I'm on a roll :D My next question would then be:
> Given we're using zookeeper (separate machine), do we need 1 conf folder at
> collection-01's level? Or do we need 1 conf folder per core?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Cloud-terminology-clarification-tp4006407.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Map/Reduce directly against solr4 index.
Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be whether your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput.

On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
> Is it possible to run map reduce jobs directly on Solr4?
>
> I'm asking this because I want to use Solr4 as the primary storage engine,
> and I want to be able to run near real time analytics against it as well,
> rather than export solr4 data out to a hadoop cluster.
Re: Map/Reduce directly against solr4 index.
You raise an interesting possibility. A map/reduce solr handler over solrcloud...

On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
> I think the performance should be close to Hadoop running on HDFS, if
> somehow a Hadoop job can directly read the Solr index file while executing
> the job on the local solr node. Kinda like how HBase and Cassandra
> integrate with Hadoop.
>
> Plus, we can run the map reduce job on a standby Solr4 cluster. This way,
> the documents in Solr will be our primary source of truth. And we have the
> ability to run near real time search queries and analytics on it. No need
> to export data around. Solr4 is becoming a very interesting solution to
> many web scale problems. Just missing the map/reduce component. :)
>
> On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:
>> Of course you can do it, but the question is whether this will produce the
>> performance results you expect.
Re: [Announce] Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download
What exactly is "Realtime NRT" (Near Real Time)?

On Sun, 2012-07-22 at 14:07 -0700, Nagendra Nagarajayya wrote:
> Hi!
>
> I am very excited to announce the availability of Solr 4.0-ALPHA with
> RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT implementation
> now supports both RankingAlgorithm and Lucene. Realtime NRT is a high
> performance and more granular NRT implementation as compared to soft
> commit. The update performance is about 70,000 documents / sec*. You can
> also scale up to 2 billion documents* in a single core, and query a
> half-billion-document index in ms**.
>
> RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
> boolean queries and is compatible with the new Lucene 4.0-ALPHA api.
>
> You can get more information about Solr 4.0-ALPHA with RankingAlgorithm
> 1.4.4 Realtime performance from here:
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>
> You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> * performance seen at a user installation of Solr 4.0 with RankingAlgorithm 1.4.3
> ** performance seen when using the age feature
Re: Facet on all the dynamic fields with *_s feature
You'll have to query the index for the field names, sift out the *_s ones, and cache them or something.

On Mon, 2012-07-16 at 16:52 +0530, Rajani Maski wrote:
> Yes, this feature would solve the below problem very neatly.
>
> All, is there any approach to achieve this for now?
>
> --Rajani
>
> On Sun, Jul 15, 2012 at 6:02 PM, Jack Krupansky j...@basetechnology.com wrote:
>> The answer appears to be "No", but it's good to hear people express an
>> interest in proposed features.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Rajani Maski
>> Sent: Sunday, July 15, 2012 12:02 AM
>> To: solr-user@lucene.apache.org
>> Subject: Facet on all the dynamic fields with *_s feature
>>
>> Hi All,
>>
>> Is this issue fixed in Solr 3.6 or 4.0: faceting on all dynamic fields
>> with facet.field=*_s
>>
>> Link: https://issues.apache.org/jira/browse/SOLR-247
>>
>> If it is not fixed, any suggestion on how do I achieve this? My
>> requirement is just the same as this one:
>> http://lucene.472066.n3.nabble.com/Dynamic-facet-field-tc2979407.html#none
>>
>> Regards,
>> Rajani
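A minimal sketch of the "sift out the *_s ones" workaround: it assumes the field names have already been fetched from the index (e.g. from Solr's Luke request handler under /admin/luke), and the sample field names here are invented.

```python
# Sketch: filter dynamic *_s fields out of an index's field list and build
# the facet.field parameters for a /select request. The field list would
# come from Solr (e.g. the Luke handler); here it is hard-coded.
from urllib.parse import urlencode

def dynamic_facet_params(field_names, suffix="_s"):
    """Pick out the dynamic string fields and build the facet query string
    you would append to a search request."""
    facet_fields = [f for f in field_names if f.endswith(suffix)]
    params = [("facet", "true")] + [("facet.field", f) for f in facet_fields]
    return facet_fields, urlencode(params)

# Hypothetical field list as it might come back from the index:
fields = ["id", "title", "author_s", "genre_s", "price"]
facet_fields, qs = dynamic_facet_params(fields)
print(facet_fields)  # ['author_s', 'genre_s']
print(qs)            # facet=true&facet.field=author_s&facet.field=genre_s
```

Caching the filtered list (as suggested above) avoids re-fetching the schema on every request; it just has to be refreshed when new dynamic fields appear.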
Re: Solr Faceting
I don't think it comes at any added cost for Solr to return that facet, so you can filter it out in your business logic.

On Sat, 2012-07-07 at 15:18 +0530, Shanu Jha wrote:
> Hi,
>
> I am generating a facet for a field which has "NA" as one of its values,
> and I want Solr to not create a facet for (or to ignore) this "NA" value.
> Is there any way in Solr to do that?
>
> Thanks
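Filtering the unwanted bucket client-side, as suggested, might look like the sketch below. It assumes the flat `[term1, count1, term2, count2, ...]` list shape that Solr's JSON response uses for facet_fields; adapt if your response writer returns a different structure.

```python
# Sketch: drop an unwanted facet bucket (e.g. "NA") from a Solr facet_fields
# list after the response comes back, instead of suppressing it server-side.
def drop_facet_value(facet_counts, excluded="NA"):
    """facet_counts mirrors Solr's flat [term, count, term, count, ...] shape;
    returns the same shape with the excluded term's pair removed."""
    pairs = zip(facet_counts[::2], facet_counts[1::2])
    kept = [(term, count) for term, count in pairs if term != excluded]
    return [x for pair in kept for x in pair]

print(drop_facet_value(["electronics", 14, "NA", 3, "books", 9]))
# ['electronics', 14, 'books', 9]
```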
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I don't recall anyone being able to get acceptable performance with a single index that large with Solr/Lucene. The conventional wisdom is that parallel searching across cores (or shards in SolrCloud) is the best way to handle index sizes in the billions. So it's of great interest how you did it. Anyone else gotten an index (or indexes) with billions of documents to perform well? I'm greatly interested in how.

On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote:
> It is a single node. I am trying to find out if the performance can be
> referenced. Regarding information on Solr with RankingAlgorithm, you can
> find all the information here: http://solr-ra.tgels.org
>
> On RankingAlgorithm: http://rankingalgorithm.tgels.org
>
> Regards,
> - NN
>
> On 5/27/2012 4:50 PM, Li Li wrote:
>> yes, I am also interested in good performance with 2 billion docs. how
>> many search nodes do you use? what's the average response time and qps?
>> another question: where can I find related papers or resources on your
>> algorithm which explain the algorithm in detail? why is it better than
>> google's (better than lucene is not very interesting, because lucene was
>> not originally designed to provide search functions like google)?
>>
>> On Mon, May 28, 2012 at 1:06 AM, Darren Govoni dar...@ontrenet.com wrote:
>>> I think people on this list would be more interested in your approach to
>>> scaling 2 billion documents than modifying solr/lucene scoring (which is
>>> already top notch). So given that, can you share any references or
>>> otherwise substantiate good performance with 2 billion documents?
>>> Thanks.
>>>
>>> On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
>>>> Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
>>>> docs. With RankingAlgorithm 1.4.3, using the age=latestdocs=number
>>>> feature, you can retrieve the NRT inserted documents in milliseconds
>>>> from such a huge index, improving query and faceting performance and
>>>> using very little resources ... Currently, RankingAlgorithm 1.4.3 is
>>>> only available with Solr 4.0, and the NRT insert performance with Solr
>>>> 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become
>>>> available with Solr 3.6 soon.
>>>>
>>>> Regards,
>>>> Nagendra Nagarajayya
>>>> http://solr-ra.tgels.org
>>>> http://rankingalgorithm.tgels.org
>>>>
>>>> On 5/27/2012 7:32 AM, Darren Govoni wrote:
>>>>> Hi,
>>>>>
>>>>> Have you tested this with a billion documents?
>>>>>
>>>>> Darren
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
Hi,

Have you tested this with a billion documents?

Darren

On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
> Hi!
>
> I am very excited to announce the availability of Solr 3.6 with
> RankingAlgorithm 1.4.2. This NRT support now works with both
> RankingAlgorithm and Lucene. The insert/update performance should be about
> 5000 docs in about 490 ms with the MbArtists index.
>
> RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over
> the earlier releases, supports the entire Lucene Query Syntax, ± and/or
> boolean queries, and can scale to more than a billion documents.
>
> You can get more information about NRT performance from here:
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
> You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> ps. The MbArtists index is the example index used in the Solr 1.4
> Enterprise book.
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I think people on this list would be more interested in your approach to scaling 2 billion documents than in modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks.

On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
> Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
> docs. With RankingAlgorithm 1.4.3, using the age=latestdocs=number feature,
> you can retrieve the NRT inserted documents in milliseconds from such a
> huge index, improving query and faceting performance and using very little
> resources ... Currently, RankingAlgorithm 1.4.3 is only available with Solr
> 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs /
> sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> On 5/27/2012 7:32 AM, Darren Govoni wrote:
>> Hi,
>>
>> Have you tested this with a billion documents?
>>
>> Darren
SolrCloud war context name?
Hi,

I am running my SolrCloud nodes in an app server deployed into the context path 'solr', and zookeeper sees all of them. I want to deploy a second SolrCloud war into the same app server (thus the same IP:port) in a different context like 'solrrep', with the same config (cloned).

Will this work? Or does zookeeper (or the SolrCloud leader) require all connected Solr shards to have a context URL of ip:port/solr? Or will the correct URL be registered from the replica shard?

thanks!
Re: SolrCloud war context name?
It's not really clear from the wiki how to use cores as shard replicas within the same Solr server. In my mind, having a separate JVM/Solr node acting as a replica makes sense, because the replication traffic will be on a different channel in a different VM and won't interfere with search/indexing traffic on the primary shards. Or am I missing something easy about using cores with SolrCloud? It was mentioned on the list recently that managing cores with SolrCloud isn't really the best practice for it.

On Sat, 2012-05-26 at 16:12 -0300, Marcelo Carvalho Fernandes wrote:
> Why not use multicore?
>
> Marcelo Carvalho Fernandes
> +55 21 8272-7970
>
> On Sat, May 26, 2012 at 12:56 PM, Darren Govoni ontre...@ontrenet.com wrote:
>> Hi,
>>
>> I am running my solrcloud nodes in an app server deployed into the context
>> path 'solr' and zookeeper sees all of them.
RE: Re: SolrCloud: how to index documents into a specific core and how to search against that core?
I'm curious what the solrcloud experts say, but my suggestion is to try not to over-engineering the search architecture on solrcloud. For example, what is the benefit of managing the what cores are indexed and searched? Having to know those details, in my mind, works against the automation in solrcore, but maybe there's a good reason you want to do it this way. brbrbr--- Original Message --- On 5/22/2012 07:35 AM Yandong Yao wrote:brHi Darren, br brThanks very much for your reply. br brThe reason I want to control core indexing/searching is that I want to bruse one core to store one customer's data (all customer share same brconfig): such as customer 1 use coreForCustomer1 and customer 2 bruse coreForCustomer2. br brIs there any better way than using different core for different customer? br brAnother way maybe use different collection for different customer, while brnot sure how many collections solr cloud could support. Which way is better brin terms of flexibility/scalability? (suppose there are tens of thousands brcustomers). br brRegards, brYandong br br2012/5/22 Darren Govoni dar...@ontrenet.com br br Why do you want to control what gets indexed into a core and then br knowing what core to search? That's the kind of knowing that SolrCloud br solves. In SolrCloud, it handles the distribution of documents across br shards and retrieves them regardless of which node is searched from. br That is the point of cloud, you don't know the details of where br exactly documents are being managed (i.e. they are cloudy). It can br change and re-balance from time to time. SolrCloud performs the br distributed search for you, therefore when you try to search a node/core br with no documents, all the results from the cloud are retrieved br regardless. This is considered A Good Thing. 
br br It requires a change in thinking about indexing and searching br br On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote: br Hi Guys, br br I use following command to start solr cloud according to solr cloud wiki. br br yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf br -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar br yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 br -jar br start.jar br br Then I have created several cores using CoreAdmin API ( br http://localhost:8983/solr/admin/cores?action=CREATEname= br coreNamecollection=collection1), and clusterstate.json show following br topology: br br br collection1: br -- shard1: br-- collection1 br-- CoreForCustomer1 br-- CoreForCustomer3 br-- CoreForCustomer5 br -- shard2: br-- collection1 br-- CoreForCustomer2 br-- CoreForCustomer4 br br br 1) Index: br br Using following command to index mem.xml file in exampledocs directory. br br yydzero:exampledocs bjcoe$ java -Durl= br http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml br SimplePostTool: version 1.4 br SimplePostTool: POSTing files to br http://localhost:8983/solr/coreForCustomer3/update.. br SimplePostTool: POSTing file mem.xml br SimplePostTool: COMMITting Solr index changes. br br And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3', br 'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2 br core has 0 documents. br br *Question 1:* Is this expected behavior? How do I to index documents br into br a specific core? br br *Question 2*: If SolrCloud don't support this yet, how could I extend it br to support this feature (index document to particular core), where br should i br start, the hashing algorithm? br br *Question 3*: Why the documents are also indexed into 'coreForCustomer1' br and 'coreForCustomer5'? The default replica for documents are 1, right? 
Then I try to index some documents to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

While 'coreForCustomer2' still has 0 documents, the documents in ipod_video are indexed to the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it returns all documents in the whole collection even though this core has no documents at all.

Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the Solr core name as the parameter value, right?

Thanks very much in advance!

Regards, Yandong
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Why do you want to control what gets indexed into a core and then knowing what core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e. they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you, therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching.

On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:

Hi Guys,

I use the following commands to start Solr Cloud according to the Solr Cloud wiki:

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Then I have created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology:

collection1:
-- shard1: collection1, CoreForCustomer1, CoreForCustomer3, CoreForCustomer5
-- shard2: collection1, CoreForCustomer2, CoreForCustomer4

1) Index: using the following command to index the mem.xml file in the exampledocs directory:

yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now the Solr Admin UI shows that 'coreForCustomer1', 'coreForCustomer3' and 'coreForCustomer5' have 3 documents each (mem.xml has 3 documents) and the other 2 cores have 0 documents.
*Question 1:* Is this expected behavior? How do I index documents into a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (index documents to a particular core)? Where should I start, the hashing algorithm?

*Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica for documents is 1, right?

Then I try to index some documents to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

While 'coreForCustomer2' still has 0 documents, the documents in ipod_video are indexed to the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it returns all documents in the whole collection even though this core has no documents at all.

Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the Solr core name as the parameter value, right?

Thanks very much in advance!

Regards, Yandong
Re: Distributed search between solrclouds?
The thought here is to distribute a search between two different SolrCloud clusters and get ordered, ranked results between them. Is it possible? On Tue, 2012-05-15 at 18:47 -0400, Darren Govoni wrote: Hi, Would distributed search (the old way, where you provide the Solr host IPs etc.) still work between different SolrClouds? thanks, Darren
Distributed search between solrclouds?
Hi, Would distributed search (the old way, where you provide the Solr host IPs etc.) still work between different SolrClouds? thanks, Darren
Re: Documents With large number of fields
Was there a response to this? On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, my single Solr document could end up containing 300-400 fields. In order to drill down to this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
Re: Documents With large number of fields
I'm also interested in this. Same situation. On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, my single Solr document could end up containing 300-400 fields. In order to drill down to this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
SolrCloud indexing question
Hi, I just wanted to make sure I understand how distributed indexing works in SolrCloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks
Re: SolrCloud indexing question
Gotcha. Now does that mean if I have 5 threads all writing to a local shard, will that shard piggyback those index requests onto a SINGLE connection to the leader? Or will they spawn 5 connections from the shard to the leader? I really hope the former; the latter won't scale well. On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote: my understanding is that you can send your updates/deletes to any shard and they will be forwarded to the leader automatically. That being said, your leader will always be the place where the indexing happens, and it is then distributed to the other replicas. On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, I just wanted to make sure I understand how distributed indexing works in SolrCloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks
Re: Opposite to MoreLikeThis?
You could run MLT for the document in question, then gather all the doc ids in the MLT results and negate those in a subsequent query. Not sure how robust that would be with very large result sets, but something to try. Another approach would be to gather the interesting terms from the document in question and then negate those terms in subsequent queries. Perhaps with many negated terms, Solr will rank results matching few of the negated terms above results matching many of them, simulating a ranked "less like this" effect. On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote: Hi all, Is there a way to implement the opposite of MoreLikeThis (LessLikeThis, I guess :)? The requirement we have is to remove all documents with content like that of a given document id or a text provided by the end-user. In the current index implementation (not using Solr), the user can narrow results by indicating what document(s) are not relevant to him and then request to remove from the search results any document whose content is like that of the selected document(s). Our index has close to 100 million documents and they cover multiple topics that are not related to one another. So, a search for some broad terms may retrieve documents about engineering, agriculture, communications, etc. As the user is trying to discover the relevant documents, he may select an agriculture-related document to exclude it and the documents like it from the result set; same with engineering-like content, etc., until most of the documents are about communications. Of course, some exclusions may actually remove relevant content, but those filters can be removed to go back to the previous set of results. Any ideas from similar implementations or suggestions are welcomed! Thanks, Carlos
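The first approach described above (negate the MLT results in a follow-up query) can be sketched client-side. This is a minimal sketch, not the MLT handler's own API; the field name "id" and the document ids are hypothetical examples.

```python
# Sketch of the "negate the MLT results" approach described above.
# Field name "id" and the doc ids are hypothetical examples.

def less_like_this_fq(mlt_doc_ids):
    """Build a Solr filter query that excludes every document the
    MoreLikeThis handler returned for the selected document."""
    if not mlt_doc_ids:
        return "*:*"  # nothing to exclude
    clauses = " OR ".join('id:"%s"' % i for i in mlt_doc_ids)
    return "*:* -(%s)" % clauses

fq = less_like_this_fq(["doc7", "doc9"])
print(fq)  # *:* -(id:"doc7" OR id:"doc9")
# fq can then be sent as an fq= parameter on the follow-up query
```

Each exclusion round appends another such filter, which matches the "remove these and everything like them" workflow Carlos describes.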
Re: hierarchical faceting?
Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Re: hierarchical faceting?
I don't use any of that stuff in my app, so not sure how it works. I just manage my taxonomy outside of Solr at index time and don't need any special fields or tokenizers. I use a string field type and insert the proper field at index time and query it normally. Nothing special required. On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote: It looks like TextField is the problem. This fixed it:

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I am assuming the text_path fields won't include whitespace characters. ?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!) Is there a tokenizer that tokenizes the string as one token? I tried to extend Tokenizer myself but it fails:

public class AsIsTokenizer extends Tokenizer {
    @Override
    public boolean incrementToken() throws IOException {
        return true; // or false;
    }
}

On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query time tokenizer that tokenizes at / ?q=colors:red == Doc1, Doc2 ?q=colors:redfoobar == ?q=colors:red/foobarasdfoaijao == Doc1, Doc2 On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as tokenizer.
Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
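The behavior the thread is wrestling with follows from what PathHierarchyTokenizer emits at index time: one token per ancestor path. A minimal simulation of that expansion (the function name is mine, not a Solr API):

```python
# Minimal simulation of the tokens PathHierarchyTokenizer emits at
# index time, illustrating why Doc1 ("red") matches a query for
# "red/pink" when the query analyzer splits the path the same way.

def path_hierarchy_tokens(value, delimiter="/"):
    parts = value.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokens("red"))       # ['red']
print(path_hierarchy_tokens("red/pink"))  # ['red', 'red/pink']
```

If both index and query sides expand "red/pink" into ['red', 'red/pink'], the 'red' token matches Doc1 too; keeping the query-side analyzer as a single-token (whitespace or keyword) tokenizer, as in the fix quoted in the thread, avoids that.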
Re: Monitoring SolrCloud health
Can you be more specific about health? On Sat, 2012-04-14 at 00:03 -0400, Jamie Johnson wrote: How do people currently monitor the health of a solr cluster? Are there any good tools which can show the health across the entire cluster? Is this something which is planned for the new admin user interface?
RE: Realtime /get versus SearchHandler
Yes.

--- Original Message --- On 4/13/2012 06:25 AM Benson Margulies wrote: A discussion over on the dev list led me to expect that the by-id field retrievals in a SolrCloud query would come through the get handler. In fact, I've seen them turn up in my search component in the search handler that is configured with my custom QT. (I have a 'prepare' method that sets ShardParams.QT to my QT to get my processing involved in the first of the two queries.) Did I overthink this?
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure Solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds. Needless to say, the search index needs to scale to 5 billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index were done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment were flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.
Lastly, how much hardware (assuming medium sized EC2 instances) would you estimate I'd need with this setup, for regular web data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
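The fuse-mount suggestion above could look roughly like the following sketch. The hostname, port, and mount path are hypothetical examples, and the hadoop-fuse-dfs command comes from the Cloudera package linked in the reply:

```
# Mount HDFS through FUSE (hostname, port and paths are hypothetical)
hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs

# Then point Solr at the mounted directory in solrconfig.xml:
#   <dataDir>/mnt/hdfs/solr/data</dataDir>
```

Note that Lucene's random-access I/O patterns over a FUSE mount were a known performance concern at the time, so this is a sketch to evaluate, not a recommendation.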
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
SolrCloud or any other tech-specific replication isn't going to 'just work' with Hadoop replication. But with some significant custom coding anything should be possible. Interesting idea.

--- Original Message --- On 4/12/2012 09:21 AM Ali S Kureishy wrote: Thanks Darren. Actually, I would like the system to be homogenous, i.e. use Hadoop-based tools that already provide all the necessary scaling for the Lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar

On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote: You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure Solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds. Needless to say, the search index needs to scale to 5 billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index were done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment were flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming medium sized EC2 instances) would you estimate my needing with this setup, for regular web data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
Hard to say why it's not working for you. Start with a fresh Solr and work forward from there, or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push

<delete><query>*:*</query></delete>

followed by:

<commit/>

I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems to be the relevant bit of my solrconfig.xml. My URP only implements processAdd.

<updateRequestProcessorChain name="RNI">
  <!-- some day, add parameters when we have some -->
  <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- activate RNI processing by adding the RNI URP to the chain for xml updates -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">RNI</str>
  </lst>
</requestHandler>
RE: SOLR issue - too many search queries
My first reaction to your question is: why are you running thousands of queries in a loop? Immediately, I think this will not scale well and the design probably needs to be revisited. Second, if you need that many requests, then you need to seriously consider an architecture that supports it. This will require a complex design involving load balancers, multiple servers, replication, etc. People have achieved this with Solr, but it's beyond the scope of Solr itself to provide this, as it's a matter of system architecture. Also, there are limits to the number of app server threads allowed, OS threads allowed, OS sockets, OS file descriptors, etc., all of which need to be understood, designed for, and configured properly.

--- Original Message --- On 4/10/2012 07:51 AM arunssasidhar wrote: We have a PHP web application which is using SOLR for searching. The app is using cURL to connect to the SOLR server, and it runs in a loop with thousands of predefined keywords. That will create thousands of different search queries to SOLR at a given time. My issue is that when a single user is logged into the app, everything works as expected. When more than one user tries to run the app, we get this response from the server: Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address. Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address. Failed... Our assumption is that the SOLR server is unable to handle this many search queries at a given time. If so, what is the solution to overcome this? Is there any setting like keep-alive in SOLR? Any help would be highly appreciated. Thanks, Arun S

-- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-issue-too-many-search-queries-tp3899518p3899518.html Sent from the Solr - User mailing list archive at Nabble.com.
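Independent of the architectural points above, one client-side way to cut the request count is to batch many keywords into a single boolean query instead of one HTTP request per keyword. A sketch; the field name "text" and the batch size are hypothetical, and per-keyword counts would need facet or grouping parameters on top:

```python
# Sketch: collapse thousands of single-keyword requests into a few
# batched boolean queries. Field name "text" and batch size are
# hypothetical examples.

def batched_queries(keywords, batch_size=50, field="text"):
    """Yield one Solr q string per batch of keywords, OR'ed together."""
    for i in range(0, len(keywords), batch_size):
        batch = keywords[i : i + batch_size]
        yield " OR ".join('%s:"%s"' % (field, k) for k in batch)

qs = list(batched_queries(["solr", "lucene", "jetty"], batch_size=2))
print(qs)  # two queries instead of three separate HTTP requests
```

Fewer, larger requests also sidestep the ephemeral-port exhaustion that "Cannot assign requested address" usually indicates.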
RE: Re: Cloud-aware request processing?
"...it is a distributed real-time query scheme..." SolrCloud does this already. It treats all the shards like one big index, and you can query it normally to get subset results from each shard. Why do you have to re-write the query for each shard? Seems unnecessary.

--- Original Message --- On 4/9/2012 08:45 AM Benson Margulies wrote: Jan Høydahl, My problem is intimately connected to Solr. It is not a batch job for Hadoop, it is a distributed real-time query scheme. I hate to add yet another complex framework if a Solr RP can do the job simply. For this problem, I can transform a Solr query into a subset query on each shard, and then let the SolrCloud mechanism. I am well aware of the 'zoo' of alternatives, and I will be evaluating them if I can't get what I want from Solr.

On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Instead of using Solr, you may want to have a look at Hadoop or another framework for distributed computation, see e.g. http://java.dzone.com/articles/comparison-gridcloud-computing -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com, Solr Training - www.solrtraining.com

On 9. apr. 2012, at 13:41, Benson Margulies wrote: I'm working on a prototype of a scheme that uses SolrCloud to, in effect, distribute a computation by running it inside of a request processor. If there are N shards and M operations, I want each node to perform M/N operations. That, of course, implies that I know N. Is that fact available anyplace inside Solr, or do I need to just configure it?
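The M/N split Benson describes can be computed locally on each node once the shard count N and the node's own shard index are known. A sketch; how a node learns those two values (e.g. from cluster state in ZooKeeper, or from configuration as he suggests) is left as an assumption:

```python
# Sketch of splitting M operations across N shards so each node handles
# roughly M/N of them. How a node learns num_shards and its own
# shard_index (e.g. from cluster state) is assumed, not shown.

def my_operations(operations, shard_index, num_shards):
    """Return the slice of operations this shard is responsible for."""
    return [op for i, op in enumerate(operations)
            if i % num_shards == shard_index]

ops = list(range(10))  # M = 10 hypothetical operations, N = 3 shards
print(my_operations(ops, 0, 3))  # [0, 3, 6, 9]
print(my_operations(ops, 1, 3))  # [1, 4, 7]
```

Round-robin assignment keeps the per-node load within one operation of M/N without any coordination beyond agreeing on the operation order.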
Re: How to facet data from a multivalued field?
The field type for that field should be looked at. Try not using a type that tokenizes or stems the field; you want to leave the text as is. I forget the exact setting, but it's documented in there somewhere. On Mon, 2012-04-09 at 13:02 -0700, Thiago wrote: Hello everybody, I've already searched for this topic in the forum, but I didn't find any case like this. I apologize if this topic has already been discussed. I'm having a problem faceting a multivalued field. My field is called series, and it has names of TV series like "the big bang theory", "two and a half men"... In this field I can have a lot of TV series names. For example:

<arr name="series">
  <str>Two and a Half Men</str>
  <str>How I Met Your Mother</str>
  <str>The Big Bang Theory</str>
</arr>

What I want to do is search and count how many documents are related to each series. I'm doing it using facet search on this field, but it's returning each word separately. Like this:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="series">
      <int name="bang">91</int>
      <int name="big">91</int>
      <int name="half">21</int>
      <int name="how">45</int>
      <int name="i">45</int>
      <int name="men">21</int>
      <int name="met">45</int>
      <int name="mother">45</int>
      <int name="theori">91</int>
      <int name="two">21</int>
      <int name="your">45</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>

And what I want is something like:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="series">
      <int name="Two and a Half Men">21</int>
      <int name="How I Met Your Mother">45</int>
      <int name="The Big Bang Theory">91</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>

Is there any possible way to do it with facet search? I don't want the terms, I just want each string including the white spaces. Do I have to change my fieldtype to do this? Thanks to everybody. Thiago -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p3897853.html Sent from the Solr - User mailing list archive at Nabble.com.
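The non-analyzed field suggested in the reply could look something like this in schema.xml. The copyField into a separate analyzed field is an assumption about how one might keep both a facetable and a searchable version; "text_general" is the analyzed type found in the stock example schema:

```xml
<!-- solr.StrField keeps the value as one verbatim term, so facets
     come back as whole titles ("The Big Bang Theory"), not tokens -->
<field name="series" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- optional: keep an analyzed copy for full-text search -->
<field name="series_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="series" dest="series_text"/>
```

Faceting on the string field (facet.field=series) then returns each full title with its count, which is the output the question asks for.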
No webadmin for trunk?
Hi, Just updated Solr trunk, tried java -jar start.jar, and localhost:8983/solr/admin is not found. Where did it go? thanks.
Re: No webadmin for trunk?
HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
start.jar has no apps in it at all. On Sat, 2012-04-07 at 09:47 -0400, Darren Govoni wrote: HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
Yep. I did all kinds of ant clean, ant dist, ant example, etc. My trunk rev: At revision 1310773. The example start.jar is broken. No webapp inside. :( On Sat, 2012-04-07 at 16:11 +0200, Rafał Kuć wrote: Hello! Did you run 'ant example'?
Re: No webadmin for trunk?
K. There is a solr.war in the webapps directory. But I still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll check out trunk in a few and try with the newest revision.
Re: No webadmin for trunk?
Now it comes up. Not sure why it was acting weird. Will continue to look at it. On Sat, 2012-04-07 at 10:23 -0400, Darren Govoni wrote: K. There is a solr.war in the webapps directory. But I still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll check out trunk in a few and try with the newest revision.
Re: upgrade 3.5 to 4.0
In my opinion, it's never a good idea to overwrite files of a previous version with a new version. The easiest thing would be to just deploy the Solr war file into Tomcat and let Tomcat manage the webapp, files, etc. On Sat, 2012-04-07 at 22:39 -0400, Dan Foley wrote: I have downloaded the nightly snapshot of v4.0 and would like to install it to my Tomcat install of Solr 3.5. Can I simply overwrite the current files, or is there a correct method for doing so? Please advise. Thanks
Re: Does any one know when Solr 4.0 will be released.
No one knows. But if you ask the devs, they will say 'when it's done'. One clue might be to monitor the bugs/issues scheduled for 4.0; when they are all resolved, then it's ready. On Wed, 2012-04-04 at 09:41 -0700, srinivas konchada wrote: Hello everyone, Does anyone know when Solr 4.0 will be released? There is a specific feature that exists in 4.0 which we want to take advantage of. The problem is we cannot deploy something into production from trunk; we need to use an official release. Thanks, Srinivas Konchada
Re: Duplicates in Facets
Try using Luke to look at your index and see if there are multiple similar TFVs. You can browse them easily in Luke. On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote: I am currently indexing some information and am wondering why I am getting duplicates in facets. From what I can tell they are the same, but is there any case that could cause this that I may not be thinking of? Could this be some non-printable character making its way into the index? Sample output from Luke:

<lst name="fields">
  <lst name="organization_umvs">
    <str name="type">string</str>
    <str name="schema">I--M---OFl</str>
    <str name="dynamicBase">*_umvs</str>
    <str name="index">(unstored field)</str>
    <int name="docs">332</int>
    <int name="distinct">-1</int>
    <lst name="topTerms">
      <int name="ORGANIZATION 1">328</int>
      <int name="ORGANIZATION 2">124</int>
      <int name="ORGANIZATION 2">36</int>
      <int name="ORGANIZATION 2">20</int>
      <int name="ORGANIZATION 3">4</int>
    </lst>
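The non-printable-character hypothesis raised above is easy to check client-side before indexing. A sketch; the values are hypothetical examples of strings that render identically but index as distinct terms:

```python
# Sketch: facet values that look identical can differ by invisible
# characters (trailing space, non-breaking space, ...), which shows up
# as "duplicate" facet entries. Values are hypothetical examples.

raw_values = ["ORGANIZATION 2", "ORGANIZATION 2 ", "ORGANIZATION\u00a02"]

print(len(set(raw_values)))  # 3 distinct index terms, one facet row each

# Normalizing whitespace before indexing collapses them to one term:
normalized = {" ".join(v.split()) for v in raw_values}
print(len(normalized))       # 1
```

Running the raw field values through such a normalization step (or inspecting the exact bytes of each term in Luke) distinguishes true duplicates from invisible-character variants.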
Custom scoring question
Hi, I have a situation where I want to re-score document relevance. Let's say I have two fields: text: "The quick brown fox jumped over the white fence." terms: "fox fence" Now my queries come in as terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field 'terms' within field 'text', which is a per-document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
I'm going to try index-time per-field boosting, do the boost computation at index time, and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation where I want to re-score document relevance. Let's say I have two fields: text: "The quick brown fox jumped over the white fence." terms: "fox fence" Now my queries come in as terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field 'terms' within field 'text', which is a per-document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
Yeah, I guess that would work. I wasn't sure if it would change relative to other documents. But if it were to be combined with other fields, that approach may not work because the calculation wouldn't include the scoring for other parts of the query. So then you have the dynamic score and what to do with it. On Thu, 2012-03-29 at 16:29 -0300, Tomás Fernández Löbbe wrote: Can't you simply calculate that at index time and assign the result to a field, then sort by that field. On Thu, Mar 29, 2012 at 12:07 PM, Darren Govoni dar...@ontrenet.com wrote: I'm going to try index time per-field boosting and do the boost computation at index time and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text. Which is a per document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
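Tomás's index-time suggestion from this thread can be sketched as follows. This is only an illustration: the field names (text, terms, terms_score) follow the example in the thread, and the overlap metric is one hypothetical choice of "distribution of field terms within field text," not anything Solr prescribes.

```python
def terms_distribution_score(text, terms):
    """Fraction of tokens in `text` that appear in the `terms` field.

    One illustrative per-document metric; any function of the two
    fields could be precomputed at index time the same way.
    """
    text_tokens = text.lower().split()
    term_set = {t.lower() for t in terms.split()}
    if not text_tokens:
        return 0.0
    hits = sum(1 for tok in text_tokens if tok in term_set)
    return hits / len(text_tokens)

# Compute the score before indexing and store it in a sortable field;
# at query time, sort=terms_score desc replaces the relevance ranking.
doc = {
    "id": "doc1",
    "text": "The quick brown fox jumped over the white fence",
    "terms": "fox fence",
}
doc["terms_score"] = terms_distribution_score(doc["text"], doc["terms"])
```

As noted in the thread, the limitation is that this static value cannot react to the rest of the query; combining it with other scored clauses would need a function query or custom scorer instead.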
MLT and solrcloud?
Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include the MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren
Re: MLT and solrcloud?
Ok, I'll do what I can to help! As always, appreciate the hard work Mark. On Thu, 2012-03-22 at 17:31 -0400, Mark Miller wrote: On Mar 22, 2012, at 5:22 PM, Darren Govoni wrote: Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include the MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren Usually no real timetables here :) Depends on who jumps in when. Some work has already gone on for this here: https://issues.apache.org/jira/browse/SOLR-788 You might just try and jump start that issue again? As I get a free moment or two, I'm happy to help commit a solution. - Mark Miller lucidimagination.com
RE: Re: maxClauseCount Exception
True, but how can you find documents containing that field without expanding 1000 clauses? --- Original Message --- On 3/19/2012 07:24 AM Erick Erickson wrote: "So all I want to do is a simple all docs with something in this field, and to highlight the field" But that doesn't really make sense to do at the Solr/Lucene level. All you're saying is that you want that field highlighted. Wouldn't it be much easier to just do this at the app level whenever your field had anything returned in it? Best Erick On Sat, Mar 17, 2012 at 8:07 PM, Darren Govoni dar...@ontrenet.com wrote: Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different numbers of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms.
(you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting. : params={hl=truehl.snippets=4hl.simple.pre=b/bfl=*,scorehl.mergeContiguous=truehl.usePhraseHighlighter=truehl.requireFieldMatch=trueechoParams=allhl.fl=text_tq={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)rows=20start=0wt=javabinversion=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery : $TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) ... : at : org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at : org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
Re: Inconsistent Results with ZooKeeper Ensemble and Four SOLR Cloud Nodes
I think he's asking if all the nodes (same machine or not) return a response. Presumably you have different ports for each node since they are on the same machine. On Sun, 2012-03-18 at 14:44 -0400, Matthew Parker wrote: The cluster is running on one machine. On Sun, Mar 18, 2012 at 2:07 PM, Mark Miller markrmil...@gmail.com wrote: From every node in your cluster you can hit http://MACHINE1:8084/solr in your browser and get a response? On Mar 18, 2012, at 1:46 PM, Matthew Parker wrote: My cloud instance finally tried to sync. It looks like it's having connection issues, but I can bring the SOLR instance up in the browser so I'm not sure why it cannot connect to it. I got the following condensed log output: org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect Retrying request shard update error StdNode: http://MACHINE1:8084/solr/:org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java: 483) .. .. .. Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.connect0(Native Method) .. .. .. try and ask http://MACHINE1:8084/solr to recover Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483) ... ... ... 
Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.waitForConnect(Native method) .. .. .. On Sat, Mar 17, 2012 at 10:10 PM, Mark Miller markrmil...@gmail.com wrote: Nodes talk to ZooKeeper as well as to each other. You can see the addresses they are trying to use to communicate with each other in the 'cloud' view of the Solr Admin UI. Sometimes you have to override these, as the detected default may not be an address that other nodes can reach. As a limited example: for some reason my mac cannot talk to my linux box with its default detected host address of halfmetal:8983/solr - but the mac can reach my linux box if I use halfmetal.Local - so I have to override the published address of my linux box using the host attribute if I want to setup a cluster between my macbook and linux box. Each node talks to ZooKeeper to learn about the other nodes, including their addresses. Recovery is then done node to node using the appropriate addresses. - Mark Miller lucidimagination.com On Mar 16, 2012, at 3:00 PM, Matthew Parker wrote: I'm still having issues replicating in my work environment. Can anyone explain how the replication mechanism works? Is it communicating across ports or through zookeeper to manage the process? On Thu, Mar 8, 2012 at 10:57 PM, Matthew Parker mpar...@apogeeintegration.com wrote: All, I recreated the cluster on my machine at home (Windows 7, Java 1.6.0.23, apache-solr-4.0-2012-02-29_09-07-30), sent some documents through Manifold using its crawler, and it looks like it's replicating fine once the documents are committed. This must be related to my environment somehow. Thanks for your help.
Regards, Matt On Fri, Mar 2, 2012 at 9:06 AM, Erick Erickson erickerick...@gmail.com wrote: Matt: Just for paranoia's sake, when I was playing around with this (the _version_ thing was one of my problems too) I removed the entire data directory as well as the zoo_data directory between experiments (and recreated just the data dir). This included various index.2012 files and the tlog directory on the theory that *maybe* there was some confusion happening on startup with an already-wonky index. If you have the energy and tried that it might be helpful information, but it may also be a total red-herring. FWIW Erick On Thu, Mar 1, 2012 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote: I'm assuming the windows configuration looked correct? Yeah, so far I can not spot any smoking gun...I'm confounded at the moment. I'll re-read through everything once more... - Mark
Re: maxClauseCount Exception
Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different numbers of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms. (you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting.
: params={hl=truehl.snippets=4hl.simple.pre=b/bfl=*,scorehl.mergeContiguous=truehl.usePhraseHighlighter=truehl.requireFieldMatch=trueechoParams=allhl.fl=text_tq={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)rows=20start=0wt=javabinversion=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery : $TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) ... : at : org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at : org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
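Hoss's advice can be applied mechanically to the request shown in that log. A sketch only: the parameter names (q, fq, hl, hl.fl) are standard Solr request parameters, but the exact clause split below is my reading of his suggestion, not something from the thread verbatim.

```python
# Before: everything lives in one q, so the open-ended range
# type_s:[* TO *] gets rewritten for highlighting into a giant
# BooleanQuery over every term in the field -> TooManyClauses.
before = {
    "q": "(kind_s:doc OR kind_s:xml) AND (type_s:[* TO *]) AND (usergroup_sm:admin)",
    "hl": "true",
    "hl.fl": "text_t",
}

# After: only the clauses you actually want scored (and highlighted)
# stay in q; match-only clauses move to fq, where each filter is
# cached independently, contributes nothing to the score, and is
# never handed to the highlighter.
after = {
    "q": "kind_s:doc OR kind_s:xml",
    "fq": ["type_s:[* TO *]", "usergroup_sm:admin"],
    "hl": "true",
    "hl.fl": "text_t",
}
```

Repeating the fq key as a list mirrors how Solr accepts multiple fq parameters on one request.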
RE: Solr 4.0 and production environments
As a rule of thumb, many will say not to go to production with a pre-release baseline. So until Solr4 goes final and stable, it's best not to assume too much about it. Second suggestion is to properly stage new technologies in your product such that they go through their own validation. And so to that end, jump right in and start using Solr4 and see for yourself! It's a great technology. --- Original Message --- On 3/7/2012 11:47 AM Dirceu Vieira wrote: Hi All, Has anybody started using Solr 4.0 in production environments? Is it stable enough? I'm planning to create a proof of concept using solr 4.0, we have some projects that will gain a lot with features such as near real time search, joins and others, that are available only on version 4. Is it too risky to think of using it right now? What are your thoughts and experiences with that? Best regards, -- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Re: Building a resilient cluster
What I think was mentioned on this a bit ago is that the index stops working if one of the nodes goes down unless it's a replica. You have 2 nodes running with numShards=2? Thus if one goes down the entire index is inoperable. In the future I'm hoping this changes such that the index cluster continues to operate but will lack results from the downed node. Maybe this has changed in recent trunk updates though. Not sure. On Mon, 2012-03-05 at 20:49 -0800, Ranjan Bagchi wrote: Hi Mark, So I tried this: started up one instance w/ zookeeper, and started a second instance defining a shard name in solr.xml -- it worked, searching would search both indices, and looking at the zookeeper ui, I'd see the second shard. However, when I brought the second server down -- the first one stopped working: it didn't kick the second shard out of the cluster. Any way to do this? Thanks, Ranjan From: Mark Miller markrmil...@gmail.com To: solr-user@lucene.apache.org Cc: Date: Wed, 29 Feb 2012 22:57:26 -0500 Subject: Re: Building a resilient cluster Doh! Sorry - this was broken - I need to fix the doc or add it back. The shard id is actually set in solr.xml since it's per core - the sys prop was a sugar option we had set up. So either add 'shard' to the core in solr.xml, or to make it work like it does in the doc, do: core name=collection1 shard=${shard:} instanceDir=. / That sets shard to the 'shard' system property if it's set, or as a default, act as if it wasn't set. I've been working with custom shard ids mainly through solrj, so I hadn't noticed this. - Mark On Wed, Feb 29, 2012 at 10:36 AM, Ranjan Bagchi ranjan.bag...@gmail.com wrote: Hi, At this point I'm ok with one zk instance being a point of failure, I just want to create sharded solr instances, bring them into the cluster, and be able to shut them down without bringing down the whole cluster.
According to the wiki page, I should be able to bring up a new shard by using shardId [-D shardId], but when I did that, the logs showed it replicating an existing shard. Ranjan Andre Bois-Crettez wrote: You have to run ZK on at least 3 different machines for fault tolerance (a ZK ensemble). http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble Ranjan Bagchi wrote: Hi, I'm interested in setting up a solr cluster where each machine [at least initially] hosts a separate shard of a big index [too big to sit on the machine]. I'm able to put a cloud together by telling it that I have (to start out with) 4 nodes, and then starting up nodes on 3 machines pointing at the zkInstance. I'm able to load my sharded data onto each machine individually and it seems to work. My concern is that it's not fault tolerant: if one of the non-zookeeper machines falls over, the whole cluster won't work. Also, I can't create a shard with more data, and have it work within the existing cloud. I tried using -DshardId=shard5 [on an existing 4-shard cluster], but it just started replicating, which doesn't seem right. Are there ways around this? Thanks, Ranjan Bagchi -- - Mark http://www.lucidimagination.com
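Mark's solr.xml fix from earlier in this thread, reconstructed with its markup (the attribute values are exactly as quoted in his reply):

```xml
<!-- In solr.xml: set shard per core. ${shard:} takes the 'shard'
     system property when one is passed on the command line, and
     behaves as if unset otherwise. -->
<core name="collection1" shard="${shard:}" instanceDir="." />
```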
Re: [SolrCloud] Slow indexing
A question relating to this. If you are running a single ZK node, but say 10 other nodes and then parallel index on each of those nodes, will the ZK be hit by all 10 indexing nodes constantly? i.e. very chatty? If one of those 10 indexing nodes goes down or falls out of sync and comes back, does ZK block the state of indexing until that single node catches back up? On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote: everything stalls after it lists all segment files and that a ZK state change has occurred. Can you get a stack trace here? I'll try to respond to more tomorrow. What version of trunk are you using? We have been making fixes and improvements all the time, so need to get a frame of reference. When a client node cannot talk to zookeeper, because it may not know certain things it should (what if a leader changes?), it must reject updates (searches will still work). Why can't the node talk to zookeeper? Perhaps the load is so high on the server, it cannot respond to zk within the session timeout? I really don't know yet. When this happens though, it forces a recovery when/if the node can reconnect to zookeeper. We have not yet started on optimizing bulk indexing - currently an update is added locally *before* sending updates in parallel to each replica. Then we wait for each response before responding to the client. We plan to offer more optimizations and options around this. Feedback will be useful in making some of these improvements. - Mark Miller lucidimagination.com
Re: Trunk build errors
I updated yesterday and did an ant clean, ant test. I will try a clean pull next. I'm on linux. Perhaps an ant version issue? There was recently some work done to get better about checking on licenses, when did you last get trunk? About 9 days ago was the last go-round. And did you do an 'ant clean'? It works on my machine with a fresh pull this morning. Best Erick On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
maxClauseCount error
Hi, I am suddenly getting a maxClauseCount error and don't know why. I am using Solr 3.5.
maxClauseCount Exception
Hi, I am suddenly getting a maxClauseCount exception for no reason. I am using Solr 3.5. I have only 206 documents in my index. Any ideas? This is weird. QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl, hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch, echoParams, hl.fl, q, rows, start]|#] [#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1| org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[] webapp=/solr3 path=/select params={hl=truehl.snippets=4hl.simple.pre=b/bfl=*,scorehl.mergeContiguous=truehl.usePhraseHighlighter=truehl.requireFieldMatch=trueechoParams=allhl.fl=text_tq={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)rows=20start=0wt=javabinversion=2} hits=204 status=500 QTime=166 |#] [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| org.apache.solr.servlet.SolrDispatchFilter| _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery $TooManyClauses: maxClauseCount is set to 1024 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127) at org.apache.lucene.search.ScoringRewrite $1.addClause(ScoringRewrite.java:51) at org.apache.lucene.search.ScoringRewrite $1.addClause(ScoringRewrite.java:41) at org.apache.lucene.search.ScoringRewrite $3.collect(ScoringRewrite.java:95) at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38) at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93) at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385) at
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217) at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131) at org.apache.so
Trunk build errors
Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
filter query or boolean?
Hi, Which is faster for compound boolean expressions: filter queries, or a single query with boolean expressions? For that matter, is there any difference other than maybe speed? thanks
Re: SolrJ + SolrCloud
Thanks Mark. Is there any plan to make all the Solr search handlers work with SolrCloud, like MLT? That missing feature would prohibit us from using SolrCloud at the moment. :( On Sat, 2012-02-11 at 18:24 -0500, Mark Miller wrote: On Feb 11, 2012, at 6:02 PM, Darren Govoni wrote: Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets , tvf's, etc. Darren SolrJ works the same in SolrCloud mode as it does in non SolrCloud mode - it's fully supported. There is even a new SolrJ client called CloudSolrServer that has built in cluster awareness and load balancing. In terms of what is supported - anything that is supported with distributed search - that is most things, but there is the odd man out - like MLT - looks like an issue is open here: https://issues.apache.org/jira/browse/SOLR-788 but it's not resolved yet. - Mark Miller lucidimagination.com
SolrJ + SolrCloud
Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets , tvf's, etc. Darren
Re: Range facet - Count in facet menu != Count in search results
Double check your default operator for a faceted search vs. regular search. I caught this difference in my work that explained this difference. On Fri, 2012-02-10 at 07:45 -0800, Yuhao wrote: Jay, Was the curly closing bracket } intentional? I'm using 3.4, which also supports fq=price:[10 TO 20]. The problem is the results are not working properly. From: Jan Høydahl jan@cominvent.com To: solr-user@lucene.apache.org; Yuhao nfsvi...@yahoo.com Sent: Thursday, February 9, 2012 7:45 PM Subject: Re: Range facet - Count in facet menu != Count in search results Hi, If you use trunk (4.0) version, you can say fq=price:[10 TO 20} and have the upper bound be exclusive. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 10. feb. 2012, at 00:58, Yuhao wrote: I've changed the facet.range.include option to every possible value (lower, upper, edge, outer, all)**. It only changes the count shown in the Ranges facet menu on the left. It has no effect on the count and results shown in search results, which ALWAYS is inclusive of both the lower AND upper bounds (which is equivalent to include = all). Is this by design? I would like to make the search results include the lower bound, but not the upper bound. Can I do that? My range field is multi-valued, but I don't think that should be the problem. ** Actually, it doesn't like outer for some reason, which leaves the facet completely empty.
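Jan's point about the 4.0 (trunk) syntax can be illustrated by mimicking the bracket semantics in plain code. The helper below is purely illustrative (not a Solr API); price and the 10/20 bounds come from the thread.

```python
def in_range(price, lower=10, upper=20, upper_inclusive=True):
    """Mimics Solr range-filter semantics:
    fq=price:[10 TO 20]  -> both bounds inclusive (3.x behavior)
    fq=price:[10 TO 20}  -> upper bound exclusive (4.0/trunk syntax)
    """
    if upper_inclusive:
        return lower <= price <= upper
    return lower <= price < upper

# With the exclusive upper bound, price=20 drops out of the results
# while price=10 still matches -- the behavior Yuhao was after.
assert in_range(20, upper_inclusive=True)
assert not in_range(20, upper_inclusive=False)
```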
Re: SolrCloud is in trunk.
Good job on this work. A monumental effort. On Wed, 8 Feb 2012 16:41:13 -0500, Mark Miller markrmil...@gmail.com wrote: For those that are interested and have not noticed, the latest work on SolrCloud and distributed indexing is now in trunk. SolrCloud is our name for a new set of distributed capabilities that improve upon the old style distributed search and index based replication. It provides for high availability and fault tolerance while allowing for near realtime search and an interface that matches what you are used to with previous versions of Solr. We are looking to release this in the next 4.0 release, and any feedback early users can provide will be very useful. So if you have an interest in these types of features, please take the latest trunk build for a spin and provide some feedback. There is still a lot more planned, so feel free to chime in on what you would like to see - this is essentially the end of stage one. You can read more about what we have done on the wiki: http://wiki.apache.org/solr/SolrCloud Also, a couple blog posts I recently saw pop up: http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search http://outerthought.org/blog/491-ot.html I'll contribute my own blog post as well when I get a chance, but there should be a fair amount of info there to get you started if you are interested. Thanks, - Mark Miller lucidimagination.com
Re: SolrCloud war?
UPDATE: I set my app server[1] system property jetty.port to be equal to the app servers open port and was able to get two Solr shards to talk. The overall properties I set are: App server domain 1: bootstrap_confdir collection.configName jetty.port solr.solr.home zkRun App server domain 2: bootstrap_confdir collection.configName jetty.port solr.solr.home zkHost I deployed each war app into the /solr context. I presume its needed by remote URL addressing. I checked the zookeeper config page and it shows both shards. Awesome. [1] Glassfish 3.1.1 On 02/01/2012 08:50 PM, Mark Miller wrote: I have not yet tried to run SolrCloud in another app server, but it shouldn't be a problem. One issue you might have is the fact that we count on hostPort coming from the system property jetty.port. This is set in the default solr.xml - the hostPort defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you are not going to use jetty.port. - Mark Miller lucidimagination.com On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote: Hi, I'm trying to get the SolrCloud2 examples to work using a war deployed solr into glassfish. The startup properties must be different in this case, because its having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren On 01/30/2012 06:58 PM, Darren Govoni wrote: Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.