negative array size exception
After migrating from standalone Solr to a load-balanced SolrCloud with 3 ZooKeepers on the same machines, and 3 shards (one per node), we see this logged in the UI on one of our Solr nodes. Does anyone know what this is symptomatic of?

java.lang.NegativeArraySizeException
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:63)
at org.apache.lucene.util.PriorityQueue.<init>(PriorityQueue.java:44)
at org.apache.solr.handler.component.ShardFieldSortedHitQueue.<init>(ShardFieldSortedHitQueue.java:45)
at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:979)
at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:763)
at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:742)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:428)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:534)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:745)
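For what it's worth, the trace shows the exception thrown while sizing the merge queue in ShardFieldSortedHitQueue, meaning something handed Lucene's PriorityQueue a negative size. One commonly reported trigger (an assumption here, not confirmed from these logs) is a distributed query with huge paging parameters, where start + rows overflows a signed 32-bit int. A sketch of the arithmetic in Python:

```python
def to_int32(n):
    """Wrap a Python int to Java's signed 32-bit overflow semantics."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# Hypothetical deep-paging parameters from a client:
start, rows = 2_147_483_000, 1_000

# In Java, sizing the queue as (start + rows) wraps negative, and
# allocating new Object[negative] throws NegativeArraySizeException.
queue_size = to_int32(start + rows)
print(queue_size < 0)  # True
```

If that's the case, capping rows and using cursor-based paging (cursorMark) instead of very large start offsets would avoid it.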
solr to solrcloud
Our out-of-the-box Solr 5.4.1 installation cannot handle the 50 GB analytics index anymore. We are using Sitecore 8.1 and planning to go to 8.2, but when we tried 8.2 we rebuilt the indexes and the site was very unresponsive, was missing items, and was too slow. We ended up giving that Solr server over 92 GB of RAM and saw that java.exe needed about 60 GB to process our massive index. Even then we couldn't get performance back into the site and decided to roll back to 8.1. We looked at options for scaling out horizontally because we cannot keep adding RAM to one Solr server. To go to SolrCloud we built 3 Ubuntu 14.04.5 servers, each with a 50 GB VM disk for the indexes and another VM disk for the OS, ZooKeeper, Java, Tomcat, and Solr applications. Each server has 32 GB of RAM. When we move to SolrCloud on these servers, what is the best way to set up the SolrCloud environment so it can take the data that already exists in our current Solr? We have about 16 indexes for Sitecore, with the biggest one being analytics (around 45-50 GB). Thanks, Darren Walker
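Not a Sitecore user, but a rough sketch of the SolrCloud side for Solr 5.x, with hypothetical names and paths (the zkcli.sh location and collection/config names will differ in your install):

```shell
# 1. Upload each core's config set to ZooKeeper:
server/scripts/cloud-scripts/zkcli.sh -zkhost zk1:2181,zk2:2181,zk3:2181 \
  -cmd upconfig -confdir /path/to/sitecore_analytics/conf -confname sitecore_analytics

# 2. Create the collection, one shard per node:
bin/solr create_collection -c sitecore_analytics -shards 3 -replicationFactor 1
```

Note that the existing standalone index generally can't just be copied into the new collection: in a sharded collection, documents are routed to shards by hash range, so the usual route is to create the collections first and then reindex from Sitecore.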
Re: Search opening hours
Sounds odd that the indexing times would change. Hopefully something else was going on - I've not experienced this. On Tue, Sep 8, 2015 at 4:31 AM, O. Klein <kl...@octoweb.nl> wrote: > BTW any idea how index speed is influenced? > > I used worldbounds with -1 and 1 y-axes. But figured this could also be 0. > > After changing to 0 indexing became a lot slower though (no exceptions in > log). > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4227531.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Darren
Re: Search opening hours
I think the client code has to normalize the input. There are methods in the spatial libraries that will do this - or maybe I wrote them in my code, can't remember. How are you handling parsing the hours? - Darren > On Sep 6, 2015, at 4:56 PM, O. Klein <kl...@octoweb.nl> wrote: > > Saw that, but not a lot of info about it. > > From my understanding, the way it's supposed to work is that a value bigger > than the boundary gets normalized. > > I just get an exception "bad x not in boundary rect" > > Any pointers? > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4227384.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Search opening hours
So thanks to the tireless efforts of David Smiley and the devs at Vivid Solutions (not to mention the various contributors that help power Solr and Lucene), spatial search is awesome, efficient, and easy. The biggest roadblock I've run into is not having the JTS (Java Topology Suite) JAR where Solr can find it. It doesn't ship with Solr OOB, so you have to either add it to one of the dynamic directories or bundle it with the WAR (I think pre-5.0). The link above has most of what you need to index data and issue queries. I'd also suggest the sections on spatial search in Solr in Action (Grainger, Potter) - they add a few more use cases that I've found interesting. Finally, the aging wiki has some good info too: http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

Basically, indexing spatial data is as easy as anything else: define the field in the schema, create the data, and push it in. Now the data in this case are boxes or polygons (effectively the same here) and come in a specific format known as WKT, or Well-Known Text: https://en.wikipedia.org/wiki/Well-known_text. I'd say unless you're aiming at an advanced use case, set the max dist error on the field config a little higher than normal - precision isn't really a requirement here, and good unit tests would alert you to any unforeseen issues.

Then for the query side of the world you just ask for point inclusion like: q=+polygon:Contains(POINT(my_long my_lat)) Please note that WKT reverses the usual lat/lng order because it uses Euclidean geometry conventions (so X=longitude and Y=latitude). Can't tell you how many times my brain hurt thanks to this idiom combined with janky client logic :) Anyway, that's about it - let me know if you have any other questions.

On Wed, Aug 26, 2015 at 1:56 PM, O. Klein kl...@octoweb.nl wrote: Darren, This sounds like the solution I'm looking for. Especially nice fix for the Sunday-Monday problem. Never worked with spatial search before, so any pointers are welcome.
Will start working on this solution. -- View this message in context: http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225443.html Sent from the Solr - User mailing list archive at Nabble.com. -- Darren
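A tiny sketch of the query construction above in Python (the field name `polygon` is from the example; the quoting around the predicate is how I'd write it - check the Spatial Search docs for your Solr version):

```python
def wkt_point(lon, lat):
    # WKT is X Y order: longitude first, latitude second.
    return f"POINT({lon} {lat})"

def contains_query(field, lon, lat):
    # Solr spatial predicate asking which indexed shapes contain the point.
    return f'{field}:"Contains({wkt_point(lon, lat)})"'

q = contains_query("polygon", -73.97, 40.78)
# q == 'polygon:"Contains(POINT(-73.97 40.78))"'
```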
Re: Search opening hours
Sorry - didn't finish my thought. I need to address querying :) So using the above to define what's in the index your queries for a day/time become a CONTAINS operation against the field. Let's say that the field is defined as a location_rpt using JTS and its Spatial Factory (which supports polygons) - oh, and it would need to be multi-valued. Querying the field would require first translating now or in an hour or Monday at 9am to a geocode, then hitting the index with a CONTAINS request per the docs: https://cwiki.apache.org/confluence/display/solr/Spatial+Search On Wed, Aug 26, 2015 at 11:23 AM, Darren Spehr darre...@gmail.com wrote: Sure - and sorry for its density. I reread it and thought the same ;) So imagine a polygon of say 1/2 mile width (I made that up) that stretches around the equator. Let's call this a week's timeline and subdivide it into 7 blocks, one for each day. For the sake of simplicity assume it's a line (which I forget but is supported in Solr as an infinitely small polygon) starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for Sunday at 11:59 PM. By subdivide you can think of it either radially or by longitude, but you have 360 degrees to divide into 7, which means that every hour is represented by a range of roughly 2.143 degrees (360/7/24). These regions represent each day and hour (or less), and the region boundaries represent midnight for the day before. Now for indexing - your open hours then become a combination of these subdivisions. If you're open 24x7 then the whole polygon is indexed. If you're only open on Monday from 9-5 then only the polygon between (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you can index any combination of times this way. So now the varsity question is how to do this with a fluctuating calendar? I think this example can be extended to include searching against any given day of the week in a year, or years. 
Just imagine a translation layer that adjusts the latitude N or S by some amount to represent which day in which year you're looking for. Make sense? On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote: delightfully dense = really intriguing, but I couldn't quite understand it - really hoping for more info On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote: Darren, That was delightfully dense. Do you think you could unpack it a bit more? Possibly some sample (pseudo) queries? Upayavira On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote: If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days) and then within this you can defined, like a timeline, what open and closed means. The problem of 3AM is taken care of because of it's continuous nature - ie one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380 Tuesday 9am-Wednesday 1am would be 1980:2940 You convert your NOW time into a minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night to Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren -- Darren -- Darren
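The degree arithmetic above is easy to get wrong by hand; here's a small sketch of the mapping (my own helper, day 0 = Monday) that reproduces the figures in the post:

```python
DEG_PER_HOUR = 360 / (7 * 24)  # ~2.143 degrees per hour, as above

def hour_to_lon(day, hour):
    """Map (day 0=Monday..6=Sunday, fractional hour 0-24) to a longitude,
    with Monday 12:00 AM at -180 and the end of Sunday at +180."""
    return -180 + (day * 24 + hour) * DEG_PER_HOUR

# Monday 9-5 indexes the segment between roughly -160.71 and -143.57:
open_lon = hour_to_lon(0, 9)    # ~ -160.71
close_lon = hour_to_lon(0, 17)  # ~ -143.57
```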
Re: Search opening hours
If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days), and then within each of these you can define, like a timeline, what open and closed mean. The problem of 3AM is taken care of by its continuous nature - i.e. one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380 Tuesday 9am-Wednesday 1am would be 1980:2940 You convert your NOW time into minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night to Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren
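Upayavira's minutes-in-a-week scale is easy to sketch; a hypothetical helper (day 0 = Monday) that reproduces his numbers:

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080

def minutes_since_monday(day, hour, minute=0):
    """day 0 = Monday ... 6 = Sunday; returns minutes since Monday 00:00."""
    return day * 24 * 60 + hour * 60 + minute

# The examples from the post:
monday_hours = (minutes_since_monday(0, 9), minutes_since_monday(0, 23))  # 540:1380
tue_to_wed = (minutes_since_monday(1, 9), minutes_since_monday(2, 1))     # 1980:2940
now = minutes_since_monday(0, 11, 23)                                     # 683
```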
Re: Search opening hours
Sure - and sorry for its density. I reread it and thought the same ;) So imagine a polygon of say 1/2 mile width (I made that up) that stretches around the equator. Let's call this a week's timeline and subdivide it into 7 blocks, one for each day. For the sake of simplicity assume it's a line (which I forget but is supported in Solr as an infinitely small polygon) starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for Sunday at 11:59 PM. By subdivide you can think of it either radially or by longitude, but you have 360 degrees to divide into 7, which means that every hour is represented by a range of roughly 2.143 degrees (360/7/24). These regions represent each day and hour (or less), and the region boundaries represent midnight for the day before. Now for indexing - your open hours then become a combination of these subdivisions. If you're open 24x7 then the whole polygon is indexed. If you're only open on Monday from 9-5 then only the polygon between (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you can index any combination of times this way. So now the varsity question is how to do this with a fluctuating calendar? I think this example can be extended to include searching against any given day of the week in a year, or years. Just imagine a translation layer that adjusts the latitude N or S by some amount to represent which day in which year you're looking for. Make sense? On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote: delightfully dense = really intriguing, but I couldn't quite understand it - really hoping for more info On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote: Darren, That was delightfully dense. Do you think you could unpack it a bit more? Possibly some sample (pseudo) queries? 
Upayavira On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote: If you wanted to try a spatial approach that blended times like above, you could try a polygon of minimum width that spans the globe - this is literally using spatial search (geocodes) against time. So in this scenario you logically subdivide the polygon into 7 distinct regions (for days) and then within this you can defined, like a timeline, what open and closed means. The problem of 3AM is taken care of because of it's continuous nature - ie one day is adjacent to the next, with Sunday and Monday backing up to each other. Just a thought. On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote: On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote: Those options don't fix my problem with closing times the next morning, or is there a way to do this? Use the spatial model, and a time window of a week. There are 10,080 minutes in a week, so you could use that as your scale. Assuming the week starts at 00:00 Monday morning, you might index Monday 9:00-23:00 as 540:1380 Tuesday 9am-Wednesday 1am would be 1980:2940 You convert your NOW time into a minutes since Monday 00:00 and do a spatial search within that time. If it is now Monday, 11:23am, that would be 11*60+23=683, so you would do a search for 683:683. If you have a shop that is open over Sunday night to Monday, you just list it as open until Sunday 23:59 and open again Monday 00:00. Would that do it? Upayavira -- Darren -- Darren
Solr 4.10.3 start up issue
Hi everyone - I posted a question on stackoverflow but in hindsight this would have been a better place to start. Below is the link. Basically I can't get the example working when using an external ZK cluster and auto-core discovery. Solr 4.10.1 works fine, but the newest release never gets new nodes into the active state. There are no errors or warnings, and compared to the log output of 4.10.1, the difference is that nodes never make it to leader election. Here is the stackoverflow question, along with the full log output: http://stackoverflow.com/questions/28004832/solr-4-10-3-is-not-proceeding-to-leader-election-on-new-cluster-startup-hangs Any help and guidance would be appreciated. Thanks! -- Darren
Re: Solr 4.10.3 start up issue
Thanks Hoss, this is exactly what I needed. I had previously run the example using nothing more than an external ZK hosting my own configuration. This of course means one of two things - my conf was bad, or Solr was at fault. The conf has been working for ages so I didn't test a replacement (it's amazing how a little frustration can fuel such hubris). I had thought to do this before - and should have; I uploaded the full example collection configuration to ZK just now and tried again. Magic, it worked, which left me feeling a bit glum. Well, happy that it wasn't Solr. Now if you'll excuse me, I have a conf review to perform. Darren On Wed, Jan 21, 2015 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org wrote: : I posted a question on stackoverflow but in hindsight this would have been : a better place to start. Below is the link. : : Basically I can't get the example working when using an external ZK cluster : and auto-core discovery. Solr 4.10.1 works fine, but the newest release your SO URL shows the output of using your custom configs, but not what you got with the example configs -- so it's not clear to me if there is really just one problem, or perhaps 2? you also mentioned a lot of details about how you are using solr with zk, and what doesn't work, but it's not clear if you tried other simpler steps using your configs -- or the example configs -- and if those simpler steps *did* work (ie: single node solr startup?) my best guess, based on the logs you did post and the mention of lib/mq/solr-search-ahead-2.0.0.jar in those logs, is that the entire question of zk and cluster state and leaders is a red herring, and what you are running into is: SOLR-6643... https://issues.apache.org/jira/browse/SOLR-6643 ...if i'm right, then simple core discovery with your configs on a single node solr instance w/o any knowledge of ZK will also fail to init the core -- and if you try to use the CoreAdmin API to CREATE a core, you'll get some kind of LinkageError.
: Here is the stackoverflow question, along with the full log output: : http://stackoverflow.com/questions/28004832/solr-4-10-3-is-not-proceeding-to-leader-election-on-new-cluster-startup-hangs -Hoss http://www.lucidworks.com/ -- Darren
RE: SolrCloud replica dies under high throughput
Thanks that helped. I no longer see the constant replica recovery. It also increased my throughput to 1.6/1.7 million per hour reliably. I actually then tried using SSDs instead and it flew up to 6.5 million updates per hour. Setup: 4 node cluster using m3.2xl AWS servers using general purpose SSDs. Thanks again, Darren -Original Message- From: Mark Miller [mailto:markrmil...@gmail.com] Sent: 22 July 2014 00:25 To: solr-user@lucene.apache.org Subject: Re: SolrCloud replica dies under high throughput Looks like you probably have to raise the http client connection pool limits to handle that kind of load currently. They are specified as top level config in solr.xml: maxUpdateConnections maxUpdateConnectionsPerHost -- Mark Miller about.me/markrmiller On July 21, 2014 at 7:14:59 PM, Darren Lee (d...@amplience.com) wrote: Hi, I'm doing some benchmarking with Solr Cloud 4.9.0. I am trying to work out exactly how much throughput my cluster can handle. Consistently in my test I see a replica go into recovering state forever caused by what looks like a timeout during replication. I can understand the timeout and failure (I am hitting it fairly hard) but what seems odd to me is that when I stop the heavy load it still does not recover the next time it tries, it seems broken forever until I manually go in, clear the index and let it do a full resync. Is this normal? Am I misunderstanding something? My cluster has 4 nodes (2 shards, 2 replicas) (AWS m3.2xlarge). I am indexing with ~800 concurrent connections and a 10 sec soft commit. I consistently get this problem with a throughput of around 1.5 million documents per hour. 
Thanks all, Darren Stack Traces Messages: [qtp779330563-627] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) Error while trying to recover.
core=assets_shard2_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:188) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235) Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:245) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:241) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead
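For reference, in the newer-style solr.xml the settings Mark mentions live in the update shard handler section. A sketch of what that might look like (element name and the values here are my best guess for this version - verify against the solr.xml reference for your release):

```xml
<solr>
  <updateshardhandler>
    <int name="maxUpdateConnections">100000</int>
    <int name="maxUpdateConnectionsPerHost">100</int>
  </updateshardhandler>
</solr>
```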
SolrCloud replica dies under high throughput
Hi, I'm doing some benchmarking with Solr Cloud 4.9.0. I am trying to work out exactly how much throughput my cluster can handle. Consistently in my test I see a replica go into recovering state forever caused by what looks like a timeout during replication. I can understand the timeout and failure (I am hitting it fairly hard) but what seems odd to me is that when I stop the heavy load it still does not recover the next time it tries, it seems broken forever until I manually go in, clear the index and let it do a full resync. Is this normal? Am I misunderstanding something? My cluster has 4 nodes (2 shards, 2 replicas) (AWS m3.2xlarge). I am indexing with ~800 concurrent connections and a 10 sec soft commit. I consistently get this problem with a throughput of around 1.5 million documents per hour. Thanks all, Darren Stack Traces Messages: [qtp779330563-627] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422) at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at 
java.lang.Thread.run(Thread.java:724) Error while trying to recover. core=assets_shard2_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.util.concurrent.FutureTask.get(FutureTask.java:188) at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615) at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371) at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235) Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:245) at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:241) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: java.net.SocketException: Socket closed at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.read(SocketInputStream.java:152) at java.net.SocketInputStream.read(SocketInputStream.java:122) at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260) at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251) at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123
SolrCloud - Highly Reliable / Scalable Resources?
Hi everyone, We have been using Solr Cloud (4.4) for ~6 months now. Functionally it's excellent, but we have suffered several issues which always seem quite problematic to resolve. I was wondering if anyone in the community can recommend good resources/reading for setting up a highly scalable, highly reliable cluster. A lot of what I see in the Solr documentation is aimed at small setups or is quite sparse. Dealing with topics like:
* Capacity planning
* Losing nodes
* Voting panic
* Recovery failure
* Replication factors
* Elasticity / auto scaling / scaling recipes
* Exhibitor
* Container configuration, concurrency limits, packet drop tuning
* Increasing capacity without downtime
* Scalable approaches to full indexing hundreds of millions of documents
* External health check vs CloudSolrServer
* Separate vs local zookeeper
* Benchmarks
Sorry, I know that's a lot to ask, heh. We are going to run a project for a month or so soon where we re-write all our run books and do deeper testing on various failure scenarios and the above, but any starting point would be much appreciated. Thanks all, Darren
MLT in SolrJ vs. URL?
Hi, I compose an MLT query in a URL and get back the queried result plus a list of documents in the moreLikeThis section in my browser. When I try to execute the same query in SolrJ, setting the same params, I only get the queried result document back and no MLT docs. What's the trick here? thanks, Darren
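Hard to say without seeing the SolrJ code, but a common culprit is one of the URL parameters not making it across: in SolrJ each MLT param has to be set explicitly on the query, under the same names the URL uses. As a checklist, a sketch of the URL form with the standard MLT component params (the query and field names here are hypothetical):

```python
from urllib.parse import urlencode

# Each of these must also be set on the SolrJ query object
# (e.g. query.set("mlt", "true"), query.set("mlt.fl", ...)).
params = {
    "q": "id:1234",          # hypothetical seed-document query
    "mlt": "true",           # enables the MoreLikeThis component on the handler
    "mlt.fl": "title,body",  # hypothetical fields; they need stored values or term vectors
    "mlt.mintf": 1,
    "mlt.mindf": 1,
}
url = "http://localhost:8983/solr/collection1/select?" + urlencode(params)
```

If the params really are identical, the other thing to check is that the SolrJ side hits the same request handler as the URL does.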
Re: zk Config URL?
(AbstractInhabitantImpl.java:78) at com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:253) at com.sun.enterprise.v3.server.AppServerStartup.doStart(AppServerStartup.java:145) at com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:136) at com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79) at com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63) at com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69) at com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97) at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55) Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) ... 55 more On 02/24/2013 08:32 PM, Mark Miller wrote: You either have to specifically upload a config set or use one of the bootstrap sys props. Are you doing either? - Mark On Feb 24, 2013, at 8:15 PM, Darren Govoni dar...@ontrenet.com wrote: Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it. 
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null ... [#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097) at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016) at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031) ... 10 more On 02/24/2013 07:21 PM, Michael Della Bitta wrote: Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways. Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format. 
If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I'm trying the latest solrcloud 4.1. Is there a button(or url) I can't find that shows me the zookeeper config XML, so I can check what other nodes are connected? Can't seem to find it. I deploy my solrcloud war into glassfish and set jetty.port (among other properties) to the GF domain port (e.g. 8181).' It starts successfully. I want zookeeper to run automatically within (as needed). How can I verify this or refer to the first/master server using zkHost from another node? (e.g. {host}:{port}) to form a cluster. I did this before a while ago, before solr 4.x was released, but things have changed. tips appreciated. thank you. Darren
Re: zk Config URL?
Ok. But it's way more complicated than it should be. It should work smarter. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Anirudha Jadhav aniru...@nyu.edu Date: To: solr-user@lucene.apache.org Subject: Re: zk Config URL? Solr cloud reads solr cfg files from zookeeper. You need to push the cfg to zookeeper and link the collection to the cfg. This is exactly what Mark suggested earlier in the thread. This is also explained in the solr cloud wiki. On Monday, February 25, 2013, Darren Govoni wrote: Hi Mark, I downloaded the latest zk and ran it. In my glassfish server, I set these system-wide properties: numShards = 1 zkHost = 10.x.x.x:2181 jetty.port = 8080 (port of my domain) bootstrap_config = true I copied all the solr 4.1 dist/*.jar into my glassfish domain lib/ext directory. Then I deployed the solr 4.1 war. It always throws this exception. [#|2013-02-25T13:31:32.304+|INFO|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WEB0171: Created virtual server [__asadmin]|#] [#|2013-02-25T13:31:32.768+|INFO|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WEB0172: Virtual server [server] loaded default web module []|#] [#|2013-02-25T13:31:34.222+|WARNING|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=10;_ThreadName=Thread-2;|DPL8007: Unsupported deployment descriptors element schemaLocation value http://www.bea.com/ns/weblogic/90 http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd|#] [#|2013-02-25T13:31:34.223+|SEVERE|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=10;_ThreadName=Thread-2;|DPL8006: get/add descriptor failure : filter-dispatched-requests-enabled TO false|#]
[#|2013-02-25T13:31:34.831+|SEVERE|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WebModule[/solr1]PWC1270: Exception starting filter SolrRequestFilter java.lang.NoClassDefFoundError: javax/servlet/Filter at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631) at java.lang.ClassLoader.defineClass(ClassLoader.java:615) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141) at java.net.URLClassLoader.defineClass(URLClassLoader.java:283) at java.net.URLClassLoader.access$000(URLClassLoader.java:58) at java.net.URLClassLoader$1.run(URLClassLoader.java:197) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at java.lang.ClassLoader.loadClass(ClassLoader.java:295) at com.sun.enterprise.v3.server.APIClassLoaderServiceImpl$APIClassLoader.loadClass(APIClassLoaderServiceImpl.java:206) at java.lang.ClassLoader.loadClass(ClassLoader.java:295) at java.lang.ClassLoader.loadClass(ClassLoader.java:295) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) at org.glassfish.web.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1456) at org.glassfish.web.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1359) at org.apache.catalina.core.ApplicationFilterConfig.loadFilterClass(ApplicationFilterConfig.java:280) at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:250) at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:120) at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4685) at
org.apache.catalina.core.StandardContext.start(StandardContext.java:5377) at com.sun.enterprise.web.WebModule.start(WebModule.java:498) at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:917) at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:901) at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:733) at com.sun.enterprise.web.WebContainer.loadWebModule(WebContainer.java:2019) at com.sun.enterprise.web.WebContainer.loadWebModule(WebContainer.java:1669) at com.sun.enterprise.web.WebApplication.start(WebApplication.java:109) at org.glassfish.internal.data.EngineRef.start(EngineRef.java:130) at org.glassfish.internal.data.ModuleInfo.start(ModuleInfo.java:269
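For reference, the advice in this thread (push the config to ZooKeeper, then link the collection to it) can be done with the zkcli tool that ships with Solr 4.x. The paths, ZK address, and config name below are assumptions for a stock install, not values confirmed by the thread:

```shell
# Upload a config directory to ZooKeeper under the name "myconf".
# zkcli.sh ships under example/cloud-scripts/ in the Solr 4.x distribution.
./zkcli.sh -zkhost 10.x.x.x:2181 -cmd upconfig \
    -confdir /path/to/solr/collection1/conf -confname myconf

# Link the collection to that uploaded config set.
./zkcli.sh -zkhost 10.x.x.x:2181 -cmd linkconfig \
    -collection collection1 -confname myconf
```

After the linkconfig step, the "Could not find configName for collection collection1" error above should no longer occur, since ZooKeeper now knows which config set backs the collection.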
zk Config URL?
Hi, I'm trying the latest solrcloud 4.1. Is there a button (or URL) I can't find that shows me the zookeeper config XML, so I can check what other nodes are connected? Can't seem to find it. I deploy my solrcloud war into glassfish and set jetty.port (among other properties) to the GF domain port (e.g. 8181). It starts successfully. I want zookeeper to run automatically within (as needed). How can I verify this, or refer to the first/master server using zkHost from another node (e.g. {host}:{port}) to form a cluster? I did this a while ago, before solr 4.x was released, but things have changed. Tips appreciated. thank you. Darren
Re: zk Config URL?
Thanks Michael. I went ahead and just started an external zookeeper, but my solr node throws exceptions from it. Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null ... [#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: Unable to create core: collection1 at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629) at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find configName for collection collection1 found:null at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097) at org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016) at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031) ... 10 more On 02/24/2013 07:21 PM, Michael Della Bitta wrote: Hello Darren, If you go into the admin and click on Cloud, you'll see that information represented in a number of ways. 
Both Dump and Tree (especially the clusterstate.json file) have this information represented as a document in JSON format. If you don't see the Cloud navigation on the left side of the admin screen, that's a good indication that Solr hasn't connected to Zookeeper. Michael Della Bitta Appinions 18 East 41st Street, 2nd Floor New York, NY 10017-6271 www.appinions.com Where Influence Isn’t a Game On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I'm trying the latest solrcloud 4.1. Is there a button(or url) I can't find that shows me the zookeeper config XML, so I can check what other nodes are connected? Can't seem to find it. I deploy my solrcloud war into glassfish and set jetty.port (among other properties) to the GF domain port (e.g. 8181).' It starts successfully. I want zookeeper to run automatically within (as needed). How can I verify this or refer to the first/master server using zkHost from another node? (e.g. {host}:{port}) to form a cluster. I did this before a while ago, before solr 4.x was released, but things have changed. tips appreciated. thank you. Darren
RE: SolrJ and Solr 4.0 | doc.getFieldValue() returns String instead of Date
SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S'Z'");
Date dateObj = df.parse("2009-10-29T00:00:009Z");

--- Original Message --- On 1/8/2013 09:34 AM uwe72 wrote:
A Lucene 4.0 document returns for a Date field now a string value, instead of a Date object.

field name=ModuleImpl.versionAsDate view=Datenstand type=date

Solr4.0 -- 2009-10-29T00:00:009Z
Solr3.6 -- Date instance

Can this be set somewhere in the config? I prefer to receive a date instance.

-- View this message in context: http://lucene.472066.n3.nabble.com/SolrJ-and-Solr-4-0-doc-getFieldValue-returns-String-instead-of-Date-tp4031588.html
Sent from the Solr - User mailing list archive at Nabble.com.
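Two details matter for the SimpleDateFormat approach suggested above: Solr's canonical timestamps use a 24-hour clock (`HH`, not `hh`), and they are always UTC, so the formatter needs an explicit time zone or it will parse in the JVM's default zone. A minimal sketch (the class name is mine, and I've used a well-formed timestamp rather than the one quoted above):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateParse {
    // Parse a Solr/ISO-8601 UTC timestamp such as "2009-10-29T00:00:00Z".
    public static Date parseSolrDate(String s) throws ParseException {
        // HH = 24-hour clock; the trailing 'Z' is matched as a literal,
        // so the zone must be forced to UTC explicitly.
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        return df.parse(s);
    }

    public static void main(String[] args) throws ParseException {
        Date d = parseSolrDate("2009-10-29T00:00:00Z");
        System.out.println(d.getTime()); // epoch millis: 1256774400000
    }
}
```

Note that SimpleDateFormat is not thread-safe, so in a multithreaded indexer each thread should get its own instance.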
RE: RE: Max number of core in Solr multi-core
This should be clarified some. In the client API, SolrServer represents a connection to a single server backend/endpoint and should be re-used where possible. The approach being discussed is to have one client connection (represented by the SolrServer class) per solr core, all residing in a single solr server (as is the case below, but not required).

--- Original Message --- On 1/7/2013 08:06 AM Jay Parashar wrote:
This is the exact approach we use in our multithreaded env. One server per core. I think this is the recommended approach.

-----Original Message----- From: Parvin Gasimzade [mailto:parvin.gasimz...@gmail.com] Sent: Monday, January 07, 2013 7:00 AM To: solr-user@lucene.apache.org Subject: Re: Max number of core in Solr multi-core

I know that, but my question is different. Let me ask it in this way. I have a solr with base url localhost:8998/solr and two solr cores as localhost:8998/solr/core1 and localhost:8998/solr/core2. I have one base SolrServer instance initialized as: SolrServer server = new HttpSolrServer( url ); I have also created SolrServers for each core as: SolrServer core1 = new HttpSolrServer( url + "/core1" ); SolrServer core2 = new HttpSolrServer( url + "/core2" ); Since there are many cores, I have to initialize a SolrServer as shown above. Is there a way to create only one SolrServer with the base url and access each core using it? If it is possible, then I don't need to create a new SolrServer for each core.

On Mon, Jan 7, 2013 at 2:39 PM, Erick Erickson erickerick...@gmail.com wrote: This might help: https://wiki.apache.org/solr/Solrj#HttpSolrServer Note that the associated SolrRequest takes the path, I presume relative to the base URL you initialized the HttpSolrServer with. Best Erick

On Mon, Jan 7, 2013 at 7:02 AM, Parvin Gasimzade parvin.gasimz...@gmail.com wrote: Thank you for your responses. I have one more question related to Solr multi-core. By using SolrJ I create a new core for each application. When a user wants to add data or make a query on his application, I create a new HttpSolrServer for this core. In this scenario there will be many running HttpSolrServer instances. Is there a better solution? Does it cause a problem to run many instances at the same time?

On Wed, Jan 2, 2013 at 5:35 PM, Per Steffensen st...@designware.dk wrote: g a collection per application instead of a core
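The "re-use where possible" advice above can be made concrete by caching one client per core rather than constructing one per request. A hedged sketch of that pattern: `CoreClient` below is a stand-in for SolrJ's `HttpSolrServer` (which is thread-safe and meant to be shared), so in real code the constructor call would be `new HttpSolrServer(url)` instead.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class CoreClientCache {
    // Stand-in for HttpSolrServer, so this sketch stays self-contained.
    public static class CoreClient {
        public final String url;
        public CoreClient(String url) { this.url = url; }
    }

    private final ConcurrentMap<String, CoreClient> clients =
            new ConcurrentHashMap<String, CoreClient>();
    private final String baseUrl;

    public CoreClientCache(String baseUrl) { this.baseUrl = baseUrl; }

    // Returns the shared client for a core, creating it at most once.
    public CoreClient clientFor(String core) {
        String url = baseUrl + "/" + core;
        CoreClient client = clients.get(url);
        if (client == null) {
            // putIfAbsent keeps exactly one instance under concurrent access.
            clients.putIfAbsent(url, new CoreClient(url));
            client = clients.get(url);
        }
        return client;
    }
}
```

With this in place, "many cores" costs one long-lived client per core instead of one per request, which is the scenario the thread is worried about.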
Re: Terminology question: Core vs. Collection vs...
This is a good explanation and makes sense. The one inconsistency is referring to a replica of a shard that has no replication. But it's not that big a problem. If you wove the term 'core' into your writeup below it would be complete and should be posted on the wiki. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Jack Krupansky j...@basetechnology.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Replication makes perfect sense even if our explanations so far do not. A shard is an abstraction of a subset of the data for a collection. A replica is an instance of the data of the shard and instances of Solr servers that have indicated a readiness to service queries and updates for the data. Alternatively, a replica is a node which has indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment. Let's describe it operationally for SolrCloud: If data comes in to any replica of a shard, it will automatically and quickly be replicated to all other replicas of the shard. If a new replica of a shard comes up, it will be streamed all of the data from another replica of the shard. If an existing replica of a shard restarts or reconnects to the cluster, it will be streamed updates of any new data since it was last updated from another replica of the shard. Replication is simply the process of assuring that all replicas are kept up to date. That's the same abstract meaning as for Master/Slave, even though the operational details are somewhat different. The goal remains the same. Replication factor is the number of instances of the data of the shard and instances of Solr servers that can service queries and updates for the data. Alternatively, the replication factor is the number of nodes of the SolrCloud cluster which have indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment.
A node is an instance of Solr running in a Java JVM that has indicated to the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a replica for a shard of a collection. [The latter part of that is a bit too fuzzy - I'm not sure what the node tells Zookeeper and who does shard assignment. I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node registers with Zookeeper and then somehow becomes a replica for a shard.] A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection. My only interest here on this user list is to understand and explain the terms we have today and that SEEM to be working for the most part, even though we may not have defined them carefully enough and used them consistently enough. If somebody want to propose an alternative terminology - fine, discuss that on the dev list and/or file a Jira. I won't claim that my definitions are perfect (yet), but perfecting the definitions (for users) should be separated from changing the terms themselves. -- Jack Krupansky -Original Message- From: Per Steffensen Sent: Friday, January 04, 2013 2:49 AM To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... On 1/3/13 5:58 PM, Walter Underwood wrote: A factor is multiplied, so multiplying the leader by a replicationFactor of 1 means you have exactly one copy of that shard. I think that recycling the term replication within Solr was confusing, but it is a bit late to change that. wunder Yes, the term factor is not misleading, but the term replication is. 
If we keep calling shard-instances for Replica I guess replicaFactor will be ok - at least much better than replicationFactor. But it would still be better with e.g. ShardInstance and InstancesPerShard
Re: Terminology question: Core vs. Collection vs...
Yes, that's it. It's clear if we separate logical terms from physical terms. A simple cake diagram on the wiki, along with perhaps a UML diagram, will solidify these concepts. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Jack Krupansky j...@basetechnology.com Date: To: solr-user@lucene.apache.org,darren dar...@ontrenet.com Subject: Re: Terminology question: Core vs. Collection vs... I thought about adding Solr core, but it only muddies the water. Yes, it needs to be added, but carefully. In the context of SolrCloud, a Solr core is the underlying representation of a replica. Alternatively, a replica of a shard of a collection is implemented as a Solr core. [Need to factor in the potential for multiple shards on a single node.] Or, a Solr core is capable of serving as a replica of a shard. A Solr core has a collection name but can exist without being registered with Zookeeper, so it may not be a replica of a zookeeper-registered collection. Something like that. Not quite there yet. The main point, I think, is that when we talk about SolrCloud or a Solr cluster, it would be better for people to speak of replicas and shards and collections than cores, since core is the implementation rather than the abstraction. I mean, at the level of cores, they know of only documents and fields, not shards, replicas, and the overall structure of collections and the cluster. Sure, the core has the name of the collection, but cores on other nodes can use that same name. -- Jack Krupansky -Original Message- From: darren Sent: Friday, January 04, 2013 9:00 AM To: j...@basetechnology.com ; solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... This is a good explanation and makes sense. The one inconsistency is referring to a replica of a shard that has no replication. But it's not that big a problem. If you wove the term 'core' into your writeup below it would be complete and should be posted on the wiki.
Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Jack Krupansky j...@basetechnology.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Replication makes perfect sense even if our explanations so far do not. A shard is an abstraction of a subset of the data for a collection. A replica is an instance of the data of the shard and instances of Solr servers that have indicated a readiness to service queries and updates for the data. Alternatively, a replica is a node which has indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment. Lets describe it operationally for SolrCloud: If data comes in to any replica of a shard it will automatically and quickly be replicated to all other replicas of the shard. If a new replica of a shard comes up it will be streamed all of the data from the another replica of the shard. If an existing replica of a shard restarts or reconnects to the cluster, it will be streamed updates of any new data since it was last updated from another replica of the shard. Replication is simply the process of assuring that all replicas are kept up to date. That's the same abstract meaning as for Master/Slave even though the operational details are somewhat different. The goal remains the same. Replication factor is the number of instances of the data of the shard and instances of Solr servers that can service queries and updates for the data. Alternatively, the replication factor is the number of nodes of the SolrCloud cluster which have indicated a readiness to receive and serve the data of a shard, but may not have any data at the moment. A node is an instance of Solr running in a Java JVM that has indicated to the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a replica for a shard of a collection. [The latter part of that is a bit too fuzzy - I'm not sure what the node tells Zookeeper and who does shard assignment. 
I mean, does a node explicitly say what shard it wants to be, or is that assigned by Zookeeper, or is that a node's choice/option? But none of that changes the fact that a node registers with Zookeeper and then somehow becomes a replica for a shard.] A node (instance of a Solr server) can be a replica of shards from multiple collections (potentially multiple shards per collection). A node is not a replica per se, but a container that can serve multiple collections. A node can serve as multiple replicas, each of a different collection. My only interest here on this user list is to understand and explain the terms we have today and that SEEM to be working for the most part, even though we may not have defined them carefully enough and used them consistently enough. If somebody want to propose an alternative terminology - fine, discuss that on the dev list and/or file a Jira. I won't claim that my definitions are perfect (yet
Re: Terminology question: Core vs. Collection vs...
Agreed. But for completeness can it be node/collection/shard/replica/core? Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Yonik Seeley yo...@lucidworks.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote: Our biggest problem is that we really haven't decided once and for all and made sure to reflect the decision consistently across code and documentation. As long as we haven't, I believe it is still ok to change our minds. IMO, I *think* it's settled: a collection consists of 1 or more shards, which each consist of one or more replicas. A *long* time ago (3 years actually), I tried to get slice used in place of shard, just because shard was already used ambiguously by people for both physical and logical shards, but it never caught on, and as I recall no one could really agree on a set of terms that satisfied everyone. Attempting to replace Replica with something like Shard Instance could actually end up being worse, since it's a mouthful and people would tend to shorten it to shard when talking about it. From a practical standpoint, I don't think people will be confused by the current terminology once we document it well (we should probably start with collection/shard/replica). It's mostly an issue of when one goes looking for inconsistencies or things that might not make sense. And as has been pointed out, others use the exact same terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication In the *code* I have been migrating away from shard as the physical kind. I've also used slice as a synonym for logical shard in the code because of this mixed history of shard, and since removing all remnants of the use of shard as physical all at once would be impractical. Anyone who works on the code should not be bothered by an extra synonym, and things will continue to be cleaned up over time.
-Yonik http://lucidworks.com
Re: Terminology question: Core vs. Collection vs...
Actually. Node/collection/shard/replica/core/index Sent from my Verizon Wireless 4G LTE Smartphone Original message From: darren dar...@ontrenet.com Date: To: yo...@lucidworks.com,solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Agreed. But for completeness can it be node/collection/shard/replica/core? Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Yonik Seeley yo...@lucidworks.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote: Our biggest problem is that we really havent decided once and for all and made sure to reflect the decision consistently across code and documentation. As long as we havnt I believe it is still ok to change our minds. IMO, I *think* it's settled: It's collection consists of 1 or more shards, which each consist of one or more replicas. A *long* time ago (3 years actually), I tried to get slice used in place of shard just because shard was already used ambiguously by people for both physical and logical shards, but it never caught on, and as I recall no one could really agree on a set of terms that satisfied everyone. Attempting to replace Replica with something like Shard Instance could actually end up being worse since it's a mouthful and people would tend to shorten it to shard when talking about it. From a practical standpoint, I don't think people will be confused by the current terminology once we document it well (we should probably start with collection/shard/replica). It's mostly an issue of when one goes looking for inconsistencies or things that might not make sense. And as has been pointed out, others use the exact same terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication In the *code* I have been migrating away from shard as the physical kind. 
I've also used slice as a synonym for logical shard in the code because of this mixed history of shard and since removing all remnants of the use of shard as physical all at once would be impractical. Anyone who works on the code should not be bothered by an extra synonym, and things will continue to be cleaned up over time. -Yonik http://lucidworks.com
Re: Terminology question: Core vs. Collection vs...
My understanding is core is a logical solr term. Index is a physical lucene term. A solr core is backed by a physical lucene index. One index per core. Solr team can correct me if it's not accurate. :) Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Alexandre Rafalovitch arafa...@gmail.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of detail. And I vote on the cake diagram for the WIKI as well. Perhaps two, with the first one showing the trivial collapsed state of a single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'. I think I followed everything, except for the just-added term 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core? Regards, Alex. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote: This is the containment hierarchy i understand but includes both physical and logical. Original message From: darren dar...@ontrenet.com Date: To: dar...@ontrenet.com,yo...@lucidworks.com,solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Actually. Node/collection/shard/replica/core/index Original message From: darren dar...@ontrenet.com Date: To: yo...@lucidworks.com,solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Agreed. But for completeness can it be node/collection/shard/replica/core?
Re: Terminology question: Core vs. Collection vs...
I agree. In my opinion index is a low level lucene thing. I never say a collection has an index directly. That confuses levels and creates confusion. To me at least. I think the terminology discussed is good. Just some lingering usage inconsistencies. Sent from my Verizon Wireless 4G LTE Smartphone Original message From: Alexandre Rafalovitch arafa...@gmail.com Date: To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Hmm. Doesn't that make (logical) index=collection? And (physical) index=core? Which creates duplication of terminology and at the same time can cause confusion between highest logical and lowest physical level. Regards, Alex. P.s. Hoping not to start a new terminology war. Personal blog: http://blog.outerthoughts.com/ LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book) On Fri, Jan 4, 2013 at 1:21 PM, Jack Krupansky j...@basetechnology.comwrote: The entire collection does have an index - a distributed index - which consists of a Lucene index on each core/replica for the subset of the data in that shard. -- Jack Krupansky -Original Message- From: Alexandre Rafalovitch Sent: Friday, January 04, 2013 1:12 PM To: solr-user@lucene.apache.org Subject: Re: Terminology question: Core vs. Collection vs... Can I just start by saying that this was AMAZING. :-) When I asked the question, I certainly did not expect this level of details. And I vote on the cake diagram for WIKI as well. Perhaps, two with the first one showing the trivial collapsed state of single collection/shard/replica/core. The trivial one will also help to explain why the example is now called 'collection1'. I think I followed everything, except for just added term of 'index'. Isn't that the same as 'core'? Or can we have several indexes in one core? Regards, Alex. 
On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:

This is the containment hierarchy I understand, but it includes both physical and logical.

-------- Original message --------
From: darren dar...@ontrenet.com
To: dar...@ontrenet.com, yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Actually: node/collection/shard/replica/core/index

-------- Original message --------
From: darren dar...@ontrenet.com
To: yo...@lucidworks.com, solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Agreed. But for completeness, can it be node/collection/shard/replica/core?
Re: Terminology question: Core vs. Collection vs...
Good point. Agree.

Sent from my Verizon Wireless 4G LTE Smartphone

-------- Original message --------
From: Upayavira u...@odoko.co.uk
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Using your terminology, I'd say core is a physical Solr term, and index is a physical Lucene term. A collection or a shard is a logical Solr term.

Upayavira

On Fri, Jan 4, 2013, at 06:28 PM, darren wrote:

My understanding is that core is a logical Solr term and index is a physical Lucene term. A Solr core is backed by a physical Lucene index, one index per core. The Solr team can correct me if that's not accurate. :)
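The node/collection/shard/replica/core/index containment hierarchy proposed above can be sketched in code. This is a hypothetical illustration of the terminology being discussed, not Solr's actual data model; the class names follow the hierarchy, and the collection/core names are invented:

```python
from dataclasses import dataclass, field

@dataclass
class Core:        # physical Solr core, backed by one Lucene index
    name: str
    index_dir: str  # where the Lucene index lives on disk

@dataclass
class Replica:     # one copy of a shard's data, hosted on a core
    node: str       # the Solr instance (host:port) holding it
    core: Core

@dataclass
class Shard:       # logical slice of the collection's documents
    name: str
    replicas: list = field(default_factory=list)

@dataclass
class Collection:  # logical index, spanning all shards
    name: str
    shards: list = field(default_factory=list)

collection = Collection("collection1", shards=[
    Shard("shard1", replicas=[
        Replica("node1:8983", Core("collection1_shard1_replica1", "/var/solr/s1r1")),
        Replica("node2:8983", Core("collection1_shard1_replica2", "/var/solr/s1r2")),
    ]),
    Shard("shard2", replicas=[
        Replica("node2:8983", Core("collection1_shard2_replica1", "/var/solr/s2r1")),
    ]),
])

# Every replica maps 1:1 to a core, and (per the thread) a core currently
# maps 1:1 to a Lucene index.
total_cores = sum(len(s.replicas) for s in collection.shards)
print(total_cores)  # 3
```

Note how the physical terms (core, index) appear only at the leaves, while collection and shard are purely logical groupings above them.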
Re: Terminology question: Core vs. Collection vs...
Yes. In that case, core would best be described as a logical Solr entity with various managed attributes and qualities above the physical layer (sorry, not trying to perpetuate this thread too much).

On 01/04/2013 01:55 PM, Mark Miller wrote:

Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason that needs to always be that way. It's possible that we may at some point add built-in micro-sharding support that means a SolrCore could have multiple underlying Lucene indexes. Or we may not.

- Mark

On Jan 4, 2013, at 1:49 PM, darren dar...@ontrenet.com wrote:

Good point. Agree.
RE: Re: Terminology question: Core vs. Collection vs...
Good write-up. And what about node?

I think there needs to be an official glossary of terms sanctioned by the Solr team, and some terms still in use may need to be labeled deprecated. After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

Instance is a general term, but is commonly used to refer to a running Solr server, each of which can service any number of cores. A sharded collection would typically require multiple instances of Solr, each with a shard of the collection.

Multiple collections can be supported on a single instance of Solr. They don't have to be sharded or replicated. But if they are, each Solr instance will have a copy or replica of the data (index) of one shard of each sharded collection - to the degree that each collection needs that many shards.

At the API level, you talk to a Solr instance, using a host and port, and giving the collection name. Some operations will refer only to the portion of a multi-shard collection on that Solr instance, but typically Solr will distribute the operation, whether it be an update or a query, to all of the shards of the named collection. In the case of an update, the update will be distributed to all replicas as well, but in the case of a query, only one replica of each shard of the collection is needed.

Before SolrCloud, Solr had master and slave, and the slaves were replicas of the master. With SolrCloud there is no master, and all the replicas of a shard are peers, although at any moment in time one of them will be considered the leader for coordination purposes - but not in the sense that it is a master of the other replicas in that shard. A SolrCloud replica is a replica of the data, in an abstract sense, for a single shard of a collection. A SolrCloud replica is more of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single Solr core will have a Lucene index, but collectively the Lucene indexes for the shards of a collection can be referred to as the index of the collection. Each replica is a copy or instance of a portion of the collection's index.

The term slice is sometimes used to refer collectively to all of the cores/replicas of a single shard, or sometimes to a single replica, as it contains only a slice of the full collection data.

-- Jack Krupansky

-----Original Message-----
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for the correct rather than loose meaning, as I am trying to teach an example that starts from an easy scenario and may scale to a multi-core, multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over in my mind at the moment:

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through documentation, but either there is a terminology drift or I am having trouble understanding the distinctions.

If anybody has a clear picture in their mind, I would appreciate a clarification.

Regards,
   Alex.
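The routing rule described above - an update is distributed to all replicas of its shard, while a query needs only one replica per shard - can be sketched as follows. The cluster topology and node names here are made up for illustration, not taken from any real deployment:

```python
import random

# Toy topology: two shards, each with a list of replica locations.
cluster = {
    "shard1": ["node1:8983", "node2:8983"],
    "shard2": ["node2:8983", "node3:8983"],
}

def update_targets(shard: str) -> list:
    """An update is forwarded to ALL replicas of the document's shard."""
    return list(cluster[shard])

def query_targets() -> list:
    """A query fans out to ONE replica of EACH shard."""
    return [random.choice(replicas) for replicas in cluster.values()]

print(update_targets("shard1"))  # both replicas of shard1
print(len(query_targets()))      # one target per shard -> 2
```

This is why replication helps query throughput (any peer replica can serve a shard's portion of a query) but not update throughput (every replica must apply every update for its shard).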
RE: Re: Terminology question: Core vs. Collection vs...
Thanks again. (And sorry to jump into this convo.) But I had a question about your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the collection may be sharded, with each shard on one or more cores, with each core being a replica of the other cores within that shard of that collection.

A collection is sharded, meaning it is distributed across cores. A shard itself is not distributed across cores in the same sense; rather, a shard exists on a single core and is replicated on other cores. Is that right? The way it's worded above, it sounds like a shard can also be sharded...

--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real machine or a virtualized machine. Technically, you could have multiple virtual nodes on the same physical box. Each Solr replica would be on a different node.

Technically, you could have multiple Solr instances running on a single hardware node, each with a different port. They are simply instances of Solr, although you could consider each Solr instance a node in a Solr cloud as well - a virtual node. So, technically, you could have multiple replicas on the same node, but that sort of defeats most of the purpose of having replicas in the first place - to distribute the data for performance and fault tolerance. But you could have replicas of different shards on the same node/box for a partial improvement of performance and fault tolerance.

A Solr 'cloud' is really a cluster.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Thanks. I got that part. A group of shards (and therefore cores) represents a collection, yes. But a single shard exists only on a single core?

--- Original Message ---
On 1/3/2013 09:03 AM Jack Krupansky wrote:

No, a shard is a subset (or slice) of the collection. Sharding is a way of slicing the original data, before we talk about how the shards get stored and replicated on actual Solr cores. Replicas are instances of the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
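The point that "sharding is a way of slicing the original data" can be made concrete with a toy routing function: each document id hashes into exactly one shard, and replication then copies that shard's data elsewhere. This mimics the idea behind hash-based routing but is not Solr's actual routing algorithm; the shard count and document ids are invented:

```python
import hashlib

NUM_SHARDS = 3

def shard_for(doc_id: str) -> str:
    """Deterministically assign a document id to exactly one shard."""
    h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
    return f"shard{h % NUM_SHARDS + 1}"

# Every document lands on exactly one shard; replication then copies that
# shard's data to other cores. We replicate shards - we never re-shard them.
docs = [f"doc{i}" for i in range(100)]
placement = {d: shard_for(d) for d in docs}
assert all(p in {"shard1", "shard2", "shard3"} for p in placement.values())
```

Because the assignment is a pure function of the id, every replica of a shard agrees on which documents belong to it, which is what makes "a replica of a shard" well defined.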
RE: Re: Terminology question: Core vs. Collection vs...
I think what's confusing about your explanation below is when you have a situation where there is no replication factor. That's possible too, yes? So in that case, is each core of a shard of a collection still referred to as a replica? To me, a replica is a duplicate/backup of a shard's core, not the sharded core itself. Or is there just no difference - even a non-replicated core is called a replica?

--- Original Message ---
On 1/3/2013 09:08 AM Jack Krupansky wrote:

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Yes. And it's worth noting that when you have multiple shards in a single node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise node to note that in SolrCloud a node is simply an instance of a Solr server.

And, technically, you can have multiple shards in a single instance of Solr, separating the logical sharding of keys from the distribution of the data.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Ah, ok. Good. Makes sense. I think I will draw all this up in a UML diagram that includes the distinction between the logical terms and the physical terms (and their mapping), as they do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:

A single shard MAY exist on a single core, but only if it is not replicated. Generally, a single shard will exist on multiple cores, each a replica of the source data as it comes into the update handler.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...
Great point.

--- Original Message ---
On 1/3/2013 10:42 AM Per Steffensen wrote:
> On 1/3/13 4:33 PM, Mark Miller wrote:
>> This has pretty much become the standard across other distributed systems
>> and in the literat…err…books.
> Hmmm, I'm not sure you are right about that. Maybe more than one
> distributed system calls them Replica, but there are also a lot that don't.
> But if you are right, that's at least a good valid argument to do it this
> way, even though I generally prefer good logical naming over following bad
> naming from the industry :-) Just because there is a lot of crap out there
> doesn't mean that we also want to make crap. Maybe good logical naming
> could even be a small entry in the "Why Solr is better than its
> competitors" list :-)
RE: Re: Terminology question: Core vs. Collection vs...
And based on the previous explanation there is never a copy of a shard. A shard represents and contains only replicas of itself, replicas being copies of the cores within the shard.

--- Original Message ---
On 1/3/2013 11:58 AM Walter Underwood wrote:
> A factor is multiplied, so multiplying the leader by a replicationFactor of
> 1 means you have exactly one copy of that shard.
>
> I think that recycling the term "replication" within Solr was confusing,
> but it is a bit late to change that.
>
> wunder
>
> On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:
>> This has pretty much become the standard across other distributed systems
>> and in the literat…err…books.
>>
>> I first implemented it as you mention you'd like, but Yonik correctly
>> pointed out that we were going against the grain.
>>
>> - Mark
>>
>> On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:
>>> For the same reasons that Replica shouldn't be called Replica (it
>>> requires too long an explanation to agree that it is an ok name),
>>> replicationFactor shouldn't be called replicationFactor as long as it
>>> refers to the TOTAL number of cores you get for your shard.
>>> replicationFactor would be an ok name if replicationFactor=0 meant one
>>> core, replicationFactor=1 meant two cores, etc., but as long as
>>> replicationFactor=1 means one core and replicationFactor=2 means two
>>> cores, it is bad naming (you will not get any replication with
>>> replicationFactor=1 - WTF!?!?). If we want to insist that you specify the
>>> total number of cores, at least use replicaPerShard instead of
>>> replicationFactor, or even better rename Replica to Shard-instance and
>>> use instancesPerShard instead of replicationFactor.
>>>
>>> Regards, Per Steffensen
>>>
>>> On 1/3/13 3:52 PM, Per Steffensen wrote:
>>>> Hi
>>>>
>>>> Here is my version - I do not believe the explanations have been very
>>>> clear.
>>>>
>>>> We have the following concepts (here I will try to explain what each
>>>> concept covers without naming it - it's hard):
>>>> 1) Machines (virtual or physical) running Solr server JVMs (one machine
>>>> can run several Solr server JVMs if you like)
>>>> 2) Solr server JVMs
>>>> 3) Logical stores where you can add/update/delete data-instances
>>>> (closest to logical tables in an RDBMS)
>>>> 4) Logical slices of a store (closest to non-overlapping logical sets of
>>>> rows for the logical table in an RDBMS)
>>>> 5) Physical instances of slices (a physical (disk/memory) instance of
>>>> the logical slice). This is where data actually goes on disk - the
>>>> logical stores and slices above are just non-physical concepts.
>>>>
>>>> Terminology:
>>>> 1) Believe we have no name for this (except of course "machine" :-) ),
>>>> even though Jack claims that this is called a node. Maybe sometimes it
>>>> is called a node, but I believe node is more often used to refer to a
>>>> Solr server JVM.
>>>> 2) Node
>>>> 3) Collection
>>>> 4) Shard. Used to be called Slice, but I believe now it is officially
>>>> called Shard. I agree with that change, because I believe most of the
>>>> industry also uses the term Shard for this logical/non-physical concept
>>>> - it just needs to be reflected across documentation and code.
>>>> 5) Replica. Used to be called Shard, but I believe now it is officially
>>>> called Replica. I certainly do not agree with the name Replica, because
>>>> it suggests that it is a copy of an original, but it isn't. I would
>>>> prefer Shard-instance here, to avoid the confusion. I understand that
>>>> you can argue (if you argue long enough) that Replica is a fine name,
>>>> but you really need the explanation to understand why Replica can be
>>>> defended as the name for this. It is not immediately obvious what this
>>>> is as long as it is called Replica. A Replica is basically a
>>>> SolrCloud-managed Core, and behind every Replica/Core lives a physical
>>>> Lucene index. So a Replica (= Core) contains/maintains a Lucene index
>>>> behind the scenes. The term Replica also needs to be reflected across
>>>> documentation and code.
>>>>
>>>> Regards, Per Steffensen

--
Walter Underwood
wun...@wunderwood.org
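Per's five concepts and the disputed replicationFactor semantics are easy to mis-read in prose. Here is a toy Python model of the mapping (Collection -> Shards -> Replicas, one core per Replica). This is an illustration of the terminology only, not Solr code; the generated core names are invented.

```python
# Toy model of the SolrCloud terminology discussed above (a sketch, not Solr code).
# Collection -> Shards (logical slices) -> Replicas, each backed by one core.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Replica:          # physical instance of a shard; maps to one Solr core
    core_name: str

@dataclass
class Shard:            # logical slice of the collection
    name: str
    replicas: List[Replica] = field(default_factory=list)

@dataclass
class Collection:       # logical store, closest to a table in an RDBMS
    name: str
    shards: List[Shard] = field(default_factory=list)

def build_collection(name: str, num_shards: int, replication_factor: int) -> Collection:
    """replicationFactor here is the TOTAL copies per shard (Per's complaint:
    replicationFactor=1 means one copy, i.e. no actual replication)."""
    coll = Collection(name)
    for s in range(1, num_shards + 1):
        shard = Shard(f"shard{s}")
        for r in range(1, replication_factor + 1):
            # hypothetical naming scheme, just for illustration
            shard.replicas.append(Replica(f"{name}_shard{s}_replica{r}"))
        coll.shards.append(shard)
    return coll

coll = build_collection("collection1", num_shards=2, replication_factor=3)
total_cores = sum(len(s.replicas) for s in coll.shards)
print(total_cores)  # 2 shards x replicationFactor 3 = 6 cores
```

With replication_factor=1 the total core count equals num_shards, which is exactly the "no replication with replicationFactor=1" point being argued.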
Re: Terminology question: Core vs. Collection vs...
I see. So sharding and distributing/replicating can have separate and different advantages.

On 01/03/2013 01:06 PM, Lance Norskog wrote:
> Also, searching can be much faster if you put all of the shards on one
> machine, along with the search distributor. That way, you search with
> multiple simultaneous threads inside one machine. I've seen this make
> searches several times faster.
>
> On 01/03/2013 06:36 AM, Jack Krupansky wrote:
>> Ah... the multiple shards (of the same collection) in a single node is
>> about planning for future expansion of your cluster - create more shards
>> than you need today, put more of them on a single node, and then migrate
>> them to their own nodes as the data outgrows the smaller number of nodes.
>> In other words, add nodes incrementally without having to reindex all the
>> data.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Darren Govoni
>> Sent: Thursday, January 03, 2013 9:18 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: Re: Terminology question: Core vs. Collection vs...
>>
>> Yes. And it's worth noting that when having multiple shards in a single
>> node (@deprecated), they are shards of different collections...
>>
>> --- Original Message ---
>> On 1/3/2013 09:16 AM Jack Krupansky wrote:
>>> And I would revise "node" to note that in SolrCloud a node is simply an
>>> instance of a Solr server.
>>>
>>> And, technically, you can have multiple shards in a single instance of
>>> Solr, separating the logical sharding of keys from the distribution of
>>> the data.
>>>
>>> -----Original Message-----
>>> From: Jack Krupansky
>>> Sent: Thursday, January 03, 2013 9:08 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Terminology question: Core vs. Collection vs...
>>>
>>> Oops... let me word that a little more carefully:
>>>
>>> "...we are replicating the data of each shard."
RE: Does SolrCloud supports MoreLikeThis?
There is a ticket for that with some recent activity (sorry, I don't have it handy right now), but I'm not sure if that work made it into the trunk, so probably SolrCloud does not support MLT... yet. Would love an update from the dev team though!

--- Original Message ---
On 11/5/2012 10:37 AM Luis Cappa Banda wrote:
> That's the question, :-)
>
> Regards,
>
> Luis Cappa.
Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download
It certainly seems to be a rogue project, but I can't understand the meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.

On 10/29/2012 10:30 AM, Jack Krupansky wrote:
> Could any of the committers here confirm whether this is a legitimate
> effort? I mean, how could anything labeled "Apache ABC with XYZ" be an
> external project and be sanctioned/licensed by Apache? In fact, the linked
> web page doesn't even acknowledge the ownership of the Apache trademarks or
> ASL. And the term "Realtime NRT" is nonsensical. Even worse: "Realtime NRT
> makes available a near realtime view." Equally nonsensical. Who knows,
> maybe it is legit, but it sure comes across as a scam/spam.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Nagendra Nagarajayya
> Sent: Monday, October 29, 2012 10:06 AM
> To: solr-user@lucene.apache.org
> Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and
> Realtime NRT available for download
>
> Hi!
>
> I am very excited to announce the availability of Apache Solr 4.0 with
> RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high performance
> and more granular NRT implementation as compared to soft commit. The update
> performance is about 70,000 documents / sec* (almost 1.5-2x performance
> improvement over soft commit). You can also scale up to 2 billion
> documents* in a single core, and query a half-billion-document index in
> ms**.
>
> Realtime NRT is different from realtime-get. realtime-get does not have
> search capability and is a lookup by id. Realtime NRT allows full search;
> see here for more info: http://solr-ra.tgels.org/realtime-nrt.jsp
>
> Realtime NRT has been contributed back to Solr, see JIRA:
> https://issues.apache.org/jira/browse/SOLR-3816
>
> RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
> boolean/dismax/boost queries and is compatible with the new Lucene 4.0 api.
>
> You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4 and
> Realtime NRT performance from here:
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>
> You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Note: 1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> * performance is a real use case of Apache Solr with RankingAlgorithm as
> seen at a user installation
> ** performance seen when using the age feature
Re: Cloud terminology clarification
I agree it needs updating, and I've always gotten confused at some point by the use (misuse) of terms. For example, the term 'node' is thrown around a lot too. What is it??! Hehe.

On Sat, 2012-09-08 at 22:26 -0700, JesseBuesking wrote:
> It's been a while since the terminology at
> http://wiki.apache.org/solr/SolrTerminology has been updated, so I'm
> wondering how these terms apply to solr cloud setups.
>
> My take on what the terms mean:
> Collection: Basically the highest-level container that bundles together the
> other pieces for servicing a particular search setup
> Core: An individual solr instance (represents entire indexes)
> Shard: A portion of a core (represents a subset of an index)
>
> Therefore:
> - increasing the number of shards allows for indexing more documents (aka
> scaling the amount of data that can be indexed)
> - increasing the number of cores increases the potential throughput of
> requests (aka cores mirror each other, allowing you to distribute requests
> to multiple servers)
>
> Does this sound right? If so, then my follow-up question would be: does the
> following directory structure look right/standard?
>
> .../solr # = solr home
> .../solr/collection-01
> .../solr/collection-01/core-01
> .../solr/collection-01/core-02
>
> And if this is right, I'm on a roll :D My next question would then be:
> Given we're using zookeeper (separate machine), do we need 1 conf folder at
> collection-01's level? Or do we need 1 conf folder per core?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Cloud-terminology-clarification-tp4006407.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: Map/Reduce directly against solr4 index.
Of course you can do it, but the question is whether this will produce the performance results you expect. I've seen talk about this in other forums, so you might find some prior work here. Solr and HDFS serve somewhat different purposes. The key issue would be whether your map and reduce code overloads the Solr endpoint. Even using SolrCloud, I believe all requests will have to go through a single URL (to be routed), so if you have thousands of map/reduce jobs all running simultaneously, the question is whether your Solr is architected to handle that amount of throughput.

On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:
> Is it possible to run map reduce jobs directly on Solr4?
>
> I'm asking this because I want to use Solr4 as the primary storage engine,
> and I want to be able to run near real time analytics against it as well,
> rather than export solr4 data out to a hadoop cluster.
Re: Map/Reduce directly against solr4 index.
You raise an interesting possibility. A map/reduce solr handler over solrcloud...

On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:
> I think the performance should be close to Hadoop running on HDFS, if
> somehow a Hadoop job can directly read the Solr index file while executing
> the job on the local solr node. Kinda like how HBase and Cassandra
> integrate with Hadoop.
>
> Plus, we can run the map reduce job on a standby Solr4 cluster. This way,
> the documents in Solr will be our primary source of truth. And we have the
> ability to run near real time search queries and analytics on it. No need
> to export data around. Solr4 is becoming a very interesting solution to
> many web scale problems. Just missing the map/reduce component. :)
>
> On Thu, Jul 26, 2012 at 3:01 PM, Darren Govoni dar...@ontrenet.com wrote:
>> Of course you can do it, but the question is whether this will produce the
>> performance results you expect.
Re: [Announce] Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download
What exactly is "Realtime NRT" (Near Real Time)?

On Sun, 2012-07-22 at 14:07 -0700, Nagendra Nagarajayya wrote:
> Hi!
>
> I am very excited to announce the availability of Solr 4.0-ALPHA with
> RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT implementation
> now supports both RankingAlgorithm and Lucene. Realtime NRT is a high
> performance and more granular NRT implementation as compared to soft
> commit. The update performance is about 70,000 documents / sec*. You can
> also scale up to 2 billion documents* in a single core, and query a
> half-billion-document index in ms**.
>
> RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
> boolean queries and is compatible with the new Lucene 4.0-ALPHA api.
>
> You can get more information about Solr 4.0-ALPHA with RankingAlgorithm
> 1.4.4 Realtime performance from here:
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
>
> You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> * performance seen at a user installation of Solr 4.0 with RankingAlgorithm 1.4.3
> ** performance seen when using the age feature
Re: Facet on all the dynamic fields with *_s feature
You'll have to query the index for the field names, sift out the *_s ones, and cache them or something.

On Mon, 2012-07-16 at 16:52 +0530, Rajani Maski wrote:
> Yes, this feature would solve the below problem very neatly.
>
> All, is there any approach to achieve this for now?
>
> --Rajani
>
> On Sun, Jul 15, 2012 at 6:02 PM, Jack Krupansky j...@basetechnology.com wrote:
>> The answer appears to be "No", but it's good to hear people express an
>> interest in proposed features.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Rajani Maski
>> Sent: Sunday, July 15, 2012 12:02 AM
>> To: solr-user@lucene.apache.org
>> Subject: Facet on all the dynamic fields with *_s feature
>>
>> Hi All,
>>
>> Is this issue fixed in Solr 3.6 or 4.0: faceting on all dynamic fields
>> with facet.field=*_s
>>
>> Link: https://issues.apache.org/jira/browse/SOLR-247
>>
>> If it is not fixed, any suggestion on how do I achieve this? My
>> requirement is just the same as this one:
>> http://lucene.472066.n3.nabble.com/Dynamic-facet-field-tc2979407.html#none
>>
>> Regards,
>> Rajani
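A minimal sketch of the "sift out the *_s ones" workaround: it assumes the field names have already been fetched from the index (e.g. from Solr's Luke request handler under /admin/luke), and the sample field names here are invented.

```python
# Sketch: filter dynamic *_s fields out of an index's field list and build
# the facet.field parameters for a /select request. The field list would
# come from Solr (e.g. the Luke handler); here it is hard-coded.
from urllib.parse import urlencode

def dynamic_facet_params(field_names, suffix="_s"):
    """Pick out the dynamic string fields and build the facet query string
    you would append to a search request."""
    facet_fields = [f for f in field_names if f.endswith(suffix)]
    params = [("facet", "true")] + [("facet.field", f) for f in facet_fields]
    return facet_fields, urlencode(params)

# Hypothetical field list as it might come back from the index:
fields = ["id", "title", "author_s", "genre_s", "price"]
facet_fields, qs = dynamic_facet_params(fields)
print(facet_fields)  # ['author_s', 'genre_s']
print(qs)            # facet=true&facet.field=author_s&facet.field=genre_s
```

Caching the filtered list (as suggested above) avoids re-fetching the schema on every request; it just has to be refreshed when new dynamic fields appear.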
Re: Solr Faceting
I don't think it comes at any added cost for Solr to return that facet, so you can filter it out in your business logic.

On Sat, 2012-07-07 at 15:18 +0530, Shanu Jha wrote:
> Hi,
>
> I am generating a facet for a field which has "NA" as one of its values,
> and I want Solr to not create a facet for (or to ignore) this "NA" value.
> Is there any way in Solr to do that?
>
> Thanks
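Filtering the unwanted bucket client-side, as suggested, might look like the sketch below. It assumes the flat `[term1, count1, term2, count2, ...]` list shape that Solr's JSON response uses for facet_fields; adapt if your response writer returns a different structure.

```python
# Sketch: drop an unwanted facet bucket (e.g. "NA") from a Solr facet_fields
# list after the response comes back, instead of suppressing it server-side.
def drop_facet_value(facet_counts, excluded="NA"):
    """facet_counts mirrors Solr's flat [term, count, term, count, ...] shape;
    returns the same shape with the excluded term's pair removed."""
    pairs = zip(facet_counts[::2], facet_counts[1::2])
    kept = [(term, count) for term, count in pairs if term != excluded]
    return [x for pair in kept for x in pair]

print(drop_facet_value(["electronics", 14, "NA", 3, "books", 9]))
# ['electronics', 14, 'books', 9]
```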
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I don't recall anyone being able to get acceptable performance with a single index that large with Solr/Lucene. The conventional wisdom is that parallel searching across cores (or shards in SolrCloud) is the best way to handle index sizes in the billions. So it's of great interest how you did it. Anyone else gotten an index (or indexes) with billions of documents to perform well? I'm greatly interested in how.

On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote:
> It is a single node. I am trying to find out if the performance can be
> referenced. Regarding information on Solr with RankingAlgorithm, you can
> find all the information here: http://solr-ra.tgels.org
>
> On RankingAlgorithm: http://rankingalgorithm.tgels.org
>
> Regards,
> - NN
>
> On 5/27/2012 4:50 PM, Li Li wrote:
>> yes, I am also interested in good performance with 2 billion docs. how
>> many search nodes do you use? what's the average response time and qps?
>> another question: where can I find related papers or resources on your
>> algorithm which explain the algorithm in detail? why is it better than
>> google's (better than lucene is not very interesting, because lucene was
>> not originally designed to provide search functions like google)?
>>
>> On Mon, May 28, 2012 at 1:06 AM, Darren Govoni dar...@ontrenet.com wrote:
>>> I think people on this list would be more interested in your approach to
>>> scaling 2 billion documents than modifying solr/lucene scoring (which is
>>> already top notch). So given that, can you share any references or
>>> otherwise substantiate good performance with 2 billion documents?
>>> Thanks.
>>>
>>> On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
>>>> Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
>>>> docs. With RankingAlgorithm 1.4.3, using the age=latestdocs=number
>>>> feature, you can retrieve the NRT inserted documents in milliseconds
>>>> from such a huge index, improving query and faceting performance and
>>>> using very little resources ... Currently, RankingAlgorithm 1.4.3 is
>>>> only available with Solr 4.0, and the NRT insert performance with Solr
>>>> 4.0 is about 70,000 docs / sec. RankingAlgorithm 1.4.3 should become
>>>> available with Solr 3.6 soon.
>>>>
>>>> Regards,
>>>> Nagendra Nagarajayya
>>>> http://solr-ra.tgels.org
>>>> http://rankingalgorithm.tgels.org
>>>>
>>>> On 5/27/2012 7:32 AM, Darren Govoni wrote:
>>>>> Hi,
>>>>>
>>>>> Have you tested this with a billion documents?
>>>>>
>>>>> Darren
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
Hi,

Have you tested this with a billion documents?

Darren

On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
> Hi!
>
> I am very excited to announce the availability of Solr 3.6 with
> RankingAlgorithm 1.4.2. This NRT support now works with both
> RankingAlgorithm and Lucene. The insert/update performance should be about
> 5000 docs in about 490 ms with the MbArtists index.
>
> RankingAlgorithm 1.4.2 has multiple algorithms, improved performance over
> the earlier releases, supports the entire Lucene Query Syntax, ± and/or
> boolean queries, and can scale to more than a billion documents.
>
> You can get more information about NRT performance from here:
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
> You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
> http://solr-ra.tgels.org
>
> Please download and give the new version a try.
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> ps. The MbArtists index is the example index used in the Solr 1.4
> Enterprise book.
Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support
I think people on this list would be more interested in your approach to scaling 2 billion documents than in modifying solr/lucene scoring (which is already top notch). So given that, can you share any references or otherwise substantiate good performance with 2 billion documents? Thanks.

On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
> Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion
> docs. With RankingAlgorithm 1.4.3, using the age=latestdocs=number feature,
> you can retrieve the NRT inserted documents in milliseconds from such a
> huge index, improving query and faceting performance and using very little
> resources ... Currently, RankingAlgorithm 1.4.3 is only available with Solr
> 4.0, and the NRT insert performance with Solr 4.0 is about 70,000 docs /
> sec. RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.
>
> Regards,
> Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
> On 5/27/2012 7:32 AM, Darren Govoni wrote:
>> Hi,
>>
>> Have you tested this with a billion documents?
>>
>> Darren
SolrCloud war context name?
Hi,

I am running my SolrCloud nodes in an app server deployed into the context path 'solr', and zookeeper sees all of them. I want to deploy a second SolrCloud war into the same app server (thus the same IP:port) in a different context like 'solrrep', with the same config (cloned).

Will this work? Or does zookeeper (or the SolrCloud leader) require all connected Solr shards to have a context URL of ip:port/solr? Or will the correct URL be registered from the replica shard?

thanks!
Re: SolrCloud war context name?
It's not really clear from the wiki how to use cores as shard replicas within the same Solr server. In my mind, having a separate JVM/Solr node acting as a replica makes sense, because the replication traffic will be on a different channel in a different VM and won't interfere with search/indexing traffic on the primary shards. Or am I missing something easy about using cores with SolrCloud? It was mentioned on the list recently that managing cores with SolrCloud isn't really the best practice for it.

On Sat, 2012-05-26 at 16:12 -0300, Marcelo Carvalho Fernandes wrote:
> Why not use multicore?
>
> Marcelo Carvalho Fernandes
> +55 21 8272-7970
>
> On Sat, May 26, 2012 at 12:56 PM, Darren Govoni ontre...@ontrenet.com wrote:
>> Hi,
>>
>> I am running my solrcloud nodes in an app server deployed into the context
>> path 'solr' and zookeeper sees all of them.
RE: Re: SolrCloud: how to index documents into a specific core and how to search against that core?
I'm curious what the solrcloud experts say, but my suggestion is to try not to over-engineering the search architecture on solrcloud. For example, what is the benefit of managing the what cores are indexed and searched? Having to know those details, in my mind, works against the automation in solrcore, but maybe there's a good reason you want to do it this way. brbrbr--- Original Message --- On 5/22/2012 07:35 AM Yandong Yao wrote:brHi Darren, br brThanks very much for your reply. br brThe reason I want to control core indexing/searching is that I want to bruse one core to store one customer's data (all customer share same brconfig): such as customer 1 use coreForCustomer1 and customer 2 bruse coreForCustomer2. br brIs there any better way than using different core for different customer? br brAnother way maybe use different collection for different customer, while brnot sure how many collections solr cloud could support. Which way is better brin terms of flexibility/scalability? (suppose there are tens of thousands brcustomers). br brRegards, brYandong br br2012/5/22 Darren Govoni dar...@ontrenet.com br br Why do you want to control what gets indexed into a core and then br knowing what core to search? That's the kind of knowing that SolrCloud br solves. In SolrCloud, it handles the distribution of documents across br shards and retrieves them regardless of which node is searched from. br That is the point of cloud, you don't know the details of where br exactly documents are being managed (i.e. they are cloudy). It can br change and re-balance from time to time. SolrCloud performs the br distributed search for you, therefore when you try to search a node/core br with no documents, all the results from the cloud are retrieved br regardless. This is considered A Good Thing. 
br br It requires a change in thinking about indexing and searching br br On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote: br Hi Guys, br br I use following command to start solr cloud according to solr cloud wiki. br br yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf br -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar br yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 br -jar br start.jar br br Then I have created several cores using CoreAdmin API ( br http://localhost:8983/solr/admin/cores?action=CREATEname= br coreNamecollection=collection1), and clusterstate.json show following br topology: br br br collection1: br -- shard1: br-- collection1 br-- CoreForCustomer1 br-- CoreForCustomer3 br-- CoreForCustomer5 br -- shard2: br-- collection1 br-- CoreForCustomer2 br-- CoreForCustomer4 br br br 1) Index: br br Using following command to index mem.xml file in exampledocs directory. br br yydzero:exampledocs bjcoe$ java -Durl= br http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml br SimplePostTool: version 1.4 br SimplePostTool: POSTing files to br http://localhost:8983/solr/coreForCustomer3/update.. br SimplePostTool: POSTing file mem.xml br SimplePostTool: COMMITting Solr index changes. br br And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3', br 'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2 br core has 0 documents. br br *Question 1:* Is this expected behavior? How do I to index documents br into br a specific core? br br *Question 2*: If SolrCloud don't support this yet, how could I extend it br to support this feature (index document to particular core), where br should i br start, the hashing algorithm? br br *Question 3*: Why the documents are also indexed into 'coreForCustomer1' br and 'coreForCustomer5'? The default replica for documents are 1, right? 
Then I try to index some documents to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

While 'coreForCustomer2' still has 0 documents, the documents in ipod_video are indexed to the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it returns all documents in the whole collection even though this core has no documents at all.

Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the Solr core name as the parameter value, right?

Thanks very much in advance!

Regards, Yandong
Re: SolrCloud: how to index documents into a specific core and how to search against that core?
Why do you want to control what gets indexed into a core and then knowing what core to search? That's the kind of knowing that SolrCloud solves. In SolrCloud, it handles the distribution of documents across shards and retrieves them regardless of which node is searched from. That is the point of cloud: you don't know the details of where exactly documents are being managed (i.e. they are cloudy). It can change and re-balance from time to time. SolrCloud performs the distributed search for you, therefore when you try to search a node/core with no documents, all the results from the cloud are retrieved regardless. This is considered A Good Thing. It requires a change in thinking about indexing and searching.

On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:

Hi Guys,

I use the following commands to start Solr Cloud according to the Solr Cloud wiki:

yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar start.jar

Then I have created several cores using the CoreAdmin API (http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1), and clusterstate.json shows the following topology:

collection1:
-- shard1: collection1, CoreForCustomer1, CoreForCustomer3, CoreForCustomer5
-- shard2: collection1, CoreForCustomer2, CoreForCustomer4

1) Index: using the following command to index the mem.xml file in the exampledocs directory:

yydzero:exampledocs bjcoe$ java -Durl=http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/coreForCustomer3/update..
SimplePostTool: POSTing file mem.xml
SimplePostTool: COMMITting Solr index changes.

And now the Solr Admin UI shows that 'coreForCustomer1', 'coreForCustomer3' and 'coreForCustomer5' have 3 documents each (mem.xml has 3 documents) and the other 2 cores have 0 documents.
*Question 1:* Is this expected behavior? How do I index documents into a specific core?

*Question 2:* If SolrCloud doesn't support this yet, how could I extend it to support this feature (index documents to a particular core)? Where should I start, the hashing algorithm?

*Question 3:* Why are the documents also indexed into 'coreForCustomer1' and 'coreForCustomer5'? The default replica for documents is 1, right?

Then I try to index some documents to 'coreForCustomer2':

$ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar post.jar ipod_video.xml

While 'coreForCustomer2' still has 0 documents, the documents in ipod_video are indexed to the cores for customers 1/3/5.

*Question 4:* Why does this happen?

2) Search: I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to search against 'CoreForCustomer2', while it returns all documents in the whole collection even though this core has no documents at all.

Then I use http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2, and it returns 0 documents.

*Question 5:* So if I want to search against a particular core, I need to use the 'shards' parameter with the Solr core name as the parameter value, right?

Thanks very much in advance!

Regards, Yandong
Re: Distributed search between solrclouds?
The thought here is to distribute a search between two different SolrCloud clusters and get ordered, ranked results between them. Is it possible? On Tue, 2012-05-15 at 18:47 -0400, Darren Govoni wrote: Hi, Would distributed search (the old way, where you provide the Solr host IPs etc.) still work between different SolrClouds? thanks, Darren
Distributed search between solrclouds?
Hi, Would distributed search (the old way, where you provide the Solr host IPs etc.) still work between different SolrClouds? thanks, Darren
Re: Documents With large number of fields
Was there a response to this? On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, my single Solr document could end up containing 300-400 fields. In order to drill down to this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
Re: Documents With large number of fields
I'm also interested in this. Same situation. On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote: Hi, My data model consists of different types of data, and each data type has its own characteristics. If I include the unique characteristics of each type of data, my single Solr document could end up containing 300-400 fields. In order to drill down to this data set I would have to provide faceting on most of these fields so that I can drill down to a very small set of documents. Here are some of the questions: 1) What's the best approach when dealing with documents with a large number of fields? Should I keep a single document with a large number of fields, or split my document into a number of smaller documents where each document would consist of some of the fields? 2) From an operational point of view, what's the drawback of having a single document with a very large number of fields? Can Solr support documents with a large number of fields (say 300 to 400)? Thanks. Regards, Nitin Keswani
SolrCloud indexing question
Hi, I just wanted to make sure I understand how distributed indexing works in SolrCloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks
Re: SolrCloud indexing question
Gotcha. Now does that mean if I have 5 threads all writing to a local shard, will that shard piggyback those index requests onto a SINGLE connection to the leader? Or will they spawn 5 connections from the shard to the leader? I really hope the former; the latter won't scale well. On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote: my understanding is that you can send your updates/deletes to any shard and they will be forwarded to the leader automatically. That being said, your leader will always be the place where the indexing happens, and it is then distributed to the other replicas. On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, I just wanted to make sure I understand how distributed indexing works in SolrCloud. Can I index locally at each shard to avoid throttling a central port? Or does all the indexing have to go through a single shard leader? thanks
Re: Opposite to MoreLikeThis?
You could run MLT for the document in question, then gather all the doc ids in the MLT results and negate those in a subsequent query. Not sure how robust that would be with very large result sets, but something to try. Another approach would be to gather the interesting terms from the document in question and then negate those terms in subsequent queries. Perhaps with many negated terms, Solr will rank results matching few of the negated terms above results matching many of them, simulating a ranked "less like this" effect. On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote: Hi all, Is there a way to implement the opposite of MoreLikeThis (LessLikeThis, I guess :)? The requirement we have is to remove all documents with content like that of a given document id or a text provided by the end-user. In the current index implementation (not using Solr), the user can narrow results by indicating what document(s) are not relevant to him and then request to remove from the search results any document whose content is like that of the selected document(s). Our index has close to 100 million documents and they cover multiple topics that are not related to one another. So, a search for some broad terms may retrieve documents about engineering, agriculture, communications, etc. As the user is trying to discover the relevant documents, he may select an agriculture-related document to exclude it and the documents like it from the result set; same with engineering-like content, etc., until most of the documents are about communications. Of course, some exclusions may actually remove relevant content, but those filters can be removed to go back to the previous set of results. Any ideas from similar implementations or suggestions are welcomed! Thanks, Carlos
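The first approach described above (negate the MLT results in a follow-up query) can be sketched client-side. This is a minimal sketch, not the MLT handler's own API; the field name "id" and the document ids are hypothetical examples.

```python
# Sketch of the "negate the MLT results" approach described above.
# Field name "id" and the doc ids are hypothetical examples.

def less_like_this_fq(mlt_doc_ids):
    """Build a Solr filter query that excludes every document the
    MoreLikeThis handler returned for the selected document."""
    if not mlt_doc_ids:
        return "*:*"  # nothing to exclude
    clauses = " OR ".join('id:"%s"' % i for i in mlt_doc_ids)
    return "*:* -(%s)" % clauses

fq = less_like_this_fq(["doc7", "doc9"])
print(fq)  # *:* -(id:"doc7" OR id:"doc9")
# fq can then be sent as an fq= parameter on the follow-up query
```

Each exclusion round appends another such filter, which matches the "remove these and everything like them" workflow Carlos describes.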
Re: hierarchical faceting?
Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as tokenizer. Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix, but it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
Re: hierarchical faceting?
I don't use any of that stuff in my app, so not sure how it works. I just manage my taxonomy outside of Solr at index time and don't need any special fields or tokenizers. I use a string field type and insert the proper field at index time and query it normally. Nothing special required. On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote: It looks like TextField is the problem. This fixed it:

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I am assuming the text_path fields won't include whitespace characters. ?q=colors:red/pink == Doc2 (Doc1, which has colors = red, isn't included!) Is there a tokenizer that tokenizes the string as one token? I tried to extend Tokenizer myself but it fails:

public class AsIsTokenizer extends Tokenizer {
    @Override
    public boolean incrementToken() throws IOException {
        return true; // or false;
    }
}

On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote: Yah, that's exactly what PathHierarchyTokenizer does.

<fieldType name="text_path" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PathHierarchyTokenizerFactory"/>
  </analyzer>
</fieldType>

I think I have a query time tokenizer that tokenizes at / ?q=colors:red == Doc1, Doc2 ?q=colors:redfoobar == ?q=colors:red/foobarasdfoaijao == Doc1, Doc2 On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote: Put the parent term in all the child documents at index time and then re-issue the facet query when you expand the parent, using the parent's term. Works perfectly. On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote: I have hierarchical colors: <field name="colors" type="text_path" indexed="true" stored="true" multiValued="true"/> text_path is a TextField with PathHierarchyTokenizerFactory as tokenizer.
Given these two documents, Doc1: red Doc2: red/pink I want the result to be the following: ?fq=red == Doc1, Doc2 ?fq=red/pink == Doc2 But, with PathHierarchyTokenizer, Doc1 is included for the query: ?fq=red/pink == Doc1, Doc2 How can I query for hierarchical facets? http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix.. But it looks too cumbersome to me. Is there a simpler way to implement hierarchical facets?
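The behavior the thread is wrestling with follows from what PathHierarchyTokenizer emits at index time: one token per ancestor path. A minimal simulation of that expansion (the function name is mine, not a Solr API):

```python
# Minimal simulation of the tokens PathHierarchyTokenizer emits at
# index time, illustrating why Doc1 ("red") matches a query for
# "red/pink" when the query analyzer splits the path the same way.

def path_hierarchy_tokens(value, delimiter="/"):
    parts = value.split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokens("red"))       # ['red']
print(path_hierarchy_tokens("red/pink"))  # ['red', 'red/pink']
```

If both index and query sides expand "red/pink" into ['red', 'red/pink'], the 'red' token matches Doc1 too; keeping the query-side analyzer as a single-token (whitespace or keyword) tokenizer, as in the fix quoted in the thread, avoids that.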
Re: Monitoring SolrCloud health
Can you be more specific about health? On Sat, 2012-04-14 at 00:03 -0400, Jamie Johnson wrote: How do people currently monitor the health of a solr cluster? Are there any good tools which can show the health across the entire cluster? Is this something which is planned for the new admin user interface?
RE: Realtime /get versus SearchHandler
Yes.

--- Original Message --- On 4/13/2012 06:25 AM Benson Margulies wrote: A discussion over on the dev list led me to expect that the by-id field retrievals in a SolrCloud query would come through the get handler. In fact, I've seen them turn up in my search component in the search handler that is configured with my custom QT. (I have a 'prepare' method that sets ShardParams.QT to my QT to get my processing involved in the first of the two queries.) Did I overthink this?
Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure Solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds. Needless to say, the search index needs to scale to 5 billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index were done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment were flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above.
Lastly, how much hardware (assuming medium sized EC2 instances) would you estimate I'd need with this setup, for regular web data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
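The fuse-mount suggestion above could look roughly like the following sketch. The hostname, port, and mount path are hypothetical examples, and the hadoop-fuse-dfs command comes from the Cloudera package linked in the reply:

```
# Mount HDFS through FUSE (hostname, port and paths are hypothetical)
hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs

# Then point Solr at the mounted directory in solrconfig.xml:
#   <dataDir>/mnt/hdfs/solr/data</dataDir>
```

Note that Lucene's random-access I/O patterns over a FUSE mount were a known performance concern at the time, so this is a sketch to evaluate, not a recommendation.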
RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
SolrCloud or any other tech-specific replication isn't going to 'just work' with Hadoop replication. But with some significant custom coding anything should be possible. Interesting idea.

--- Original Message --- On 4/12/2012 09:21 AM Ali S Kureishy wrote: Thanks Darren. Actually, I would like the system to be homogenous, i.e. use Hadoop-based tools that already provide all the necessary scaling for the Lucene index (in terms of throughput, latency of writes/reads etc). Since SolrCloud adds its own layer of sharding/replication that is outside Hadoop, I feel that using SolrCloud would be redundant, and a step in the opposite direction, which is what I'm trying to avoid in the first place. Or am I mistaken? Thanks, Safdar

On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote: You could use SolrCloud (for the automatic scaling) and just mount a fuse[1] HDFS directory and configure Solr to use that directory for its data. [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote: Hi, I'm trying to set up a large scale *Crawl + Index + Search* infrastructure using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*, crawled + indexed every *4 weeks*, with a search latency of less than 0.5 seconds. Needless to say, the search index needs to scale to 5 billion pages. It is also possible that I might need to store multiple indexes -- one for crawled content, and one for ancillary data that is also very large. Each of these indices would likely require a logically distributed and replicated index. However, I would like for such a system to be homogenous with the Hadoop infrastructure that is already installed on the cluster (for the crawl). In other words, I would much prefer if the replication and distribution of the Solr/Lucene index were done automagically on top of Hadoop/HDFS, instead of using another scalability framework (such as SolrCloud). In addition, it would be ideal if this environment were flexible enough to be dynamically scaled based on the size requirements of the index and the search traffic at the time (i.e. if it is deployed on an Amazon cluster, it should be easy enough to automatically provision additional processing power into the cluster without requiring server re-starts). However, I'm not sure which Solr-based tool in the Hadoop ecosystem would be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra, Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is mature enough and would be the right architectural choice to go along with a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects above. Lastly, how much hardware (assuming medium sized EC2 instances) would you estimate my needing with this setup, for regular web data (HTML text) at this scale? Any architectural guidance would be greatly appreciated. The more details provided, the wider my grin :). Many many thanks in advance. Thanks, Safdar
Re: I've broken delete in SolrCloud and I'm a bit clueless as to how
Hard to say why it's not working for you. Start with a fresh Solr and work forward from there, or back out your configs and plugins until it works again. On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote: In my cloud configuration, if I push

<delete><query>*:*</query></delete>

followed by:

<commit/>

I get no errors, the log looks happy enough, but the documents remain in the index, visible to /query. Here's what seems to be the relevant bit of my solrconfig.xml. My URP only implements processAdd.

<updateRequestProcessorChain name="RNI">
  <!-- some day, add parameters when we have some -->
  <processor class="com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory"/>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<!-- activate RNI processing by adding the RNI URP to the chain for xml updates -->
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">RNI</str>
  </lst>
</requestHandler>
RE: SOLR issue - too many search queries
My first reaction to your question is: why are you running thousands of queries in a loop? Immediately, I think this will not scale well and the design probably needs to be revisited. Second, if you need that many requests, then you need to seriously consider an architecture that supports it. This will require a complex design involving load balancers, multiple servers, replication, etc. People have achieved this with Solr, but it's beyond the scope of Solr itself to provide this, as it's a matter of system architecture. Also, there are limits to the number of app server threads allowed, OS threads allowed, OS sockets, OS file descriptors, etc., all of which need to be understood, designed for, and configured properly.

--- Original Message --- On 4/10/2012 07:51 AM arunssasidhar wrote: We have a PHP web application which is using SOLR for searching. The app is using cURL to connect to the SOLR server, and it runs in a loop with thousands of predefined keywords. That will create thousands of different search queries to SOLR at a given time. My issue is that when a single user is logged into the app, everything works as expected. When more than one user tries to run the app, we get this response from the server: Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address. Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address. Failed... Our assumption is that the SOLR server is unable to handle this many search queries at a given time. If so, what is the solution to overcome this? Is there any setting like keep-alive in SOLR? Any help would be highly appreciated. Thanks, Arun S

-- View this message in context: http://lucene.472066.n3.nabble.com/SOLR-issue-too-many-search-queries-tp3899518p3899518.html Sent from the Solr - User mailing list archive at Nabble.com.
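Independent of the architectural points above, one client-side way to cut the request count is to batch many keywords into a single boolean query instead of one HTTP request per keyword. A sketch; the field name "text" and the batch size are hypothetical, and per-keyword counts would need facet or grouping parameters on top:

```python
# Sketch: collapse thousands of single-keyword requests into a few
# batched boolean queries. Field name "text" and batch size are
# hypothetical examples.

def batched_queries(keywords, batch_size=50, field="text"):
    """Yield one Solr q string per batch of keywords, OR'ed together."""
    for i in range(0, len(keywords), batch_size):
        batch = keywords[i : i + batch_size]
        yield " OR ".join('%s:"%s"' % (field, k) for k in batch)

qs = list(batched_queries(["solr", "lucene", "jetty"], batch_size=2))
print(qs)  # two queries instead of three separate HTTP requests
```

Fewer, larger requests also sidestep the ephemeral-port exhaustion that "Cannot assign requested address" usually indicates.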
RE: Re: Cloud-aware request processing?
"...it is a distributed real-time query scheme..." SolrCloud does this already. It treats all the shards like one big index, and you can query it normally to get subset results from each shard. Why do you have to re-write the query for each shard? Seems unnecessary.

--- Original Message --- On 4/9/2012 08:45 AM Benson Margulies wrote: Jan Høydahl, My problem is intimately connected to Solr. It is not a batch job for Hadoop, it is a distributed real-time query scheme. I hate to add yet another complex framework if a Solr RP can do the job simply. For this problem, I can transform a Solr query into a subset query on each shard, and then let the SolrCloud mechanism. I am well aware of the 'zoo' of alternatives, and I will be evaluating them if I can't get what I want from Solr.

On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com wrote: Hi, Instead of using Solr, you may want to have a look at Hadoop or another framework for distributed computation, see e.g. http://java.dzone.com/articles/comparison-gridcloud-computing -- Jan Høydahl, search solution architect, Cominvent AS - www.cominvent.com, Solr Training - www.solrtraining.com

On 9. apr. 2012, at 13:41, Benson Margulies wrote: I'm working on a prototype of a scheme that uses SolrCloud to, in effect, distribute a computation by running it inside of a request processor. If there are N shards and M operations, I want each node to perform M/N operations. That, of course, implies that I know N. Is that fact available anyplace inside Solr, or do I need to just configure it?
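The M/N split Benson describes can be computed locally on each node once the shard count N and the node's own shard index are known. A sketch; how a node learns those two values (e.g. from cluster state in ZooKeeper, or from configuration as he suggests) is left as an assumption:

```python
# Sketch of splitting M operations across N shards so each node handles
# roughly M/N of them. How a node learns num_shards and its own
# shard_index (e.g. from cluster state) is assumed, not shown.

def my_operations(operations, shard_index, num_shards):
    """Return the slice of operations this shard is responsible for."""
    return [op for i, op in enumerate(operations)
            if i % num_shards == shard_index]

ops = list(range(10))  # M = 10 hypothetical operations, N = 3 shards
print(my_operations(ops, 0, 3))  # [0, 3, 6, 9]
print(my_operations(ops, 1, 3))  # [1, 4, 7]
```

Round-robin assignment keeps the per-node load within one operation of M/N without any coordination beyond agreeing on the operation order.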
Re: How to facet data from a multivalued field?
The field type for that field should be looked at. Try not using a type that tokenizes or stems the field; you want to leave the text as is. I forget the exact setting, but it's documented in there somewhere. On Mon, 2012-04-09 at 13:02 -0700, Thiago wrote: Hello everybody, I've already searched for this topic in the forum, but I didn't find any case like this. I apologize if this topic has already been discussed. I'm having a problem faceting a multivalued field. My field is called series, and it has names of TV series like "the big bang theory", "two and a half men"... In this field I can have a lot of TV series names. For example:

<arr name="series">
  <str>Two and a Half Men</str>
  <str>How I Met Your Mother</str>
  <str>The Big Bang Theory</str>
</arr>

What I want to do is search and count how many documents are related to each series. I'm doing it using facet search on this field, but it's returning each word separately. Like this:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="series">
      <int name="bang">91</int>
      <int name="big">91</int>
      <int name="half">21</int>
      <int name="how">45</int>
      <int name="i">45</int>
      <int name="men">21</int>
      <int name="met">45</int>
      <int name="mother">45</int>
      <int name="theori">91</int>
      <int name="two">21</int>
      <int name="your">45</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>

And what I want is something like:

<lst name="facet_counts">
  <lst name="facet_queries"/>
  <lst name="facet_fields">
    <lst name="series">
      <int name="Two and a Half Men">21</int>
      <int name="How I Met Your Mother">45</int>
      <int name="The Big Bang Theory">91</int>
    </lst>
  </lst>
  <lst name="facet_dates"/>
  <lst name="facet_ranges"/>
</lst>

Is there any possible way to do it with facet search? I don't want the terms, I just want each string including the white spaces. Do I have to change my fieldtype to do this? Thanks to everybody. Thiago -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p3897853.html Sent from the Solr - User mailing list archive at Nabble.com.
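The non-analyzed field suggested in the reply could look something like this in schema.xml. The copyField into a separate analyzed field is an assumption about how one might keep both a facetable and a searchable version; "text_general" is the analyzed type found in the stock example schema:

```xml
<!-- solr.StrField keeps the value as one verbatim term, so facets
     come back as whole titles ("The Big Bang Theory"), not tokens -->
<field name="series" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- optional: keep an analyzed copy for full-text search -->
<field name="series_text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="series" dest="series_text"/>
```

Faceting on the string field (facet.field=series) then returns each full title with its count, which is the output the question asks for.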
No webadmin for trunk?
Hi, Just updated Solr trunk, tried java -jar start.jar, and localhost:8983/solr/admin is not found. Where did it go? thanks.
Re: No webadmin for trunk?
HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
start.jar has no apps in it at all. On Sat, 2012-04-07 at 09:47 -0400, Darren Govoni wrote: HTTP ERROR: 404 Problem accessing /solr. Reason: Not Found Powered by Jetty:// On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote: just go to localhost:8983/solr and you'll see the updated interface. On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote: Hi, Just updated solr trunk and tried the java -jar start.jar and localhost:8983/solr/admin.not found. Where did it go? thanks.
Re: No webadmin for trunk?
Yep. I did all kinds of ant clean, ant dist, ant example, etc. My trunk rev: At revision 1310773. The example start.jar is broken. No webapp inside. :( On Sat, 2012-04-07 at 16:11 +0200, Rafał Kuć wrote: Hello! Did you run 'ant example'?
Re: No webadmin for trunk?
K. There is a solr.war in the webapps directory. But I still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll check out trunk in a few and try with the newest revision.
Re: No webadmin for trunk?
Now it comes up. Not sure why it was acting weird. Will continue to look at it. On Sat, 2012-04-07 at 10:23 -0400, Darren Govoni wrote: K. There is a solr.war in the webapps directory. But I still get the 404. On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote: Hello! start.jar shouldn't contain any webapp. If you look at the 'example' directory, you'll notice that there is a 'webapps' directory which should contain the solr.war file. Btw. revision 1307647 works without a problem. I'll check out trunk in a few and try with the newest revision.
Re: upgrade 3.5 to 4.0
In my opinion, it's never a good idea to overwrite files of a previous version with a new version. The easiest thing would be to just deploy the Solr war file into Tomcat and let Tomcat manage the webapp, files, etc. On Sat, 2012-04-07 at 22:39 -0400, Dan Foley wrote: I have downloaded the nightly snapshot of v4.0 and would like to install it to my Tomcat install of Solr 3.5. Can I simply overwrite the current files, or is there a correct method for doing so? Please advise. Thanks
Re: Does any one know when Solr 4.0 will be released.
No one knows. But if you ask the devs, they will say 'when it's done'. One clue might be to monitor the bugs/issues scheduled for 4.0; when they are all resolved, then it's ready. On Wed, 2012-04-04 at 09:41 -0700, srinivas konchada wrote: Hello everyone, Does anyone know when Solr 4.0 will be released? There is a specific feature that exists in 4.0 which we want to take advantage of. The problem is we cannot deploy something into production from trunk; we need to use an official release. Thanks, Srinivas Konchada
Re: Duplicates in Facets
Try using Luke to look at your index and see if there are multiple similar TFVs. You can browse them easily in Luke. On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote: I am currently indexing some information and am wondering why I am getting duplicates in facets. From what I can tell they are the same, but is there any case that could cause this that I may not be thinking of? Could this be some non-printable character making its way into the index? Sample output from Luke:

<lst name="fields">
  <lst name="organization_umvs">
    <str name="type">string</str>
    <str name="schema">I--M---OFl</str>
    <str name="dynamicBase">*_umvs</str>
    <str name="index">(unstored field)</str>
    <int name="docs">332</int>
    <int name="distinct">-1</int>
    <lst name="topTerms">
      <int name="ORGANIZATION 1">328</int>
      <int name="ORGANIZATION 2">124</int>
      <int name="ORGANIZATION 2">36</int>
      <int name="ORGANIZATION 2">20</int>
      <int name="ORGANIZATION 3">4</int>
    </lst>
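The non-printable-character hypothesis raised above is easy to check client-side before indexing. A sketch; the values are hypothetical examples of strings that render identically but index as distinct terms:

```python
# Sketch: facet values that look identical can differ by invisible
# characters (trailing space, non-breaking space, ...), which shows up
# as "duplicate" facet entries. Values are hypothetical examples.

raw_values = ["ORGANIZATION 2", "ORGANIZATION 2 ", "ORGANIZATION\u00a02"]

print(len(set(raw_values)))  # 3 distinct index terms, one facet row each

# Normalizing whitespace before indexing collapses them to one term:
normalized = {" ".join(v.split()) for v in raw_values}
print(len(normalized))       # 1
```

Running the raw field values through such a normalization step (or inspecting the exact bytes of each term in Luke) distinguishes true duplicates from invisible-character variants.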
Custom scoring question
Hi, I have a situation where I want to re-score document relevance. Let's say I have two fields: text: "The quick brown fox jumped over the white fence." terms: "fox fence" Now my queries come in as terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field 'terms' within field 'text', which is a per-document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
I'm going to try index-time per-field boosting, do the boost computation at index time, and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation where I want to re-score document relevance. Let's say I have two fields: text: "The quick brown fox jumped over the white fence." terms: "fox fence" Now my queries come in as terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field 'terms' within field 'text', which is a per-document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
Re: Custom scoring question
Yeah, I guess that would work. I wasn't sure if it would change relative to other documents. But if it were to be combined with other fields, that approach may not work because the calculation wouldn't include the scoring for other parts of the query. So then you have the dynamic score and what to do with it. On Thu, 2012-03-29 at 16:29 -0300, Tomás Fernández Löbbe wrote: Can't you simply calculate that at index time and assign the result to a field, then sort by that field. On Thu, Mar 29, 2012 at 12:07 PM, Darren Govoni dar...@ontrenet.com wrote: I'm going to try index time per-field boosting and do the boost computation at index time and see if that helps. On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote: Hi, I have a situation I want to re-score document relevance. Let's say I have two fields: text: The quick brown fox jumped over the white fence. terms: fox fence Now my queries come in as: terms:[* TO *] and Solr scores them on that field. What I want is to rank them according to the distribution of field terms within field text. Which is a per document calculation. Can this be done with any kind of dismax? I'm not searching for known terms at query time. If not, what is the best way to implement a custom scoring handler to perform this calculation and re-score/sort the results? thanks for any tips!!!
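Tomás's index-time suggestion from this thread can be sketched as follows. This is only an illustration: the field names (text, terms, terms_score) follow the example in the thread, and the overlap metric is one hypothetical choice of "distribution of field terms within field text," not anything Solr prescribes.

```python
def terms_distribution_score(text, terms):
    """Fraction of tokens in `text` that appear in the `terms` field.

    One illustrative per-document metric; any function of the two
    fields could be precomputed at index time the same way.
    """
    text_tokens = text.lower().split()
    term_set = {t.lower() for t in terms.split()}
    if not text_tokens:
        return 0.0
    hits = sum(1 for tok in text_tokens if tok in term_set)
    return hits / len(text_tokens)

# Compute the score before indexing and store it in a sortable field;
# at query time, sort=terms_score desc replaces the relevance ranking.
doc = {
    "id": "doc1",
    "text": "The quick brown fox jumped over the white fence",
    "terms": "fox fence",
}
doc["terms_score"] = terms_distribution_score(doc["text"], doc["terms"])
```

As noted in the thread, the limitation is that this static value cannot react to the rest of the query; combining it with other scored clauses would need a function query or custom scorer instead.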
MLT and solrcloud?
Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include the MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren
Re: MLT and solrcloud?
Ok, I'll do what I can to help! As always, appreciate the hard work Mark. On Thu, 2012-03-22 at 17:31 -0400, Mark Miller wrote: On Mar 22, 2012, at 5:22 PM, Darren Govoni wrote: Hi, It was mentioned before that SolrCloud has all the capability of regular Solr (including handlers) with the exception of the MLT handler. As this is a key capability for Solr, is there work planned to include the MLT in SolrCloud? If so, when? Our efforts greatly depend on it. As such, I'm happy to help any way possible. thanks, Darren Usually no real timetables here :) Depends on who jumps in when. Some work has already gone on for this here: https://issues.apache.org/jira/browse/SOLR-788 You might just try and jump start that issue again? As I get a free moment or two, I'm happy to help commit a solution. - Mark Miller lucidimagination.com
RE: Re: maxClauseCount Exception
True, but how can you find documents containing that field without expanding 1000 clauses? --- Original Message --- On 3/19/2012 07:24 AM Erick Erickson wrote: "So all I want to do is a simple all docs with something in this field, and to highlight the field" But that doesn't really make sense to do at the Solr/Lucene level. All you're saying is that you want that field highlighted. Wouldn't it be much easier to just do this at the app level whenever your field had anything returned in it? Best Erick On Sat, Mar 17, 2012 at 8:07 PM, Darren Govoni dar...@ontrenet.com wrote: Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different numbers of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms.
(you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting. : params={hl=truehl.snippets=4hl.simple.pre=b/bfl=*,scorehl.mergeContiguous=truehl.usePhraseHighlighter=truehl.requireFieldMatch=trueechoParams=allhl.fl=text_tq={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)rows=20start=0wt=javabinversion=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery : $TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) ... : at : org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at : org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
Re: Inconsistent Results with ZooKeeper Ensemble and Four SOLR Cloud Nodes
I think he's asking if all the nodes (same machine or not) return a response. Presumably you have different ports for each node since they are on the same machine. On Sun, 2012-03-18 at 14:44 -0400, Matthew Parker wrote: The cluster is running on one machine. On Sun, Mar 18, 2012 at 2:07 PM, Mark Miller markrmil...@gmail.com wrote: From every node in your cluster you can hit http://MACHINE1:8084/solr in your browser and get a response? On Mar 18, 2012, at 1:46 PM, Matthew Parker wrote: My cloud instance finally tried to sync. It looks like it's having connection issues, but I can bring the SOLR instance up in the browser so I'm not sure why it cannot connect to it. I got the following condensed log output: org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect org.apache.commons.httpclient.HttpMethodDirector executeWithRetry I/O exception (java.net.ConnectException) caught when processing request: Connection refused: connect Retrying request shard update error StdNode: http://MACHINE1:8084/solr/:org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java: 483) .. .. .. Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.connect0(Native Method) .. .. .. try and ask http://MACHINE1:8084/solr to recover Could not tell a replica to recover org.apache.solr.client.solrj.SolrServerException: http://MACHINE1:8084/solr at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483) ... ... ... 
Caused by: java.net.ConnectException: Connection refused: connect at java.net.DualStackPlainSocketImpl.waitForConnect(Native method) .. .. .. On Sat, Mar 17, 2012 at 10:10 PM, Mark Miller markrmil...@gmail.com wrote: Nodes talk to ZooKeeper as well as to each other. You can see the addresses they are trying to use to communicate with each other in the 'cloud' view of the Solr Admin UI. Sometimes you have to override these, as the detected default may not be an address that other nodes can reach. As a limited example: for some reason my mac cannot talk to my linux box with its default detected host address of halfmetal:8983/solr - but the mac can reach my linux box if I use halfmetal.Local - so I have to override the published address of my linux box using the host attribute if I want to setup a cluster between my macbook and linux box. Each node talks to ZooKeeper to learn about the other nodes, including their addresses. Recovery is then done node to node using the appropriate addresses. - Mark Miller lucidimagination.com On Mar 16, 2012, at 3:00 PM, Matthew Parker wrote: I'm still having issues replicating in my work environment. Can anyone explain how the replication mechanism works? Is it communicating across ports or through zookeeper to manage the process? On Thu, Mar 8, 2012 at 10:57 PM, Matthew Parker mpar...@apogeeintegration.com wrote: All, I recreated the cluster on my machine at home (Windows 7, Java 1.6.0.23, apache-solr-4.0-2012-02-29_09-07-30), sent some documents through Manifold using its crawler, and it looks like it's replicating fine once the documents are committed. This must be related to my environment somehow. Thanks for your help.
Regards, Matt On Fri, Mar 2, 2012 at 9:06 AM, Erick Erickson erickerick...@gmail.com wrote: Matt: Just for paranoia's sake, when I was playing around with this (the _version_ thing was one of my problems too) I removed the entire data directory as well as the zoo_data directory between experiments (and recreated just the data dir). This included various index.2012 files and the tlog directory on the theory that *maybe* there was some confusion happening on startup with an already-wonky index. If you have the energy and tried that it might be helpful information, but it may also be a total red-herring. FWIW Erick On Thu, Mar 1, 2012 at 8:28 PM, Mark Miller markrmil...@gmail.com wrote: I'm assuming the windows configuration looked correct? Yeah, so far I can not spot any smoking gun...I'm confounded at the moment. I'll re-read through everything once more... - Mark
Re: maxClauseCount Exception
Thanks for the tip Hoss. I notice that it appears sometimes and was varying because my index runs would sometimes have different numbers of docs, etc. So all I want to do is a simple all docs with something in this field, and to highlight the field. Is the query expansion to all possible terms in the index really necessary? I could have 100's of thousands of possible terms. Why should they all become explicit query elements? Seems overkill and underperformant. Is there another way with Lucene or not really? On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote: : I am suddenly getting a maxClauseCount exception for no reason. I am : using Solr 3.5. I have only 206 documents in my index. Unless things have changed, the reason you are seeing this is because _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting it into a giant boolean query of all the terms in that field -- so even if you only have 206 docs, if you have more than 206 values in that field in your index, you're going to go over 1024 terms. (you don't get this problem in a basic query, because it doesn't need to enumerate all the terms, it rewrites it to a ConstantScoreQuery) what you most likely want to do, is move some of those clauses like type_s:[*+TO+*] and usergroup_sm:admin out of your main q query and into fq filters ... so they can be cached independently, won't contribute to scoring (just matching) and won't be used in highlighting.
: params={hl=truehl.snippets=4hl.simple.pre=b/bfl=*,scorehl.mergeContiguous=truehl.usePhraseHighlighter=truehl.requireFieldMatch=trueechoParams=allhl.fl=text_tq={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)rows=20start=0wt=javabinversion=2} hits=204 status=500 QTime=166 |#] : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| : org.apache.solr.servlet.SolrDispatchFilter| : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery : $TooManyClauses: maxClauseCount is set to 1024 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) ... : at : org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) : at : org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) -Hoss
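Hoss's advice can be applied mechanically to the request shown in that log. A sketch only: the parameter names (q, fq, hl, hl.fl) are standard Solr request parameters, but the exact clause split below is my reading of his suggestion, not something from the thread verbatim.

```python
# Before: everything lives in one q, so the open-ended range
# type_s:[* TO *] gets rewritten for highlighting into a giant
# BooleanQuery over every term in the field -> TooManyClauses.
before = {
    "q": "(kind_s:doc OR kind_s:xml) AND (type_s:[* TO *]) AND (usergroup_sm:admin)",
    "hl": "true",
    "hl.fl": "text_t",
}

# After: only the clauses you actually want scored (and highlighted)
# stay in q; match-only clauses move to fq, where each filter is
# cached independently, contributes nothing to the score, and is
# never handed to the highlighter.
after = {
    "q": "kind_s:doc OR kind_s:xml",
    "fq": ["type_s:[* TO *]", "usergroup_sm:admin"],
    "hl": "true",
    "hl.fl": "text_t",
}
```

Repeating the fq key as a list mirrors how Solr accepts multiple fq parameters on one request.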
RE: Solr 4.0 and production environments
As a rule of thumb, many will say not to go to production with a pre-release baseline. So until Solr4 goes final and stable, it's best not to assume too much about it. Second suggestion is to properly stage new technologies in your product such that they go through their own validation. And so to that end, jump right in and start using Solr4 and see for yourself! It's a great technology. --- Original Message --- On 3/7/2012 11:47 AM Dirceu Vieira wrote: Hi All, Has anybody started using Solr 4.0 in production environments? Is it stable enough? I'm planning to create a proof of concept using solr 4.0, we have some projects that will gain a lot with features such as near real time search, joins and others, that are available only on version 4. Is it too risky to think of using it right now? What are your thoughts and experiences with that? Best regards, -- Dirceu Vieira Júnior --- +47 9753 2473 dirceuvjr.blogspot.com twitter.com/dirceuvjr
Re: Building a resilient cluster
What I think was mentioned on this a bit ago is that the index stops working if one of the nodes goes down unless it's a replica. You have 2 nodes running with numShards=2? Thus if one goes down the entire index is inoperable. In the future I'm hoping this changes such that the index cluster continues to operate but will lack results from the downed node. Maybe this has changed in recent trunk updates though. Not sure. On Mon, 2012-03-05 at 20:49 -0800, Ranjan Bagchi wrote: Hi Mark, So I tried this: started up one instance w/ zookeeper, and started a second instance defining a shard name in solr.xml -- it worked, searching would search both indices, and looking at the zookeeper ui, I'd see the second shard. However, when I brought the second server down -- the first one stopped working: it didn't kick the second shard out of the cluster. Any way to do this? Thanks, Ranjan From: Mark Miller markrmil...@gmail.com To: solr-user@lucene.apache.org Cc: Date: Wed, 29 Feb 2012 22:57:26 -0500 Subject: Re: Building a resilient cluster Doh! Sorry - this was broken - I need to fix the doc or add it back. The shard id is actually set in solr.xml since it's per core - the sys prop was a sugar option we had set up. So either add 'shard' to the core in solr.xml, or to make it work like it does in the doc, do: core name=collection1 shard=${shard:} instanceDir=. / That sets shard to the 'shard' system property if it's set, or as a default, act as if it wasn't set. I've been working with custom shard ids mainly through solrj, so I hadn't noticed this. - Mark On Wed, Feb 29, 2012 at 10:36 AM, Ranjan Bagchi ranjan.bag...@gmail.com wrote: Hi, At this point I'm ok with one zk instance being a point of failure, I just want to create sharded solr instances, bring them into the cluster, and be able to shut them down without bringing down the whole cluster.
According to the wiki page, I should be able to bring up a new shard by using shardId [-D shardId], but when I did that, the logs showed it replicating an existing shard. Ranjan Andre Bois-Crettez wrote: You have to run ZK on at least 3 different machines for fault tolerance (a ZK ensemble). http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble Ranjan Bagchi wrote: Hi, I'm interested in setting up a solr cluster where each machine [at least initially] hosts a separate shard of a big index [too big to sit on the machine]. I'm able to put a cloud together by telling it that I have (to start out with) 4 nodes, and then starting up nodes on 3 machines pointing at the zkInstance. I'm able to load my sharded data onto each machine individually and it seems to work. My concern is that it's not fault tolerant: if one of the non-zookeeper machines falls over, the whole cluster won't work. Also, I can't create a shard with more data, and have it work within the existing cloud. I tried using -DshardId=shard5 [on an existing 4-shard cluster], but it just started replicating, which doesn't seem right. Are there ways around this? Thanks, Ranjan Bagchi -- - Mark http://www.lucidimagination.com
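Mark's solr.xml fix from earlier in this thread, reconstructed with its markup (the attribute values are exactly as quoted in his reply):

```xml
<!-- In solr.xml: set shard per core. ${shard:} takes the 'shard'
     system property when one is passed on the command line, and
     behaves as if unset otherwise. -->
<core name="collection1" shard="${shard:}" instanceDir="." />
```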
Re: [SolrCloud] Slow indexing
A question relating to this. If you are running a single ZK node, but say 10 other nodes and then parallel index on each of those nodes, will the ZK be hit by all 10 indexing nodes constantly? i.e. very chatty? If one of those 10 indexing nodes goes down or falls out of sync and comes back, does ZK block the state of indexing until that single node catches back up? On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote: everything stalls after it lists all segment files and that a ZK state change has occurred. Can you get a stack trace here? I'll try to respond to more tomorrow. What version of trunk are you using? We have been making fixes and improvements all the time, so need to get a frame of reference. When a client node cannot talk to zookeeper, because it may not know certain things it should (what if a leader changes?), it must reject updates (searches will still work). Why can't the node talk to zookeeper? Perhaps the load is so high on the server, it cannot respond to zk within the session timeout? I really don't know yet. When this happens though, it forces a recovery when/if the node can reconnect to zookeeper. We have not yet started on optimizing bulk indexing - currently an update is added locally *before* sending updates in parallel to each replica. Then we wait for each response before responding to the client. We plan to offer more optimizations and options around this. Feedback will be useful in making some of these improvements. - Mark Miller lucidimagination.com
Re: Trunk build errors
I updated yesterday and did an ant clean, ant test. I will try a clean pull next. I'm on linux. Perhaps an ant version issue? There was recently some work done to get better about checking on licenses, when did you last get trunk? About 9 days ago was the last go-round. And did you do an 'ant clean'? It works on my machine with a fresh pull this morning. Best Erick On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com wrote: Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
maxClauseCount error
Hi, I am suddenly getting a maxClauseCount error and don't know why. I am using Solr 3.5.
maxClauseCount Exception
Hi, I am suddenly getting a maxClauseCount exception for no reason. I am using Solr 3.5. I have only 206 documents in my index. Any ideas? This is weird. QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl, hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch, echoParams, hl.fl, q, rows, start]|#] [#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1| org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[] webapp=/solr3 path=/select params={hl=truehl.snippets=4hl.simple.pre=b/bfl=*,scorehl.mergeContiguous=truehl.usePhraseHighlighter=truehl.requireFieldMatch=trueechoParams=allhl.fl=text_tq={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)rows=20start=0wt=javabinversion=2} hits=204 status=500 QTime=166 |#] [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1| org.apache.solr.servlet.SolrDispatchFilter| _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery $TooManyClauses: maxClauseCount is set to 1024 at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136) at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127) at org.apache.lucene.search.ScoringRewrite $1.addClause(ScoringRewrite.java:51) at org.apache.lucene.search.ScoringRewrite $1.addClause(ScoringRewrite.java:41) at org.apache.lucene.search.ScoringRewrite $3.collect(ScoringRewrite.java:95) at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38) at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93) at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98) at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385) at
org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217) at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131) at org.apache.so
Trunk build errors
Hi, I am getting numerous errors preventing a build of solrcloud trunk. [licenses] MISSING LICENSE for the following file: Any tips to get a clean build working? thanks
filter query or boolean?
Hi, Which is faster for compound boolean expressions: filter queries, or a single query with boolean expressions? For that matter, is there any difference other than maybe speed? thanks
Re: SolrJ + SolrCloud
Thanks Mark. Is there any plan to make all the Solr search handlers work with SolrCloud, like MLT? That missing feature would prohibit us from using SolrCloud at the moment. :( On Sat, 2012-02-11 at 18:24 -0500, Mark Miller wrote: On Feb 11, 2012, at 6:02 PM, Darren Govoni wrote: Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets , tvf's, etc. Darren SolrJ works the same in SolrCloud mode as it does in non SolrCloud mode - it's fully supported. There is even a new SolrJ client called CloudSolrServer that has built in cluster awareness and load balancing. In terms of what is supported - anything that is supported with distributed search - that is most things, but there is the odd man out - like MLT - looks like an issue is open here: https://issues.apache.org/jira/browse/SOLR-788 but it's not resolved yet. - Mark Miller lucidimagination.com
SolrJ + SolrCloud
Hi, Do all the normal facilities of Solr work with SolrCloud from SolrJ? Things like /mlt, /cluster, facets , tvf's, etc. Darren
Re: Range facet - Count in facet menu != Count in search results
Double check your default operator for a faceted search vs. regular search. I caught this difference in my work that explained this difference. On Fri, 2012-02-10 at 07:45 -0800, Yuhao wrote: Jay, Was the curly closing bracket } intentional? I'm using 3.4, which also supports fq=price:[10 TO 20]. The problem is the results are not working properly. From: Jan Høydahl jan@cominvent.com To: solr-user@lucene.apache.org; Yuhao nfsvi...@yahoo.com Sent: Thursday, February 9, 2012 7:45 PM Subject: Re: Range facet - Count in facet menu != Count in search results Hi, If you use trunk (4.0) version, you can say fq=price:[10 TO 20} and have the upper bound be exclusive. -- Jan Høydahl, search solution architect Cominvent AS - www.cominvent.com Solr Training - www.solrtraining.com On 10. feb. 2012, at 00:58, Yuhao wrote: I've changed the facet.range.include option to every possible value (lower, upper, edge, outer, all)**. It only changes the count shown in the Ranges facet menu on the left. It has no effect on the count and results shown in search results, which ALWAYS is inclusive of both the lower AND upper bounds (which is equivalent to include = all). Is this by design? I would like to make the search results include the lower bound, but not the upper bound. Can I do that? My range field is multi-valued, but I don't think that should be the problem. ** Actually, it doesn't like outer for some reason, which leaves the facet completely empty.
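Jan's point about the 4.0 (trunk) syntax can be illustrated by mimicking the bracket semantics in plain code. The helper below is purely illustrative (not a Solr API); price and the 10/20 bounds come from the thread.

```python
def in_range(price, lower=10, upper=20, upper_inclusive=True):
    """Mimics Solr range-filter semantics:
    fq=price:[10 TO 20]  -> both bounds inclusive (3.x behavior)
    fq=price:[10 TO 20}  -> upper bound exclusive (4.0/trunk syntax)
    """
    if upper_inclusive:
        return lower <= price <= upper
    return lower <= price < upper

# With the exclusive upper bound, price=20 drops out of the results
# while price=10 still matches -- the behavior Yuhao was after.
assert in_range(20, upper_inclusive=True)
assert not in_range(20, upper_inclusive=False)
```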
Re: SolrCloud is in trunk.
Good job on this work. A monumental effort. On Wed, 8 Feb 2012 16:41:13 -0500, Mark Miller markrmil...@gmail.com wrote: For those that are interested and have not noticed, the latest work on SolrCloud and distributed indexing is now in trunk. SolrCloud is our name for a new set of distributed capabilities that improve upon the old style distributed search and index based replication. It provides for high availability and fault tolerance while allowing for near realtime search and an interface that matches what you are used to with previous versions of Solr. We are looking to release this in the next 4.0 release, and any feedback early users can provide will be very useful. So if you have an interest in these types of features, please take the latest trunk build for a spin and provide some feedback. There is still a lot more planned, so feel free to chime in on what you would like to see - this is essentially the end of stage one. You can read more about what we have done on the wiki: http://wiki.apache.org/solr/SolrCloud Also, a couple blog posts I recently saw pop up: http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search http://outerthought.org/blog/491-ot.html I'll contribute my own blog post as well when I get a chance, but there should be a fair amount of info there to get you started if you are interested. Thanks, - Mark Miller lucidimagination.com
Re: SolrCloud war?
UPDATE: I set my app server[1] system property jetty.port to be equal to the app servers open port and was able to get two Solr shards to talk. The overall properties I set are: App server domain 1: bootstrap_confdir collection.configName jetty.port solr.solr.home zkRun App server domain 2: bootstrap_confdir collection.configName jetty.port solr.solr.home zkHost I deployed each war app into the /solr context. I presume its needed by remote URL addressing. I checked the zookeeper config page and it shows both shards. Awesome. [1] Glassfish 3.1.1 On 02/01/2012 08:50 PM, Mark Miller wrote: I have not yet tried to run SolrCloud in another app server, but it shouldn't be a problem. One issue you might have is the fact that we count on hostPort coming from the system property jetty.port. This is set in the default solr.xml - the hostPort defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you are not going to use jetty.port. - Mark Miller lucidimagination.com On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote: Hi, I'm trying to get the SolrCloud2 examples to work using a war deployed solr into glassfish. The startup properties must be different in this case, because its having trouble connecting to zookeeper when I deploy the solr war file. Perhaps the embedded zookeeper has trouble running in an app server? Any tips appreciated! Darren On 01/30/2012 06:58 PM, Darren Govoni wrote: Hi, Is there any issue with running the new SolrCloud deployed as a war in another app server? Has anyone tried this yet? thanks.