negative array size exception

2017-03-06 Thread Walker, Darren
After migrating from standalone Solr to a load-balanced SolrCloud with 3 ZooKeepers on the 
same machines, and a collection with 3 shards (one per node), we see this logged in the UI 
on one of our Solr nodes.
Does anyone know what this is symptomatic of?

java.lang.NegativeArraySizeException
 at org.apache.lucene.util.PriorityQueue.&lt;init&gt;(PriorityQueue.java:63)
 at org.apache.lucene.util.PriorityQueue.&lt;init&gt;(PriorityQueue.java:44)
 at org.apache.solr.handler.component.ShardFieldSortedHitQueue.&lt;init&gt;(ShardFieldSortedHitQueue.java:45)
 at org.apache.solr.handler.component.QueryComponent.mergeIds(QueryComponent.java:979)
 at org.apache.solr.handler.component.QueryComponent.handleRegularResponses(QueryComponent.java:763)
 at org.apache.solr.handler.component.QueryComponent.handleResponses(QueryComponent.java:742)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:428)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:166)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:2306)
 at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:658)
 at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:464)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:345)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:296)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1691)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
 at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
 at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
 at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
 at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
 at org.eclipse.jetty.server.Server.handle(Server.java:534)
 at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
 at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
 at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
 at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
 at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
 at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
 at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
 at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
 at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
 at java.lang.Thread.run(Thread.java:745)



solr to solrcloud

2017-03-01 Thread Walker, Darren
Our out-of-the-box Solr 5.4.1 installation cannot handle the 50 GB analytics 
index anymore. We are using Sitecore 8.1 and planning to go to 8.2, but when we 
tried 8.2 we rebuilt the indexes and the site was very unresponsive, was 
missing items, and was too slow. We ended up giving that Solr server over 
92 GB of RAM and saw that java.exe needed about 60 GB to process our massive 
index. Even then we couldn't get performance back into the site and decided to 
roll back to 8.1. We looked at options for scaling out horizontally because we 
cannot keep adding RAM to one Solr server. To go to SolrCloud we built 3 
Ubuntu 14.04.5 servers, each with a 50 GB VM disk for the indexes and another 
VM disk for the OS, ZooKeeper, Java, Tomcat, and the Solr applications. Each 
of the 3 servers has 32 GB of RAM. When we move to SolrCloud on these servers, 
what is the best way to set up the SolrCloud environment so it can take the 
data that already exists in our current Solr? We have about 16 indexes for 
Sitecore, with the biggest one being analytics (around 45-50 GB).
Thanks,
Darren Walker



Re: Search opening hours

2015-09-08 Thread Darren Spehr
Sounds odd that the indexing times would change. Hopefully something else
was going on - I've not experienced this.

On Tue, Sep 8, 2015 at 4:31 AM, O. Klein <kl...@octoweb.nl> wrote:

> BTW any idea how index speed is influenced?
>
> I used worldbounds with -1 and 1 y-axes. But figured this could also be 0.
>
> After changing to 0 indexing became a lot slower though (no exceptions in
> log).
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4227531.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Darren


Re: Search opening hours

2015-09-06 Thread Darren Spehr
I think the client code has to normalize the input. There are methods in the 
spatial libraries that will do this - or maybe I wrote them in my own code, I 
can't remember. How are you handling parsing the hours?

- Darren

> On Sep 6, 2015, at 4:56 PM, O. Klein <kl...@octoweb.nl> wrote:
> 
> Saw that, but not a lot of info about it.
> 
> From my understanding, the way it's supposed to work is that a value bigger
> than the boundary gets normalized.
> 
> I just get an exception "bad x not in boundary rect"
> 
> Any pointers?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4227384.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search opening hours

2015-08-26 Thread Darren Spehr
So thanks to the tireless efforts of David Smiley and the devs at Vivid
Solutions (not to mention the various contributors that help power Solr and
Lucene) spatial search is awesome, efficient and easy.  The biggest
roadblock I've run into is not having the JTS (Java Topology Suite) JAR
where Solr can find it. It doesn't ship with Solr OOB so you have to either
add it to one of the dynamic directories, or bundle it with the WAR (I
think pre-5.0). The link above has most of what you need to index data and
issue queries. I'd also suggest the sections on spatial search in Solr In
Action (Grainger, Potter) - they add a few more use cases that I've found
interesting. Finally, the aging wiki has some good info too:

http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

Basically indexing spatial data is as easy as anything else: define the
field in the schema, create the data and push it in. Now the data in this
case are boxes or polygons (effectively the same here) and they come in a
specific format known as WKT, or Well-Known Text
(https://en.wikipedia.org/wiki/Well-known_text). I'd say unless you're
aiming at an advanced use case, set the max distance error (maxDistErr) on
the field config a little higher than normal - precision isn't really a
requirement here, and good unit tests would alert you to any unforeseen
issues. Then for the
query side of the world you just ask for point inclusion like:

q=+polygon:"Contains(POINT(my_long my_lat))"

Please note that WKT reverses the usual order of lat/lng because it follows
Euclidean geometry conventions (so X=longitude and Y=latitude). Can't tell
you how many times my brain hurt thanks to this idiom combined with janky
client logic :) Anyway, that's about it - let me know if you have any other
questions.
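To make the axis-order trap concrete, here is a tiny sketch (a hypothetical helper, not part of SolrJ or any official Solr client API):

```python
def contains_point_query(field, lat, lng):
    """Build a Solr point-in-polygon query clause.

    WKT uses Euclidean axis order: X (longitude) first, Y (latitude)
    second, the reverse of the usual lat/lng convention.
    """
    return '{0}:"Contains(POINT({1} {2}))"'.format(field, lng, lat)

# Lower Manhattan: lat 40.71, lng -74.01 -- longitude comes first in the WKT.
print(contains_point_query("polygon", 40.71, -74.01))
# -> polygon:"Contains(POINT(-74.01 40.71))"
```

Keeping the swap in one helper like this is exactly the kind of thing that protects you from the "janky client logic" above.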


On Wed, Aug 26, 2015 at 1:56 PM, O. Klein kl...@octoweb.nl wrote:

 Darren,

 This sounds like solution I'm looking for. Especially nice fix for the
 Sunday-Monday problem.

 Never worked with spatial search before, so any pointers are welcome.

 Will start working on this solution.



 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225443.html
 Sent from the Solr - User mailing list archive at Nabble.com.




-- 
Darren


Re: Search opening hours

2015-08-26 Thread Darren Spehr
Sorry - didn't finish my thought. I need to address querying :) So using
the above to define what's in the index, your queries for a day/time become
a CONTAINS operation against the field. Let's say the field is defined as a
location_rpt using JTS and its spatial factory (which supports polygons) -
oh, and it would need to be multi-valued. Querying the field would require
first translating "now" or "in an hour" or "Monday at 9am" to a geocode,
then hitting the index with a CONTAINS request per the docs:

https://cwiki.apache.org/confluence/display/solr/Spatial+Search
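For reference, a field type along these lines might look like the following (a hedged sketch based on the old SolrAdaptersForLuceneSpatial4 wiki page; class and attribute names varied between 4.x and later releases, and the `open_hours` field name is made up, so check the docs for your version):

```xml
<!-- Hypothetical schema fragment: an RPT spatial field backed by JTS,
     which is what enables polygon shapes for the "open hours" trick. -->
<fieldType name="location_rpt"
           class="solr.SpatialRecursivePrefixTreeFieldType"
           spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
           geo="true" distErrPct="0.025" maxDistErr="0.001" units="degrees"/>

<!-- multiValued so a shop can hold several open windows per week -->
<field name="open_hours" type="location_rpt" indexed="true" stored="true"
       multiValued="true"/>
```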


On Wed, Aug 26, 2015 at 11:23 AM, Darren Spehr darre...@gmail.com wrote:

 Sure - and sorry for its density. I reread it and thought the same ;)

 So imagine a polygon of say 1/2 mile width (I made that up) that stretches
 around the equator. Let's call this a week's timeline and subdivide it into
 7 blocks, one for each day. For the sake of simplicity assume it's a line
 (which I forget but is supported in Solr as an infinitely small polygon)
 starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for
 Sunday at 11:59 PM. By subdivide you can think of it either radially or by
 longitude, but you have 360 degrees to divide into 7, which means that
 every hour is represented by a range of roughly 2.143 degrees (360/7/24).
 These regions represent each day and hour (or less), and the region
 boundaries represent midnight for the day before.

 Now for indexing - your open hours then become a combination of these
 subdivisions. If you're open 24x7 then the whole polygon is indexed. If
 you're only open on Monday from 9-5 then only the polygon between
 (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you
 can index any combination of times this way.

 So now the varsity question is how to do this with a fluctuating calendar?
 I think this example can be extended to include searching against any given
 day of the week in a year, or years. Just imagine a translation layer that
 adjusts the latitude N or S by some amount to represent which day in which
 year you're looking for. Make sense?

 On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote:

 delightfully dense = really intriguing, but I couldn't quite
 understand it - really hoping for more info

 On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
  Darren,
 
  That was delightfully dense. Do you think you could unpack it a bit
  more? Possibly some sample (pseudo) queries?
 
  Upayavira
 
  On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
   If you wanted to try a spatial approach that blended times like above,
   you
   could try a polygon of minimum width that spans the globe - this is
   literally using spatial search (geocodes) against time. So in this
   scenario
   you logically subdivide the polygon into 7 distinct regions (for days)
   and
   then within this you can defined, like a timeline, what open and
 closed
   means. The problem of 3AM is taken care of because of it's continuous
   nature - ie one day is adjacent to the next, with Sunday and Monday
   backing
   up to each other. Just a thought.
  
   On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:
  
   
   
On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
 Those options don't fix my problem with closing times the next
 morning,
 or is
 there a way to do this?
   
Use the spatial model, and a time window of a week. There are 10,080
minutes in a week, so you could use that as your scale.
   
Assuming the week starts at 00:00 Monday morning, you might index
 Monday
9:00-23:00 as  540:1380
   
Tuesday 9am-Wednesday 1am would be 1980:2940
   
You convert your NOW time into a minutes since Monday 00:00 and
 do a
spatial search within that time.
   
If it is now Monday, 11:23am, that would be 11*60+23=683, so you
 would
do a search for 683:683.
   
If you have a shop that is open over Sunday night to Monday, you
 just
list it as open until Sunday 23:59 and open again Monday 00:00.
   
Would that do it?
   
Upayavira
   
  
  
  
   --
   Darren




 --
 Darren




-- 
Darren


Re: Search opening hours

2015-08-26 Thread Darren Spehr
If you wanted to try a spatial approach that blended times like above, you
could try a polygon of minimum width that spans the globe - this is
literally using spatial search (geocodes) against time. So in this scenario
you logically subdivide the polygon into 7 distinct regions (for days) and
then within this you can define, like a timeline, what "open" and "closed"
mean. The problem of 3AM is taken care of because of its continuous
nature - i.e. one day is adjacent to the next, with Sunday and Monday backing
up to each other. Just a thought.

On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:



 On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
  Those options don't fix my problem with closing times the next morning,
  or is
  there a way to do this?

 Use the spatial model, and a time window of a week. There are 10,080
 minutes in a week, so you could use that as your scale.

 Assuming the week starts at 00:00 Monday morning, you might index Monday
 9:00-23:00 as  540:1380

 Tuesday 9am-Wednesday 1am would be 1980:2940

 You convert your NOW time into a minutes since Monday 00:00 and do a
 spatial search within that time.

 If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
 do a search for 683:683.

 If you have a shop that is open over Sunday night to Monday, you just
 list it as open until Sunday 23:59 and open again Monday 00:00.

 Would that do it?

 Upayavira
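The minutes-since-Monday arithmetic quoted above can be sketched as follows (hypothetical helper names; the quoted numbers 540:1380, 1980:2940, and 683 all fall out of it):

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes on the weekly scale

def minutes_since_monday(day, hour, minute=0):
    """day: 0 = Monday ... 6 = Sunday; minutes elapsed since Monday 00:00."""
    return day * 24 * 60 + hour * 60 + minute

# Monday 9:00-23:00 indexes as 540:1380
monday = (minutes_since_monday(0, 9), minutes_since_monday(0, 23))
# Tuesday 9am - Wednesday 1am indexes as 1980:2940
tuesday = (minutes_since_monday(1, 9), minutes_since_monday(2, 1))

def is_open(window, now):
    """Is the weekly minute `now` inside the (start, end) open window?"""
    start, end = window
    return start <= now <= end

now = minutes_since_monday(0, 11, 23)  # Monday 11:23 -> 683
print(monday, tuesday, now, is_open(monday, now))
# -> (540, 1380) (1980, 2940) 683 True
```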




-- 
Darren


Re: Search opening hours

2015-08-26 Thread Darren Spehr
Sure - and sorry for its density. I reread it and thought the same ;)

So imagine a polygon of say 1/2 mile width (I made that up) that stretches
around the equator. Let's call this a week's timeline and subdivide it into
7 blocks, one for each day. For the sake of simplicity assume it's a line
(which I forget but is supported in Solr as an infinitely small polygon)
starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for
Sunday at 11:59 PM. By subdivide you can think of it either radially or by
longitude, but you have 360 degrees to divide into 7, which means that
every hour is represented by a range of roughly 2.143 degrees (360/7/24).
These regions represent each day and hour (or less), and the region
boundaries represent midnight for the day before.

Now for indexing - your open hours then become a combination of these
subdivisions. If you're open 24x7 then the whole polygon is indexed. If
you're only open on Monday from 9-5 then only the polygon between
(0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you
can index any combination of times this way.

So now the varsity question is how to do this with a fluctuating calendar?
I think this example can be extended to include searching against any given
day of the week in a year, or years. Just imagine a translation layer that
adjusts the latitude N or S by some amount to represent which day in which
year you're looking for. Make sense?
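The degree arithmetic above can be checked with a short sketch (a hypothetical helper; day 0 = Monday, with the timeline anchored at longitude -180):

```python
DEG_PER_HOUR = 360.0 / (7 * 24)  # roughly 2.143 degrees of longitude per hour

def weekly_hour_to_longitude(day, hour):
    """Map a weekly hour onto the equatorial timeline: Monday 00:00 -> -180."""
    return -180.0 + (day * 24 + hour) * DEG_PER_HOUR

# Monday 9:00-17:00 maps to roughly (0,-160.71)..(0,-143.57), as in the post.
start = weekly_hour_to_longitude(0, 9)
end = weekly_hour_to_longitude(0, 17)
print(round(start, 2), round(end, 2))
# -> -160.71 -143.57
```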

On Wed, Aug 26, 2015 at 10:50 AM, Upayavira u...@odoko.co.uk wrote:

 delightfully dense = really intriguing, but I couldn't quite
 understand it - really hoping for more info

 On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
  Darren,
 
  That was delightfully dense. Do you think you could unpack it a bit
  more? Possibly some sample (pseudo) queries?
 
  Upayavira
 
  On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
   If you wanted to try a spatial approach that blended times like above,
   you
   could try a polygon of minimum width that spans the globe - this is
   literally using spatial search (geocodes) against time. So in this
   scenario
   you logically subdivide the polygon into 7 distinct regions (for days)
   and
   then within this you can defined, like a timeline, what open and closed
   means. The problem of 3AM is taken care of because of it's continuous
   nature - ie one day is adjacent to the next, with Sunday and Monday
   backing
   up to each other. Just a thought.
  
   On Wed, Aug 26, 2015 at 5:38 AM, Upayavira u...@odoko.co.uk wrote:
  
   
   
On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
 Those options don't fix my problem with closing times the next
 morning,
 or is
 there a way to do this?
   
Use the spatial model, and a time window of a week. There are 10,080
minutes in a week, so you could use that as your scale.
   
Assuming the week starts at 00:00 Monday morning, you might index
 Monday
9:00-23:00 as  540:1380
   
Tuesday 9am-Wednesday 1am would be 1980:2940
   
You convert your NOW time into a minutes since Monday 00:00 and do
 a
spatial search within that time.
   
If it is now Monday, 11:23am, that would be 11*60+23=683, so you
 would
do a search for 683:683.
   
If you have a shop that is open over Sunday night to Monday, you just
list it as open until Sunday 23:59 and open again Monday 00:00.
   
Would that do it?
   
Upayavira
   
  
  
  
   --
   Darren




-- 
Darren


Solr 4.10.3 start up issue

2015-01-21 Thread Darren Spehr
Hi everyone -

I posted a question on stackoverflow but in hindsight this would have been
a better place to start. Below is the link.

Basically I can't get the example working when using an external ZK cluster
and auto-core discovery. Solr 4.10.1 works fine, but the newest release
never gets new nodes into the active state. There are no errors or
warnings, and compared to the log output of 4.10.1, the difference is that
nodes never make it to leader election.

Here is the stackoverflow question, along with the full log output:
http://stackoverflow.com/questions/28004832/solr-4-10-3-is-not-proceeding-to-leader-election-on-new-cluster-startup-hangs

Any help and guidance would be appreciated. Thanks!

-- 
Darren


Re: Solr 4.10.3 start up issue

2015-01-21 Thread Darren Spehr
Thanks Hoss, this is exactly what I needed. I had previously run the
example using nothing more than an external ZK hosting my own
configuration. This of course means one of two things - my conf was bad, or
Solr was at fault. The conf has been working for ages so I didn't test a
replacement (it's amazing how a little frustration can fuel such hubris). I
had thought to do this before - and should have; I uploaded the full
example collection configuration to ZK just now and tried again. Magic, it
worked, which left me feeling a bit glum. Well, happy that it wasn't Solr.
Now if you'll excuse me, I have a conf review to perform.

Darren

On Wed, Jan 21, 2015 at 6:48 PM, Chris Hostetter hossman_luc...@fucit.org
wrote:


 : I posted a question on stackoverflow but in hindsight this would have
 been
 : a better place to start. Below is the link.
 :
 : Basically I can't get the example working when using an external ZK
 cluster
 : and auto-core discovery. Solr 4.10.1 works fine, but the newest release

 your SO URL shows the output of using your custom configs, but not what
 you got with the example configs -- so it's not clear to me if there is
 really just one problem, or perhaps 2?

 you also mentioned a lot of details about how you are using solr with zk,
 and what doesn't work, but it's not clear if you tried other simpler steps
 using your configs -- or the example configs -- and if those simpler steps
 *did* work (ie: single node solr startup?)

 my best guess, based on the logs you did post and the mention of
 lib/mq/solr-search-ahead-2.0.0.jar in those logs, is that the entire
 question of zk and cluster state and leaders is a red herring, and what
 you are running into is: SOLR-6643...

 https://issues.apache.org/jira/browse/SOLR-6643

 ...if i'm right, then simple core discovery with your configs on a single
 node solr instance w/o any knowledge of ZK will also fail to init the core
 -- and if you try to use the CoreAdmin API to CREATE a core, you'll get
 some kind of LinkageError.




 : Here is the stackoverflow question, along with the full log output:
 :
 http://stackoverflow.com/questions/28004832/solr-4-10-3-is-not-proceeding-to-leader-election-on-new-cluster-startup-hangs


 -Hoss
 http://www.lucidworks.com/




-- 
Darren


RE: SolrCloud replica dies under high throughput

2014-07-23 Thread Darren Lee
Thanks, that helped. I no longer see the constant replica recovery. It also 
increased my throughput to 1.6-1.7 million updates per hour reliably. I then 
tried using SSDs instead and it flew up to 6.5 million updates per hour.

Setup:
4-node cluster of AWS m3.2xlarge servers using general-purpose SSDs.

Thanks again,
Darren


-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: 22 July 2014 00:25
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud replica dies under high throughput

Looks like you probably have to raise the http client connection pool limits to 
handle that kind of load currently.

They are specified as top level config in solr.xml:

maxUpdateConnections
maxUpdateConnectionsPerHost
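In 4.x-era solr.xml these sit in the update shard handler section; a hedged sketch (the values are made up, and placement can differ between versions, so verify against the "Format of solr.xml" page for your release):

```xml
<solr>
  <!-- Hypothetical limits: raise the inter-node update HTTP connection
       pool so heavy distributed-update traffic doesn't exhaust it. -->
  <updateshardhandler>
    <int name="maxUpdateConnections">100000</int>
    <int name="maxUpdateConnectionsPerHost">100</int>
  </updateshardhandler>
</solr>
```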

--
Mark Miller
about.me/markrmiller

On July 21, 2014 at 7:14:59 PM, Darren Lee (d...@amplience.com) wrote:
 Hi,
  
 I'm doing some benchmarking with Solr Cloud 4.9.0. I am trying to work 
 out exactly how much throughput my cluster can handle.
  
 Consistently in my test I see a replica go into recovering state 
 forever caused by what looks like a timeout during replication. I can 
 understand the timeout and failure (I am hitting it fairly hard) but 
 what seems odd to me is that when I stop the heavy load it still does 
 not recover the next time it tries, it seems broken forever until I manually 
 go in, clear the index and let it do a full resync.
  
 Is this normal? Am I misunderstanding something? My cluster has 4 
 nodes (2 shards, 2 replicas) (AWS m3.2xlarge). I am indexing with ~800 
 concurrent connections and a 10 sec soft commit.
 I consistently get this problem with a throughput of around 1.5 
 million documents per hour.
  
 Thanks all,
 Darren
  
  
 Stack Traces &amp; Messages:
  
 [qtp779330563-627] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
 at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
 at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
 at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
 at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
 at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
 at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
 at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
 at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:724)
  
 Error while trying to recover. core=assets_shard2_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr
 at java.util.concurrent.FutureTask.report(FutureTask.java:122)
 at java.util.concurrent.FutureTask.get(FutureTask.java:188)
 at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615)
 at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371)
 at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
 Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr
 at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
 at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:245)
 at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:241)
 at java.util.concurrent.FutureTask.run(FutureTask.java:262)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 Caused by: java.net.SocketException: Socket closed
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:152)
 at java.net.SocketInputStream.read(SocketInputStream.java:122)
 at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
 at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
 at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
 at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
 at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead

SolrCloud replica dies under high throughput

2014-07-21 Thread Darren Lee
Hi,

I'm doing some benchmarking with Solr Cloud 4.9.0. I am trying to work out 
exactly how much throughput my cluster can handle.

Consistently in my test I see a replica go into recovering state forever caused 
by what looks like a timeout during replication. I can understand the timeout 
and failure (I am hitting it fairly hard) but what seems odd to me is that when 
I stop the heavy load it still does not recover the next time it tries, it 
seems broken forever until I manually go in, clear the index and let it do a 
full resync.

Is this normal? Am I misunderstanding something? My cluster has 4 nodes (2 
shards, 2 replicas) (AWS m3.2xlarge). I am indexing with ~800 concurrent 
connections and a 10 sec soft commit. I consistently get this problem with a 
throughput of around 1.5 million documents per hour.

Thanks all,
Darren


Stack Traces &amp; Messages:

[qtp779330563-627] ERROR org.apache.solr.servlet.SolrDispatchFilter - null:org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
at org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:422)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)

Error while trying to recover. core=assets_shard2_replica1:java.util.concurrent.ExecutionException: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:188)
at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:615)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:371)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:235)
Caused by: org.apache.solr.client.solrj.SolrServerException: IOException occured when talking to server at: http://xxx.xxx.15.171:8080/solr
at org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:566)
at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:245)
at org.apache.solr.client.solrj.impl.HttpSolrServer$1.call(HttpSolrServer.java:241)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160)
at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84)
at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283)
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251)
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123

SolrCloud - Highly Reliable / Scalable Resources?

2014-05-12 Thread Darren Lee
Hi everyone,

We have been using Solr Cloud (4.4) for ~6 months now. Functionally it's 
excellent, but we have suffered several issues which always seem quite 
problematic to resolve.

I was wondering if anyone in the community can recommend good resources / 
reading for setting up a highly scalable / highly reliable cluster. A lot of 
what I see in the solr documentation is aimed at small setups or is quite 
sparse.

Dealing with topics like:

* Capacity planning

* Losing nodes

* Voting panic

* Recovery failure

* Replication factors

* Elasticity / Auto scaling / Scaling recipes

* Exhibitor

* Container configuration, concurrency limits, packet drop tuning

* Increasing capacity without downtime

* Scalable approaches to full indexing hundreds of millions of documents

* External health check vs CloudSolrServer

* Separate vs local zookeeper

* Benchmarks


Sorry, I know that's a lot to ask heh. We are going to run a project for a 
month or so soon where we re-write all our run books and do deeper testing on 
various failure scenarios and the above but any starting point would be much 
appreciated.

Thanks all,
Darren




MLT in SolrJ vs. URL?

2013-05-21 Thread Darren Govoni
Hi,
  I compose an MLT query in a URL and get the queried result back and a
list of documents in the moreLikeThis section in my browser.

When I try to execute the same query in SolrJ setting the same params, I
only get the queried result document back and no MLT docs.

What's the trick here?

thanks,
Darren
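
One common cause (a guess; the thread does not show the SolrJ code) is that the mlt.* parameters never make it onto the SolrJ request: each one has to be set explicitly on the SolrQuery, e.g. query.set("mlt", true). Below is a stdlib-only sketch of the parameter set a working MLT URL carries; the seed query and field names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MltParams {
    // Build a URL query string from an ordered parameter map, the way the
    // browser URL in the question carries its MoreLikeThis parameters.
    static String toQueryString(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Standard MoreLikeThis component parameters; in SolrJ each of these
        // must be set explicitly on the SolrQuery (e.g. query.set("mlt", "true")),
        // otherwise only the plain query result comes back.
        Map<String, String> p = new LinkedHashMap<>();
        p.put("q", "id:12345");          // hypothetical seed document
        p.put("mlt", "true");            // enable the MoreLikeThis section
        p.put("mlt.fl", "title,body");   // hypothetical similarity fields
        p.put("mlt.count", "5");         // similar docs per result
        System.out.println(toQueryString(p));
    }
}
```

Comparing this list against what SolrJ actually sends (e.g. in the Solr request log) usually reveals the missing parameter.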



Re: zk Config URL?

2013-02-25 Thread Darren Govoni
(AbstractInhabitantImpl.java:78)
at 
com.sun.enterprise.v3.server.AppServerStartup.run(AppServerStartup.java:253)
at 
com.sun.enterprise.v3.server.AppServerStartup.doStart(AppServerStartup.java:145)
at 
com.sun.enterprise.v3.server.AppServerStartup.start(AppServerStartup.java:136)
at 
com.sun.enterprise.glassfish.bootstrap.GlassFishImpl.start(GlassFishImpl.java:79)
at 
com.sun.enterprise.glassfish.bootstrap.GlassFishDecorator.start(GlassFishDecorator.java:63)
at 
com.sun.enterprise.glassfish.bootstrap.osgi.OSGiGlassFishImpl.start(OSGiGlassFishImpl.java:69)
at 
com.sun.enterprise.glassfish.bootstrap.GlassFishMain$Launcher.launch(GlassFishMain.java:117)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

at java.lang.reflect.Method.invoke(Method.java:597)
at 
com.sun.enterprise.glassfish.bootstrap.GlassFishMain.main(GlassFishMain.java:97)

at com.sun.enterprise.glassfish.bootstrap.ASMain.main(ASMain.java:55)
Caused by: java.lang.ClassNotFoundException: javax.servlet.Filter
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 55 more


On 02/24/2013 08:32 PM, Mark Miller wrote:

You either have to specifically upload a config set or use one of the bootstrap 
sys props.

Are you doing either?

- Mark

On Feb 24, 2013, at 8:15 PM, Darren Govoni dar...@ontrenet.com wrote:


Thanks Michael.

I went ahead and just started an external zookeeper, but my solr node throws 
exceptions from it.

Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find 
configName for collection collection1 found:null

...

[#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException:
 Unable to create core: collection1
at 
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not find 
configName for collection collection1 found:null
at org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097)
at 
org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016)
at org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
... 10 more


On 02/24/2013 07:21 PM, Michael Della Bitta wrote:

Hello Darren,

If you go into the admin and click on Cloud, you'll see that
information represented in a number of ways. Both Dump and Tree
(especially the clusterstate.json file) have this information
represented as a document in JSON format.

If you don't see the Cloud navigation on the left side of the admin
screen, that's a good indication that Solr hasn't connected to
Zookeeper.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote:

Hi,
I'm trying the latest solrcloud 4.1. Is there a button(or url) I can't
find that shows me the zookeeper config XML,
so I can check what other nodes are connected? Can't seem to find it.

I deploy my solrcloud war into glassfish and set jetty.port (among other
properties) to the GF domain port (e.g. 8181).
It starts successfully.

I want zookeeper to run automatically within (as needed). How can I verify
this or refer to
the first/master server using zkHost from another node? (e.g. {host}:{port})
to form a cluster.

I did this before a while ago, before solr 4.x was released, but things have
changed.

tips appreciated. thank you.
Darren




Re: zk Config URL?

2013-02-25 Thread darren
Ok. But it's more complicated than it should be. It should work smarter.


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Anirudha Jadhav aniru...@nyu.edu 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: zk Config URL? 
 
Solr cloud reads Solr cfg files from zookeeper.

You need to push the cfg to zookeeper and link the collection to the cfg.
This is exactly what Mark suggested earlier in the thread. This is also
explained in the Solr Cloud wiki.

On Monday, February 25, 2013, Darren Govoni wrote:

 Hi Mark,

    I download latest zk, and run it.

    In my glassfish server, I set these system wide properties:

 numShards = 1
 zkHost = 10.x.x.x:2181
 jetty.port = 8080 (port of my domain)
 bootstrap_config = true

 I copy all the solr 4.1 dist/*.jar into my glassfish domain lib/ext
 directory. Then I deploy solr 4.1 war.
 It throws this exception always.

 [#|2013-02-25T13:31:32.304+|INFO|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WEB0171: Created virtual server [__asadmin]|#]

 [#|2013-02-25T13:31:32.768+|INFO|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WEB0172: Virtual server [server] loaded default web module []|#]

 [#|2013-02-25T13:31:34.222+|WARNING|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=10;_ThreadName=Thread-2;|DPL8007: Unsupported deployment descriptors element schemaLocation value http://www.bea.com/ns/weblogic/90 http://www.bea.com/ns/weblogic/90/weblogic-web-app.xsd|#]

 [#|2013-02-25T13:31:34.223+|SEVERE|glassfish3.1.2|javax.enterprise.system.tools.deployment.org.glassfish.deployment.common|_ThreadID=10;_ThreadName=Thread-2;|DPL8006: get/add descriptor failure : filter-dispatched-requests-enabled TO false|#]

 [#|2013-02-25T13:31:34.831+|SEVERE|glassfish3.1.2|javax.enterprise.system.container.web.com.sun.enterprise.web|_ThreadID=10;_ThreadName=Thread-2;|WebModule[/solr1]PWC1270: Exception starting filter SolrRequestFilter
 java.lang.NoClassDefFoundError: javax/servlet/Filter
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClassCond(ClassLoader.java:631)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:615)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:141)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:283)
 at java.net.URLClassLoader.access$000(URLClassLoader.java:58)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:197)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
 at sun.misc.Launcher$ExtClassLoader.findClass(Launcher.java:229)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
 at com.sun.enterprise.v3.server.APIClassLoaderServiceImpl$APIClassLoader.loadClass(APIClassLoaderServiceImpl.java:206)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:295)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
 at org.glassfish.web.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1456)
 at org.glassfish.web.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1359)
 at org.apache.catalina.core.ApplicationFilterConfig.loadFilterClass(ApplicationFilterConfig.java:280)
 at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:250)
 at org.apache.catalina.core.ApplicationFilterConfig.init(ApplicationFilterConfig.java:120)
 at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4685)
 at org.apache.catalina.core.StandardContext.start(StandardContext.java:5377)
 at com.sun.enterprise.web.WebModule.start(WebModule.java:498)
 at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:917)
 at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:901)
 at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:733)
 at com.sun.enterprise.web.WebContainer.loadWebModule(WebContainer.java:2019)
 at com.sun.enterprise.web.WebContainer.loadWebModule(WebContainer.java:1669)
 at com.sun.enterprise.web.WebApplication.start(WebApplication.java:109)
 at org.glassfish.internal.data.EngineRef.start(EngineRef.java:130)
 at org.glassfish.internal.data.ModuleInfo.start(ModuleInfo.java:269

zk Config URL?

2013-02-24 Thread Darren Govoni

Hi,
   I'm trying the latest solrcloud 4.1. Is there a button(or url) I 
can't find that shows me the zookeeper config XML,

so I can check what other nodes are connected? Can't seem to find it.

I deploy my solrcloud war into glassfish and set jetty.port (among other 
properties) to the GF domain port (e.g. 8181).

It starts successfully.

I want zookeeper to run automatically within (as needed). How can I 
verify this or refer to
the first/master server using zkHost from another node? (e.g. 
{host}:{port}) to form a cluster.


I did this before a while ago, before solr 4.x was released, but things 
have changed.


tips appreciated. thank you.
Darren


Re: zk Config URL?

2013-02-24 Thread Darren Govoni

Thanks Michael.

I went ahead and just started an external zookeeper, but my solr node 
throws exceptions from it.


Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not 
find configName for collection collection1 found:null


...

[#|2013-02-24T20:13:58.451-0500|SEVERE|glassfish3.1.2|org.apache.solr.core.CoreContainer|_ThreadID=28;_ThreadName=Thread-2;|null:org.apache.solr.common.SolrException: 
Unable to create core: collection1
at 
org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1654)

at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1039)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:629)
at org.apache.solr.core.CoreContainer$3.call(CoreContainer.java:624)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)

at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
Caused by: org.apache.solr.common.cloud.ZooKeeperException: Could not 
find configName for collection collection1 found:null
at 
org.apache.solr.cloud.ZkController.getConfName(ZkController.java:1097)
at 
org.apache.solr.cloud.ZkController.createCollectionZkNode(ZkController.java:1016)
at 
org.apache.solr.core.CoreContainer.createFromZk(CoreContainer.java:937)

at org.apache.solr.core.CoreContainer.create(CoreContainer.java:1031)
... 10 more


On 02/24/2013 07:21 PM, Michael Della Bitta wrote:

Hello Darren,

If you go into the admin and click on Cloud, you'll see that
information represented in a number of ways. Both Dump and Tree
(especially the clusterstate.json file) have this information
represented as a document in JSON format.

If you don't see the Cloud navigation on the left side of the admin
screen, that's a good indication that Solr hasn't connected to
Zookeeper.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Sun, Feb 24, 2013 at 6:34 PM, Darren Govoni dar...@ontrenet.com wrote:

Hi,
I'm trying the latest solrcloud 4.1. Is there a button(or url) I can't
find that shows me the zookeeper config XML,
so I can check what other nodes are connected? Can't seem to find it.

I deploy my solrcloud war into glassfish and set jetty.port (among other
properties) to the GF domain port (e.g. 8181).
It starts successfully.

I want zookeeper to run automatically within (as needed). How can I verify
this or refer to
the first/master server using zkHost from another node? (e.g. {host}:{port})
to form a cluster.

I did this before a while ago, before solr 4.x was released, but things have
changed.

tips appreciated. thank you.
Darren




RE: SolrJ and Solr 4.0 | doc.getFieldValue() returns String instead of Date

2013-01-08 Thread Darren Govoni

SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss.S'Z'");
Date dateObj = df.parse("2009-10-29T00:00:009Z");

--- Original Message ---
On 1/8/2013 09:34 AM uwe72 wrote:
A Lucene 4.0 document returns for a Date field now a string value, instead of
a Date object.

field name=ModuleImpl.versionAsDate view=Datenstand type=date

Solr4.0 -- 2009-10-29T00:00:009Z
Solr3.6 -- Date instance

Can this be set somewhere in the config?

I prefer to receive a date instance

--
View this message in context:
http://lucene.472066.n3.nabble.com/SolrJ-and-Solr-4-0-doc-getFieldValue-returns-String-instead-of-Date-tp4031588.html
Sent from the Solr - User mailing list archive at Nabble.com.
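
The snippet above, as a runnable pure-JDK sketch. Note two deviations, both assumptions on my part: it uses HH (24-hour) rather than hh, and it parses a well-formed variant of the thread's example string, since "00:00:009Z" does not actually match the ss.S'Z' pattern:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class SolrDateParse {
    // In this pattern the trailing 'Z' is a quoted literal, not a zone
    // designator, so the parser's time zone must be forced to UTC.
    static final String PATTERN = "yyyy-MM-dd'T'HH:mm:ss.S'Z'";

    static Date parse(String s) throws Exception {
        SimpleDateFormat df = new SimpleDateFormat(PATTERN);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        return df.parse(s);
    }

    public static void main(String[] args) throws Exception {
        // A well-formed variant of the date string from the thread.
        Date d = parse("2009-10-29T00:00:00.0Z");
        System.out.println(d.getTime());
    }
}
```

SimpleDateFormat is also not thread-safe, so a fresh instance (or a thread-local one) per call is the safer choice in server code.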


RE: RE: Max number of core in Solr multi-core

2013-01-07 Thread Darren Govoni

This should be clarified some. In the client API, SolrServer represents a 
connection to a single server backend/endpoint and should be re-used where possible.

The approach being discussed is to have one client connection (represented by the SolrServer class) per Solr core, all residing in a single Solr server (as is the case below, but not required).


--- Original Message ---
On 1/7/2013 08:06 AM Jay Parashar wrote:
This is the exact approach we use in our multithreaded env. One server per
core. I think this is the recommended approach.

-Original Message-
From: Parvin Gasimzade [mailto:parvin.gasimz...@gmail.com]
Sent: Monday, January 07, 2013 7:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Max number of core in Solr multi-core

I know that but my question is different. Let me ask it in this way.

I have a solr with base url localhost:8998/solr and two solr cores as
localhost:8998/solr/core1 and localhost:8998/solr/core2.

I have one base SolrServer instance initialized as:
SolrServer server = new HttpSolrServer( url );

I have also created SolrServers for each core as:
SolrServer core1 = new HttpSolrServer( url + "/core1" );
SolrServer core2 = new HttpSolrServer( url + "/core2" );

Since there are many cores, I have to initialize a SolrServer as shown above.
Is there a way to create only one SolrServer with the base url and access
each core using it? If it is possible, then I don't need to create a new
SolrServer for each core.

On Mon, Jan 7, 2013 at 2:39 PM, Erick Erickson erickerick...@gmail.com wrote:

 This might help:
 https://wiki.apache.org/solr/Solrj#HttpSolrServer

 Note that the associated SolrRequest takes the path, I presume
 relative to the base URL you initialized the HttpSolrServer with.

 Best
 Erick

 On Mon, Jan 7, 2013 at 7:02 AM, Parvin Gasimzade parvin.gasimz...@gmail.com
  wrote:

  Thank you for your responses. I have one more question related to
  Solr multi-core.
  By using SolrJ I create a new core for each application. When a user
  wants to add data or make a query on his application, I create a new
  HttpSolrServer for this core. In this scenario there will be many
  running HttpSolrServer instances.

  Is there a better solution? Does it cause a problem to run many
  instances at the same time?

  On Wed, Jan 2, 2013 at 5:35 PM, Per Steffensen st...@designware.dk
  wrote:

   g a collection per application instead of a core


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
This is a good explanation and makes sense. The one inconsistency is referring 
to a replica of a shard that has no replication. But it's not that big of a 
problem. If you wove the term 'core' into your writeup below, it would be 
complete and should be posted on the wiki.



Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Jack Krupansky j...@basetechnology.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Replication makes perfect sense even if our explanations so far do not.

A shard is an abstraction of a subset of the data for a collection.

A replica is an instance of the data of the shard and instances of Solr 
servers that have indicated a readiness to service queries and updates for 
the data. Alternatively, a replica is a node which has indicated a readiness 
to receive and serve the data of a shard, but may not have any data at the 
moment.

Let's describe it operationally for SolrCloud: If data comes in to any 
replica of a shard, it will automatically and quickly be replicated to all 
other replicas of the shard. If a new replica of a shard comes up, it will be 
streamed all of the data from another replica of the shard. If an 
existing replica of a shard restarts or reconnects to the cluster, it will 
be streamed updates of any new data since it was last updated from another 
replica of the shard.

Replication is simply the process of assuring that all replicas are kept up 
to date. That's the same abstract meaning as for Master/Slave even though 
the operational details are somewhat different. The goal remains the same.

Replication factor is the number of instances of the data of the shard and 
instances of Solr servers that can service queries and updates for the data. 
Alternatively, the replication factor is the number of nodes of the 
SolrCloud cluster  which have indicated a readiness to receive and serve the 
data of a shard, but may not have any data at the moment.

A node is an instance of Solr running in a Java JVM that has indicated to 
the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a 
replica for a shard of a collection. [The latter part of that is a bit too 
fuzzy - I'm not sure what the node tells Zookeeper and who does shard 
assignment. I mean, does a node explicitly say what shard it wants to be, or 
is that assigned by Zookeeper, or is that a node's choice/option? But none 
of that changes the fact that a node registers with Zookeeper and then 
somehow becomes a replica for a shard.]

A node (instance of a Solr server) can be a replica of shards from multiple 
collections (potentially multiple shards per collection). A node is not a 
replica per se, but a container that can serve multiple collections. A node 
can serve as multiple replicas, each of a different collection.

My only interest here on this user list is to understand and explain the 
terms we have today and that SEEM to be working for the most part, even 
though we may not have defined them carefully enough and used them 
consistently enough.

If somebody wants to propose an alternative terminology - fine, discuss that 
on the dev list and/or file a Jira.

I won't claim that my definitions are perfect (yet), but perfecting the 
definitions (for users) should be separated from changing the terms 
themselves.

-- Jack Krupansky

-Original Message- 
From: Per Steffensen
Sent: Friday, January 04, 2013 2:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

On 1/3/13 5:58 PM, Walter Underwood wrote:
 A factor is multiplied, so multiplying the leader by a replicationFactor 
 of 1 means you have exactly one copy of that shard.

 I think that recycling the term replication within Solr was confusing, 
 but it is a bit late to change that.

 wunder
Yes, the term factor is not misleading, but the term replication is.
If we keep calling shard-instances for Replica I guess replicaFactor
will be ok - at least much better than replicationFactor. But it would
still be better with e.g. ShardInstance and InstancesPerShard 



Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Yes. That's it. It's clear if we separate logical terms from physical terms. A 
simple cake diagram on the wiki, along with perhaps a UML diagram, would solidify these 
concepts.


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Jack Krupansky j...@basetechnology.com 
Date:  
To: solr-user@lucene.apache.org,darren dar...@ontrenet.com 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
I thought about adding Solr core, but it only muddies the water. Yes, it 
needs to be added, but carefully.

In the context of SolrCloud, a Solr core is the underlying representation of 
a replica. Alternatively, a replica of a shard of a collection is 
implemented as a Solr core. [Need to factor in the potential for multiple 
shards on a single node.] Or, a Solr core is capable of serving as a replica 
of a shard. A Solr core has a collection name but can exist without being 
registered with Zookeeper, so it may not be a replica of a 
zookeeper-registered collection.

Something like that. Not quite there yet.

The main point, I think, is that when we talk about SolrCloud or a Solr 
cluster it would be better for people to speak of replicas and shards and 
collections than cores since core is the implementation rather than the 
abstraction. I mean, at the level of cores, they know of only documents and 
fields, not shards, replicas, and the overall structure of collections and 
the cluster. Sure, the core has the name of the collection, but cores on 
other nodes can use that same name.

-- Jack Krupansky

-Original Message- 
From: darren
Sent: Friday, January 04, 2013 9:00 AM
To: j...@basetechnology.com ; solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

This is a good explanation and makes sense. The one inconsistency is 
referring to a replica of a shard that has no replication. But it's not that 
big of a problem. If you wove the term 'core' into your writeup below, it 
would be complete and should be posted on the wiki.



Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Jack Krupansky j...@basetechnology.com
Date:
To: solr-user@lucene.apache.org
Subject: Re: Terminology question: Core vs. Collection vs...

Replication makes perfect sense even if our explanations so far do not.

A shard is an abstraction of a subset of the data for a collection.

A replica is an instance of the data of the shard and instances of Solr
servers that have indicated a readiness to service queries and updates for
the data. Alternatively, a replica is a node which has indicated a readiness
to receive and serve the data of a shard, but may not have any data at the
moment.

Let's describe it operationally for SolrCloud: If data comes in to any
replica of a shard, it will automatically and quickly be replicated to all
other replicas of the shard. If a new replica of a shard comes up, it will be
streamed all of the data from another replica of the shard. If an
existing replica of a shard restarts or reconnects to the cluster, it will
be streamed updates of any new data since it was last updated from another
replica of the shard.

Replication is simply the process of assuring that all replicas are kept up
to date. That's the same abstract meaning as for Master/Slave even though
the operational details are somewhat different. The goal remains the same.

Replication factor is the number of instances of the data of the shard and
instances of Solr servers that can service queries and updates for the data.
Alternatively, the replication factor is the number of nodes of the
SolrCloud cluster  which have indicated a readiness to receive and serve the
data of a shard, but may not have any data at the moment.

A node is an instance of Solr running in a Java JVM that has indicated to
the Zookeeper ensemble of a SolrCloud cluster that it is ready to be a
replica for a shard of a collection. [The latter part of that is a bit too
fuzzy - I'm not sure what the node tells Zookeeper and who does shard
assignment. I mean, does a node explicitly say what shard it wants to be, or
is that assigned by Zookeeper, or is that a node's choice/option? But none
of that changes the fact that a node registers with Zookeeper and then
somehow becomes a replica for a shard.]

A node (instance of a Solr server) can be a replica of shards from multiple
collections (potentially multiple shards per collection). A node is not a
replica per se, but a container that can serve multiple collections. A node
can serve as multiple replicas, each of a different collection.

My only interest here on this user list is to understand and explain the
terms we have today and that SEEM to be working for the most part, even
though we may not have defined them carefully enough and used them
consistently enough.

If somebody wants to propose an alternative terminology - fine, discuss that
on the dev list and/or file a Jira.

I won't claim that my definitions are perfect (yet

Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Agreed. But for completeness can it be node/collection/shard/replica/core?


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Yonik Seeley yo...@lucidworks.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote:
 Our biggest problem is that we really havent decided once and for all and
 made sure to reflect the decision consistently across code and
 documentation. As long as we havnt I believe it is still ok to change our
 minds.

IMO, I *think* it's settled: a collection consists of 1 or more
shards, which each consist of one or more replicas.

A *long* time ago (3 years actually), I tried to get slice used in
place of shard just because shard was already used ambiguously by
people for both physical and logical shards, but it never caught on,
and as I recall no one could really agree on a set of terms that
satisfied everyone.  Attempting to replace Replica with something
like Shard Instance could actually end up being worse since it's a
mouthful and people would tend to shorten it to shard when talking
about it.

From a practical standpoint, I don't think people will be confused by
the current terminology once we document it well (we should probably
start with collection/shard/replica).  It's mostly an issue of when
one goes looking for inconsistencies or things that might not make
sense.  And as has been pointed out, others use the exact same
terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication

In the *code* I have been migrating away from shard as the physical
kind.  I've also used slice as a synonym for logical shard in the
code because of this mixed history of shard and since removing all
remnants of the use of shard as physical all at once would be
impractical.  Anyone who works on the code should not be bothered by
an extra synonym, and things will continue to be cleaned up over time.

-Yonik
http://lucidworks.com


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Actually. Node/collection/shard/replica/core/index


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: darren dar...@ontrenet.com 
Date:  
To: yo...@lucidworks.com,solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Agreed. But for completeness can it be node/collection/shard/replica/core?


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Yonik Seeley yo...@lucidworks.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 

On Fri, Jan 4, 2013 at 2:26 AM, Per Steffensen st...@designware.dk wrote:
 Our biggest problem is that we really havent decided once and for all and
 made sure to reflect the decision consistently across code and
 documentation. As long as we havnt I believe it is still ok to change our
 minds.

IMO, I *think* it's settled: a collection consists of 1 or more
shards, which each consist of one or more replicas.

A *long* time ago (3 years actually), I tried to get slice used in
place of shard just because shard was already used ambiguously by
people for both physical and logical shards, but it never caught on,
and as I recall no one could really agree on a set of terms that
satisfied everyone.  Attempting to replace Replica with something
like Shard Instance could actually end up being worse since it's a
mouthful and people would tend to shorten it to shard when talking
about it.

From a practical standpoint, I don't think people will be confused by
the current terminology once we document it well (we should probably
start with collection/shard/replica).  It's mostly an issue of when
one goes looking for inconsistencies or things that might not make
sense.  And as has been pointed out, others use the exact same
terminology: http://www.datastax.com/docs/1.0/cluster_architecture/replication

In the *code* I have been migrating away from shard as the physical
kind.  I've also used slice as a synonym for logical shard in the
code because of this mixed history of shard and since removing all
remnants of the use of shard as physical all at once would be
impractical.  Anyone who works on the code should not be bothered by
an extra synonym, and things will continue to be cleaned up over time.

-Yonik
http://lucidworks.com
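
The agreed containment hierarchy can be sketched as a tiny data model
(illustrative Python only, not a Solr API; the cluster layout and names
are made up):

```python
# Illustrative model (not a Solr API) of the agreed containment
# hierarchy: a collection consists of one or more shards, and each
# shard consists of one or more replicas, each hosted as a core.
cluster = {
    "collection1": {                                  # logical collection
        "shard1": ["node1/core_a", "node2/core_b"],   # replicas of shard1
        "shard2": ["node2/core_c", "node3/core_d"],   # replicas of shard2
    }
}

# Every replica belongs to exactly one shard of exactly one collection.
replica_count = sum(
    len(replicas)
    for shards in cluster.values()
    for replicas in shards.values()
)
print(replica_count)  # 4
```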


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
My understanding is that core is a logical Solr term and index is a physical
Lucene term. A Solr core is backed by a physical Lucene index - one index per
core. The Solr team can correct me if it's not accurate. :)


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Alexandre Rafalovitch arafa...@gmail.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Can I just start by saying that this was AMAZING. :-) When I asked the
question, I certainly did not expect this level of detail.

And I vote for the cake diagram for the WIKI as well. Perhaps two, with the
first one showing the trivial collapsed state of a single
collection/shard/replica/core. The trivial one will also help to explain
why the example is now called 'collection1'.

I think I followed everything, except for the just-added term 'index'. Isn't
that the same as 'core'? Or can we have several indexes in one core?

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jan 4, 2013 at 10:11 AM, darren dar...@ontrenet.com wrote:

 This is the containment hierarchy i understand but includes both physical
 and logical.

  Original message 
 From: darren dar...@ontrenet.com
 Date:
 To: dar...@ontrenet.com,yo...@lucidworks.com,solr-user@lucene.apache.org
 Subject: Re: Terminology question: Core vs. Collection vs...

 Actually. Node/collection/shard/replica/core/index



  Original message 
 From: darren dar...@ontrenet.com
 Date:
 To: yo...@lucidworks.com,solr-user@lucene.apache.org
 Subject: Re: Terminology question: Core vs. Collection vs...


 Agreed. But for completeness can it be node/collection/shard/replica/core?




Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
I agree. In my opinion, index is a low-level Lucene thing. I never say a
collection has an index directly - that confuses levels and creates confusion,
to me at least. I think the terminology discussed is good; just some lingering
usage inconsistencies.


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Alexandre Rafalovitch arafa...@gmail.com 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Hmm. Doesn't that make (logical) index=collection? And (physical)
index=core? Which creates duplication of terminology and at the same time
can cause confusion between highest logical and lowest physical level.

Regards,
   Alex.
P.s. Hoping not to start a new terminology war.

Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Fri, Jan 4, 2013 at 1:21 PM, Jack Krupansky j...@basetechnology.comwrote:

 The entire collection does have an index - a distributed index - which
 consists of a Lucene index on each core/replica for the subset of the data
 in that shard.

 -- Jack Krupansky






Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread darren
Good point. Agree.


Sent from my Verizon Wireless 4G LTE Smartphone

 Original message 
From: Upayavira u...@odoko.co.uk 
Date:  
To: solr-user@lucene.apache.org 
Subject: Re: Terminology question: Core vs. Collection vs... 
 
Using your terminology, I'd say core is a physical Solr term, and index
is a physical Lucene term. A collection or a shard is a logical Solr
term.

Upayavira

 


Re: Terminology question: Core vs. Collection vs...

2013-01-04 Thread Darren Govoni
Yes. In that case, core is best described as a logical Solr
entity with various managed attributes
and qualities above the physical layer (sorry, not trying to perpetuate
this thread so much).


On 01/04/2013 01:55 PM, Mark Miller wrote:

Currently a SolrCore is 1:1 with a low-level Lucene index. There is no reason
that needs to always be that way. It's possible that we may at some point add
built-in micro-sharding support that means a SolrCore could have multiple
underlying Lucene indexes. Or we may not.

- Mark








RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni
Good write up.

And what about node?

I think there needs to be an official glossary of terms that is sanctioned by
the Solr team, and some terms still in use may need to be labeled deprecated.
After so many years, it's still confusing.

--- Original Message ---
On 1/3/2013 08:07 AM Jack Krupansky wrote:

Collection is the more modern term and incorporates the fact that the
collection may be sharded, with each shard on one or more cores, with
each core being a replica of the other cores within that shard of that
collection.

Instance is a general term, but is commonly used to refer to a running
Solr server, each of which can service any number of cores. A sharded
collection would typically require multiple instances of Solr, each
with a shard of the collection.

Multiple collections can be supported on a single instance of Solr.
They don't have to be sharded or replicated. But if they are, each Solr
instance will have a copy or replica of the data (index) of one shard
of each sharded collection - to the degree that each collection needs
that many shards.

At the API level, you talk to a Solr instance, using a host and port,
and giving the collection name. Some operations will refer only to the
portion of a multi-shard collection on that Solr instance, but
typically Solr will distribute the operation, whether it be an update
or a query, to all of the shards of the named collection. In the case
of an update, the update will be distributed to all replicas as well,
but in the case of a query only one replica of each shard of the
collection is needed.

Before SolrCloud, Solr had master and slave and the slaves were
replicas of the master, but with SolrCloud there is no master and all
the replicas of the shard are peers, although at any moment in time one
of them will be considered the leader for coordination purposes, but
not in the sense that it is a master of the other replicas in that
shard. A SolrCloud replica is a replica of the data, in an abstract
sense, for a single shard of a collection. A SolrCloud replica is more
of an instance of the data/index.

An index exists at two levels: the portion of a collection on a single
Solr core will have a Lucene index, but collectively the Lucene indexes
for the shards of a collection can be referred to as the index of the
collection. Each replica is a copy or instance of a portion of the
collection's index.

The term slice is sometimes used to refer collectively to all of the
cores/replicas of a single shard, or sometimes to a single replica as
it contains only a slice of the full collection data.

-- Jack Krupansky

-Original Message-
From: Alexandre Rafalovitch
Sent: Thursday, January 03, 2013 4:42 AM
To: solr-user@lucene.apache.org
Subject: Terminology question: Core vs. Collection vs...

Hello,

I am trying to understand the core Solr terminology. I am looking for
correct rather than loose meaning, as I am trying to teach an example
that starts from an easy scenario and may scale to a multi-core,
multi-machine situation.

Here are the terms that seem to be all overlapping and/or crossing over
in my mind at the moment.

1) Index
2) Core
3) Collection
4) Instance
5) Replica (Replica of _what_?)
6) Others?

I tried looking through the documentation, but either there is a
terminology drift or I am having trouble understanding the
distinctions.

If anybody has a clear picture in their mind, I would appreciate a
clarification.

Regards,
   Alex.

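Jack's point about update versus query fan-out can be sketched as a tiny
model (illustrative Python, not Solr code; the collection layout and
replica names are made up):

```python
import random

# Sketch (illustrative, not Solr code) of the fan-out described above:
# an update must reach every replica of every shard, while a query
# needs only one replica per shard to cover the whole collection.
collection = {
    "shard1": ["shard1_replica1", "shard1_replica2"],
    "shard2": ["shard2_replica1", "shard2_replica2"],
}

def update_targets(coll):
    # updates are distributed to all replicas of all shards
    return [r for replicas in coll.values() for r in replicas]

def query_targets(coll, rng=random):
    # a query picks just one replica per shard
    return [rng.choice(replicas) for replicas in coll.values()]

print(len(update_targets(collection)))  # 4
print(len(query_targets(collection)))   # 2
```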

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Thanks again. (And sorry to jump into this convo.)

But I had a question on your statement:

On 1/3/2013 08:07 AM Jack Krupansky wrote:
 Collection is the more modern term and incorporates the fact that the
 collection may be sharded, with each shard on one or more cores, with
 each core being a replica of the other cores within that shard of that
 collection.

A collection is sharded, meaning it is distributed across cores. A shard
itself is not distributed across cores in the same sense. Rather, a shard
exists on a single core and is replicated on other cores. Is that right?
The way it's worded above, it sounds like a shard can also be sharded...


--- Original Message ---
On 1/3/2013 08:28 AM Jack Krupansky wrote:

A node is a machine in a cluster or cloud (graph). It could be a real
machine or a virtualized machine. Technically, you could have multiple
virtual nodes on the same physical box. Each Solr replica would be on a
different node.

Technically, you could have multiple Solr instances running on a single
hardware node, each with a different port. They are simply instances of
Solr, although you could consider each Solr instance a node in a Solr
cloud as well, a virtual node. So, technically, you could have multiple
replicas on the same node, but that sort of defeats most of the purpose
of having replicas in the first place - to distribute the data for
performance and fault tolerance. But, you could have replicas of
different shards on the same node/box for a partial improvement of
performance and fault tolerance.

A Solr 'cloud' is really a cluster.

-- Jack Krupansky

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Thanks. I got that part.

A group of shards (and therefore cores) represents a collection, yes. But a
single shard exists only on a single core?

--- Original Message ---
On 1/3/2013 09:03 AM Jack Krupansky wrote:

No, a shard is a subset (or slice) of the collection. Sharding is a way
of slicing the original data, before we talk about how the shards get
stored and replicated on actual Solr cores. Replicas are instances of
the data for a shard.

Sometimes people may loosely speak of a replica as being a shard, but
that's just loose use of the terminology.

So, we're not sharding shards, but we are replicating shards.

-- Jack Krupansky
RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

I think what's confusing about your explanation below is when you have a
situation where there is no replication factor. That's possible too, yes?

So in that case, is each core of a shard of a collection still referred to as
a replica?

To me a replica is a duplicate/backup of a shard's core, not the sharded core
itself. Or is there just no difference - even a non-replicated core is called
a replica?


--- Original Message ---
On 1/3/2013 09:08 AM Jack Krupansky wrote:

Oops... let me word that a little more carefully:

...we are replicating the data of each shard.

-- Jack Krupansky

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Yes. And it's worth noting that when you have multiple shards in a single
node (@deprecated), they are shards of different collections...

--- Original Message ---
On 1/3/2013 09:16 AM Jack Krupansky wrote:

And I would revise node to note that in SolrCloud a node is simply an
instance of a Solr server.

And, technically, you can have multiple shards in a single instance of
Solr, separating the logical sharding of keys from the distribution of
the data.

-- Jack Krupansky
brbrSolr
brbrbrserver, each of which can service any number of cores. A sharded
brbrcollection
brbrbrwould typically require multiple instances of Solr, each with a
brshard of
brbrthe
brbrbrcollection.
brbrbr
brbrbrMultiple collections can be supported on a single instance of Solr.
brThey
brbrbrdon't have to be sharded or replicated. But if they are, each Solr
brbrinstance
brbrbrwill have a copy or replica of the data (index) of one shard of each
brbrsharded
brbrbrcollection - to the degree that each collection needs that many
brshards.
brbrbr
brbrbrAt the API level, you talk to a Solr instance, using a host and
brport,
brbrand
brbrbrgiving

RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Ah, ok. Good. Makes sense.

I think I will draw all this up in a UML diagram that includes the distinction
between the logical terms and the physical terms (and their mapping), as they
do get intertwined. I'll post it here when I'm done.

--- Original Message ---
On 1/3/2013 09:19 AM Jack Krupansky wrote:
A single shard MAY exist on a single core, but only if it is not replicated.
Generally, a single shard will exist on multiple cores, each a replica of
the source data as it comes into the update handler.

-- Jack Krupansky

-----Original Message-----
From: Darren Govoni
Sent: Thursday, January 03, 2013 9:10 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Thanks. I got that part.

A group of shards (and therefore cores) represent a collection, yes. But a
single shard exists only on a single core?


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

Great point.

--- Original Message ---
On 1/3/2013 10:42 AM Per Steffensen wrote:
On 1/3/13 4:33 PM, Mark Miller wrote:
> This has pretty much become the standard across other distributed systems
> and in the literat…err…books.
Hmmm, I'm not sure you are right about that. Maybe more than one
distributed system calls them Replica, but there are also a lot that
don't. But if you are right, that's at least a good valid argument to do
it this way, even though I generally prefer good logical naming over
following bad naming from the industry :-) Just because there is a lot
of crap out there doesn't mean that we also want to make crap. Maybe
good logical naming could even be a small entry in the "Why Solr is
better than its competitors" list :-)


RE: Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni

And, based on the previous explanation, there is never a "copy" of a shard. A
shard represents and contains only replicas of itself, replicas being copies
of cores within the shard.

--- Original Message ---
On 1/3/2013 11:58 AM Walter Underwood wrote:
A factor is multiplied, so multiplying the leader by a replicationFactor of
1 means you have exactly one copy of that shard.

I think that recycling the term "replication" within Solr was confusing, but
it is a bit late to change that.

wunder

On Jan 3, 2013, at 7:33 AM, Mark Miller wrote:

> This has pretty much become the standard across other distributed systems
> and in the literat…err…books.
>
> I first implemented it as you mention you'd like, but Yonik correctly
> pointed out that we were going against the grain.
>
> - Mark
>
> On Jan 3, 2013, at 10:01 AM, Per Steffensen st...@designware.dk wrote:
>
> For the same reasons that Replica shouldn't be called Replica (it requires
> too long an explanation to agree that it is an OK name), replicationFactor
> shouldn't be called replicationFactor as long as it refers to the TOTAL
> number of cores you get for your shard. replicationFactor would be an OK
> name if replicationFactor=0 meant one core, replicationFactor=1 meant two
> cores, etc., but as long as replicationFactor=1 means one core and
> replicationFactor=2 means two cores, it is bad naming (you will not get
> any replication with replicationFactor=1 - WTF!?!?). If we want to insist
> that you specify the total number of cores, at least use replicaPerShard
> instead of replicationFactor, or even better, rename Replica to
> Shard-instance and use instancesPerShard instead of replicationFactor.
>
> Regards, Per Steffensen
>
> On 1/3/13 3:52 PM, Per Steffensen wrote:
>
> Hi
>
> Here is my version - I do not believe the explanations have been very
> clear.
>
> We have the following concepts (here I will try to explain what each
> concept covers without naming it - it's hard):
>
> 1) Machines (virtual or physical) running Solr server JVMs (one machine
> can run several Solr server JVMs if you like)
> 2) Solr server JVMs
> 3) Logical stores where you can add/update/delete data-instances (closest
> to logical tables in an RDBMS)
> 4) Logical slices of a store (closest to non-overlapping logical sets of
> rows for the logical table in an RDBMS)
> 5) Physical instances of slices (a physical (disk/memory) instance of a
> logical slice). This is where data actually goes on disk - the logical
> stores and slices above are just non-physical concepts.
>
> Terminology:
>
> 1) Believe we have no name for this (except, of course, "machine" :-) ),
> even though Jack claims that this is called a node. Maybe sometimes it is
> called a node, but I believe "node" is more often used to refer to a Solr
> server JVM.
> 2) Node
> 3) Collection
> 4) Shard. Used to be called Slice, but I believe now it is officially
> called Shard. I agree with that change, because I believe most of the
> industry also uses the term Shard for this logical/non-physical concept -
> it just needs to be reflected across documentation and code.
> 5) Replica. Used to be called Shard, but I believe now it is officially
> called Replica. I certainly do not agree with the name Replica, because it
> suggests that it is a copy of an original, but it isn't. I would prefer
> Shard-instance here, to avoid the confusion. I understand that you can
> argue (if you argue long enough) that Replica is a fine name, but you
> really need the explanation to understand why Replica can be defended as
> the name for this. It is not immediately obvious what this is as long as
> it is called Replica. A Replica is basically a SolrCloud-managed Core, and
> behind every Replica/Core lives a physical Lucene index (so a Replica/Core
> contains/maintains a Lucene index behind the scenes). The term Replica
> also needs to be reflected across documentation and code.
>
> Regards, Per Steffensen

--
Walter Underwood
wun...@wunderwood.org
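[Editor's note: Per's five concepts, and the replicationFactor arithmetic
Walter and Per debate above, can be sketched in a few lines. This is an
illustrative model only, not Solr code; all names are made up.]

```python
# Illustrative model of the terminology above:
# Collection (logical store) -> Shard (logical slice) -> Replica (physical
# core holding one copy of that slice).

def layout(collection: str, num_shards: int, replication_factor: int):
    """Enumerate the cores a collection occupies. replicationFactor counts
    the TOTAL copies of each shard, so the collection occupies
    num_shards * replication_factor cores in all."""
    return [
        (collection, f"shard{s}", f"replica{r}")
        for s in range(1, num_shards + 1)
        for r in range(1, replication_factor + 1)
    ]

# replicationFactor=1 means ONE core per shard - no redundancy at all,
# which is exactly Per's complaint about the name.
assert len(layout("mycoll", 3, 1)) == 3
# replicationFactor=2 doubles that: 6 cores, two copies of each shard.
assert len(layout("mycoll", 3, 2)) == 6
```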


Re: Terminology question: Core vs. Collection vs...

2013-01-03 Thread Darren Govoni
I see. So sharding and distributing/replicating can have separate and 
different advantages.


On 01/03/2013 01:06 PM, Lance Norskog wrote:
Also, searching can be much faster if you put all of the shards on one 
machine, and the search distributor. That way, you search with 
multiple simultaneous threads inside one machine. I've seen this make 
searches several times faster.


On 01/03/2013 06:36 AM, Jack Krupansky wrote:
Ah... the multiple shards (of the same collection) in a single node 
is about planning for future expansion of your cluster - create more 
shards than you need today, put more of them on a single node and 
then migrate them to their own nodes as the data outgrows the smaller 
number of nodes. In other words, add nodes incrementally without 
having to reindex all the data.


-- Jack Krupansky

-Original Message- From: Darren Govoni
Sent: Thursday, January 03, 2013 9:18 AM
To: solr-user@lucene.apache.org
Subject: RE: Re: Terminology question: Core vs. Collection vs...

Yes. And it's worth noting that when you have multiple shards in a single
node (@deprecated), they are shards of different collections...



RE: Does SolrCloud supports MoreLikeThis?

2012-11-05 Thread Darren Govoni

There is a ticket for that with some recent activity (sorry, I don't have it
handy right now), but I'm not sure if that work made it into trunk, so
SolrCloud probably does not support MLT... yet. Would love an update from the
dev team, though!

--- Original Message ---
On 11/5/2012 10:37 AM Luis Cappa Banda wrote:
That's the question, :-)

Regards,

Luis Cappa.


Re: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and Realtime NRT available for download

2012-10-29 Thread Darren Govoni
It certainly seems to be a rogue project, but I can't understand the
meaning of "realtime near realtime (NRT)" either. At best, it's oxymoronic.



On 10/29/2012 10:30 AM, Jack Krupansky wrote:
Could any of the committers here confirm whether this is a legitimate
effort? I mean, how could anything labeled "Apache ABC with XYZ" be an
external project and be sanctioned/licensed by Apache? In fact, the
linked web page doesn't even acknowledge the ownership of the Apache
trademarks or the ASL. And the term "Realtime NRT" is nonsensical. Even
worse: "Realtime NRT makes available a near realtime view." Equally
nonsensical. Who knows, maybe it is legit, but it sure comes across as
a scam/spam.


-- Jack Krupansky

-Original Message- From: Nagendra Nagarajayya
Sent: Monday, October 29, 2012 10:06 AM
To: solr-user@lucene.apache.org
Subject: [Announce] Apache Solr 4.0 with RankingAlgorithm 1.4.4 and 
Realtime NRT available for download


Hi!

I am very excited to announce the availability of Apache Solr 4.0 with
RankingAlgorithm 1.4.4 and Realtime NRT. Realtime NRT is a high
performance and more granular NRT implementation compared to soft commit. The
update performance is about 70,000 documents / sec* (almost 1.5-2x
performance improvement over soft-commit). You can also scale up to 2
billion documents* in a single core, and query half a billion documents
index in ms**. Realtime NRT is different from realtime-get. realtime-get
does not have search capability and is a lookup by id. Realtime NRT
allows full search, see here http://solr-ra.tgels.org/realtime-nrt.jsp
for more info.

Realtime NRT has been contributed back to Solr, see JIRA:
https://issues.apache.org/jira/browse/SOLR-3816

RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or
boolean/dismax/boost queries and is compatible with the new Lucene 4.0 
api.


You can get more information about Solr 4.0 with RankingAlgorithm 1.4.4
and Realtime NRT performance from here:
http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x

You can download Solr 4.0 with RankingAlgorithm 1.4.4 from here:
http://solr-ra.tgels.org

Please download and give the new version a try.

Note:
1. Apache Solr 4.0 with RankingAlgorithm 1.4.4 is an external project

Regards,

Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

* performance is a real use case of Apache Solr with RankingAlgorithm as
seen at a user installation
** performance seen when using the age feature











Re: Cloud terminology clarification

2012-09-09 Thread Darren Govoni
I agree it needs updating and I've always gotten confused at some point
by
the use (misuse) of terms.

For example, the term 'node' is thrown around a lot too. What is it??!
Hehe.

On Sat, 2012-09-08 at 22:26 -0700, JesseBuesking wrote:

 It's been a while since the terminology at
 http://wiki.apache.org/solr/SolrTerminology has been updated, so I'm
 wondering how these terms apply to solr cloud setups.
 
 My take on what the terms mean:
 
 Collection: Basically the highest level container that bundles together the
 other pieces for servicing a particular search setup
 Core: An individual solr instance (represents entire indexes)
 Shard: A portion of a core (represents a subset of an index)
 
 Therefore:
 - increasing the number of shards allows for indexing more documents (aka
 scaling the amount of data that can be indexed)
 - increasing the number of cores increases the potential throughput of
 requests (aka cores mirror each other allowing you to distribute requests to
 multiple servers)
 
 Does this sound right?
 
 If so, then my follow up question would be does the following directory
 structure look right/standard?
 
 .../solr # = solr home
 .../solr/collection-01
 .../solr/collection-01/core-01
 .../solr/collection-01/core-02
 
 And if this is right, I'm on a roll :D
 
 My next question would then be:
 Given we're using zookeeper (separate machine), do we need 1 conf folder at
 collection-01's level?  Or do we need 1 conf folder per core?
 
 
 
 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Cloud-terminology-clarification-tp4006407.html
 Sent from the Solr - User mailing list archive at Nabble.com.




Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Darren Govoni
Of course you can do it, but the question is whether this will produce
the performance results you expect.
I've seen talk about this in other forums, so you might find some prior
work here.

Solr and HDFS serve somewhat different purposes. The key issue would be
if your map and reduce code
overloads the Solr endpoint. Even using SolrCloud, I believe all
requests will have to go through a single
URL (to be routed), so if you have thousands of map/reduce jobs all
running simultaneously, the question is whether
your Solr is architected to handle that amount of throughput.


On Thu, 2012-07-26 at 14:55 -0700, Trung Pham wrote:

 Is it possible to run map reduce jobs directly on Solr4?
 
 I'm asking this because I want to use Solr4 as the primary storage engine.
 And I want to be able to run near real time analytics against it as well.
 Rather than export solr4 data out to a hadoop cluster.




Re: Map/Reduce directly against solr4 index.

2012-07-26 Thread Darren Govoni
You raise an interesting possibility. A map/reduce solr handler over
solrcloud...

On Thu, 2012-07-26 at 18:52 -0700, Trung Pham wrote:

 I think the performance should be close to Hadoop running on HDFS, if
 somehow the Hadoop job can directly read the Solr index file while executing
 the job on the local Solr node.
 
 Kinda like how HBase and Cassandra integrate with Hadoop.
 
 Plus, we can run the map reduce job on a standby Solr4 cluster.
 
 This way, the documents in Solr will be our primary source of truth. And we
 have the ability to run near real time search queries and analytics on it.
 No need to export data around.
 
 Solr4 is becoming a very interesting solution to many web scale problems.
 Just missing the map/reduce component. :)
 




Re: [Announce] Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 with Realtime NRT available for download

2012-07-22 Thread Darren Govoni
What exactly is Realtime NRT (Near Real Time)?

On Sun, 2012-07-22 at 14:07 -0700, Nagendra Nagarajayya wrote:

 Hi!
 
 I am very excited to announce the availability of Solr 4.0-ALPHA with 
 RankingAlgorithm 1.4.4 with Realtime NRT. The Realtime NRT 
 implementation now supports both RankingAlgorithm and Lucene. Realtime 
 NRT is a high performance and more granular NRT implementation as to 
 soft commit. The update performance is about 70,000 documents / sec*. 
 You can also scale up to 2 billion documents* in a single core, and 
 query half a billion documents index in ms**.
 
 RankingAlgorithm 1.4.4 supports the entire Lucene Query Syntax, ± and/or 
 boolean queries and is compatible with the new Lucene 4.0-ALPHA api.
 
 You can get more information about Solr 4.0-ALPHA with RankingAlgorithm 
 1.4.4 Realtime performance from here:
 http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_4.x
 
 You can download Solr 4.0-ALPHA with RankingAlgorithm 1.4.4 from here:
 http://solr-ra.tgels.org
 
 Please download and give the new version a try.
 
 Regards,
 
 Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org
 
 * performance seen at a user installation of Solr 4.0 with 
 RankingAlgorithm 1.4.3
 ** performance seen when using the age feature
 




Re: Facet on all the dynamic fields with *_s feature

2012-07-16 Thread Darren Govoni
You'll have to query the index for the fields and sift out the _s ones
and cache them or something.
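[Editor's note: that sifting step can be sketched roughly as follows,
assuming the full field list has already been fetched, e.g. from the Luke
request handler; the field names are made up.]

```python
def dynamic_facet_params(all_fields, suffix="_s"):
    """Build facet.field parameters for every field matching *_s,
    emulating the proposed facet.field=*_s wildcard by hand."""
    facet_fields = sorted(f for f in all_fields if f.endswith(suffix))
    return [("facet", "true")] + [("facet.field", f) for f in facet_fields]

fields = ["id", "title", "color_s", "brand_s", "price_f"]
assert dynamic_facet_params(fields) == [
    ("facet", "true"),
    ("facet.field", "brand_s"),
    ("facet.field", "color_s"),
]
```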

On Mon, 2012-07-16 at 16:52 +0530, Rajani Maski wrote:

 Yes, This feature will solve the below problem very neatly.
 
 All,
 
  Is there any approach to achieve this for now?
 
 
 --Rajani
 
 On Sun, Jul 15, 2012 at 6:02 PM, Jack Krupansky 
 j...@basetechnology.comwrote:
 
  The answer appears to be No, but it's good to hear people express an
  interest in proposed features.
 
  -- Jack Krupansky
 
  -Original Message- From: Rajani Maski
  Sent: Sunday, July 15, 2012 12:02 AM
  To: solr-user@lucene.apache.org
  Subject: Facet on all the dynamic fields with *_s feature
 
 
  Hi All,
 
Is this issue fixed in solr 3.6 or 4.0:  Faceting on all Dynamic field
  with facet.field=*_s
 
  Link: https://issues.apache.org/jira/browse/SOLR-247
 
 
 
   If it is not fixed, any suggestion on how do I achieve this?
 
 
  My requirement is just the same as this one:
  http://lucene.472066.n3.nabble.com/Dynamic-facet-field-tc2979407.html#none
 
 
  Regards
  Rajani
 




Re: Solr Faceting

2012-07-07 Thread Darren Govoni
I don't think it comes at any added cost for Solr to return that facet, so
you can filter it out in your business logic.
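[Editor's note: the business-logic filter is a one-liner over Solr's
facet_fields response, which is a flat alternating value/count list. A
sketch; the sample values are made up.]

```python
def drop_facet_value(facet_counts, unwanted="NA"):
    """facet_counts is Solr's flat [value, count, value, count, ...] list;
    return it with the unwanted value (and its count) removed."""
    out = []
    for value, count in zip(facet_counts[::2], facet_counts[1::2]):
        if value != unwanted:
            out.extend([value, count])
    return out

assert drop_facet_value(["red", 7, "NA", 3, "blue", 2]) == ["red", 7, "blue", 2]
```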

On Sat, 2012-07-07 at 15:18 +0530, Shanu Jha wrote:

 Hi,
 
 
 I am generating a facet for a field which has one of the values "NA", and I
 want Solr to not create a facet for (or to ignore) this "NA" value. Is
 there any way in Solr to do that?
 
 Thanks




Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-28 Thread Darren Govoni
I don't recall anyone being able to get acceptable performance with a
single index that large with Solr/Lucene. The conventional wisdom is
that parallel searching across cores (or shards in SolrCloud) is the
best way to handle index sizes in the billions. So it's of great
interest how you did it.

Anyone else gotten an index(es) with billions of documents to perform
well? I'm greatly interested in how.
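[Editor's note: the parallel searching mentioned here is what Solr's
distributed search does when you pass the shards parameter; building such a
request can be sketched like this. Host and core names are hypothetical.]

```python
def distributed_query(shard_cores, q="*:*"):
    """Build query params for a distributed search that fans out across
    the listed cores and merges results (Solr's 'shards' parameter)."""
    return {
        "q": q,
        "shards": ",".join(shard_cores),  # each entry is host:port/solr/core
    }

params = distributed_query(
    ["host1:8983/solr/core1", "host2:8983/solr/core2"], q="title:solr"
)
assert params["shards"] == "host1:8983/solr/core1,host2:8983/solr/core2"
```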

On Mon, 2012-05-28 at 05:12 -0700, Nagendra Nagarajayya wrote:
 It is a single node. I am trying to find out if the performance can be 
 referenced.
 
 Regarding information on Solr with RankingAlgorithm, you can find all 
 the information here:
 
 http://solr-ra.tgels.org
 
 On RankingAlgorithm:
 
 http://rankingalgorithm.tgels.org
 
 Regards,
 - NN
 
 On 5/27/2012 4:50 PM, Li Li wrote:
  yes, I am also interested in good performance with 2 billion docs. How
  many search nodes do you use? What's the average response time and QPS?
 
  Another question: where can I find related papers or resources on your
  algorithm which explain the algorithm in detail? Why is it better than
  Google's site search? (Better than Lucene is not very interesting,
  because Lucene was not originally designed to provide search
  functionality like Google's.)
 




Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-27 Thread Darren Govoni
Hi,
  Have you tested this with a billion documents?

Darren

On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
 Hi!
 
 I am very excited to announce the availability of Solr 3.6 with 
 RankingAlgorithm 1.4.2.
 
 This NRT support now works with both RankingAlgorithm and Lucene. The 
 insert/update performance should be about 5000 docs in about 490 ms with 
 the MbArtists Index.
 
 RankingAlgorithm 1.4.2 has multiple algorithms, improved performance 
 over the earlier releases, supports the entire Lucene Query Syntax, +/- 
 and/or boolean queries, and can scale to more than a billion documents.
 
 You can get more information about NRT performance from here:
 http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
 
 You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
 http://solr-ra.tgels.org
 
 Please download and give the new version a try.
 
 Regards,
 
 Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org
 
 ps. MbArtists index is the example index used in the Solr 1.4 Enterprise 
 Book
 




Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-27 Thread Darren Govoni
I think people on this list would be more interested in your approach to
scaling 2 billion documents than modifying solr/lucene scoring (which is
already top notch). So given that, can you share any references or
otherwise substantiate good performance with 2 billion documents?

Thanks.

On Sun, 2012-05-27 at 08:29 -0700, Nagendra Nagarajayya wrote:
 Actually, RankingAlgorithm 1.4.2 has been scaled to more than 2 billion 
 docs. With RankingAlgorithm 1.4.3, using the parameters 
 age=latest&docs=number feature, you can retrieve the NRT inserted 
 documents in milliseconds from such a huge index improving query and 
 faceting performance and using very little resources ...
 
 Currently, RankingAlgorithm 1.4.3 is only available with Solr 4.0, and 
 the NRT insert performance with Solr 4.0 is about 70,000 docs / sec. 
 RankingAlgorithm 1.4.3 should become available with Solr 3.6 soon.
 
 Regards,
 
 Nagendra Nagarajayya
 http://solr-ra.tgels.org
 http://rankingalgorithm.tgels.org
 
 
 
 On 5/27/2012 7:32 AM, Darren Govoni wrote:
  Hi,
 Have you tested this with a billion documents?
 
  Darren
 
  On Sun, 2012-05-27 at 07:24 -0700, Nagendra Nagarajayya wrote:
  Hi!
 
  I am very excited to announce the availability of Solr 3.6 with
  RankingAlgorithm 1.4.2.
 
  This NRT support now works with both RankingAlgorithm and Lucene. The
  insert/update performance should be about 5000 docs in about 490 ms with
  the MbArtists Index.
 
  RankingAlgorithm 1.4.2 has multiple algorithms, improved performance
  over the earlier releases, supports the entire Lucene Query Syntax, +/-
  and/or boolean queries, and can scale to more than a billion documents.
 
  You can get more information about NRT performance from here:
  http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
 
  You can download Solr 3.6 with RankingAlgorithm 1.4.2 from here:
  http://solr-ra.tgels.org
 
  Please download and give the new version a try.
 
  Regards,
 
  Nagendra Nagarajayya
  http://solr-ra.tgels.org
  http://rankingalgorithm.tgels.org
 
  ps. MbArtists index is the example index used in the Solr 1.4 Enterprise
  Book
 
 
 
 
 




SolrCloud war context name?

2012-05-26 Thread Darren Govoni
Hi,
 I am running my solrcloud nodes in an app server deployed into the
context path 'solr' and zookeeper sees all of them. I want to deploy a
second solrcloud war into the same app server (thus same IP:port) in a
different context like 'solrrep' with the same config (cloned).

Will this work? Or does zookeeper (or the solrcloud leader) require all
connected solr shards to have a context URL of ip:port/solr? Or will the
correct URL be registered from the replica shard?

thanks!




Re: SolrCloud war context name?

2012-05-26 Thread Darren Govoni
It's not really clear from the wiki how to use cores as shard replicas
within the same solr server. In my mind, having a separate JVM (solr
node) acting as a replica makes sense because the replication traffic
will be on a different channel in a different VM and won't interfere
with search/indexing traffic on the primary shards.

Or am I missing something easy about using cores with solr cloud? 
It was mentioned on the list recently that managing cores with solrcloud
isn't really the best practice for it.

On Sat, 2012-05-26 at 16:12 -0300, Marcelo Carvalho Fernandes wrote:
 Why not using multicore?
 
 
 Marcelo Carvalho Fernandes
 +55 21 8272-7970
 
 
 
 On Sat, May 26, 2012 at 12:56 PM, Darren Govoni ontre...@ontrenet.com wrote:
 
  Hi,
   I am running my solrcloud nodes in an app server deployed into the
  context path 'solr' and zookeeper sees all of them. I want to deploy a
  second solrcloud war into the same app server (thus same IP:port) in a
  different context like 'solrrep' with the same config (cloned).
 
  Will this work? Or does zookeeper (or solrcloud leader) require all
  connected solr shards to have context url with ip:port/solr? Or will the
  correct URL be registered from the replica shard?
 
  thanks!
 
 
 




RE: Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-22 Thread Darren Govoni

I'm curious what the solrcloud experts say, but my suggestion is to try not to 
over-engineer the search architecture on solrcloud. For example, what is 
the benefit of managing which cores are indexed and searched? Having to know 
those details, in my mind, works against the automation in solrcloud, but maybe 
there's a good reason you want to do it this way.

--- Original Message ---
On 5/22/2012 07:35 AM Yandong Yao wrote:
Hi Darren,

Thanks very much for your reply.

The reason I want to control core indexing/searching is that I want to
use one core to store one customer's data (all customers share the same
config): such as customer 1 uses coreForCustomer1 and customer 2
uses coreForCustomer2.

Is there any better way than using a different core for each customer?

Another way might be to use a different collection for each customer, though
I'm not sure how many collections SolrCloud could support. Which way is better
in terms of flexibility/scalability? (Suppose there are tens of thousands of
customers.)

Regards,
Yandong
2012/5/22 Darren Govoni dar...@ontrenet.com

 Why do you want to control what gets indexed into a core and then
 knowing what core to search? That's the kind of knowing that SolrCloud
 solves. In SolrCloud, it handles the distribution of documents across
 shards and retrieves them regardless of which node is searched from.
 That is the point of cloud, you don't know the details of where
 exactly documents are being managed (i.e. they are cloudy). It can
 change and re-balance from time to time. SolrCloud performs the
 distributed search for you, therefore when you try to search a node/core
 with no documents, all the results from the cloud are retrieved
 regardless. This is considered A Good Thing.

 It requires a change in thinking about indexing and searching

 On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
  Hi Guys,
 
  I use following command to start solr cloud according to solr cloud wiki.
 
  yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
  -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
  yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar
  start.jar
 
  Then I have created several cores using CoreAdmin API (
  http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1),
  and clusterstate.json show following topology:
 
 
  collection1:
  -- shard1:
    -- collection1
    -- CoreForCustomer1
    -- CoreForCustomer3
    -- CoreForCustomer5
  -- shard2:
    -- collection1
    -- CoreForCustomer2
    -- CoreForCustomer4
 
 
  1) Index:
 
  Using following command to index mem.xml file in exampledocs directory.
 
  yydzero:exampledocs bjcoe$ java -Durl=
  http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
  SimplePostTool: version 1.4
  SimplePostTool: POSTing files to
  http://localhost:8983/solr/coreForCustomer3/update..
  SimplePostTool: POSTing file mem.xml
  SimplePostTool: COMMITting Solr index changes.
 
  And now SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
  'coreForCustomer5' has 3 documents (mem.xml has 3 documents) and other 2
  core has 0 documents.
 
  *Question 1:*  Is this expected behavior? How do I to index documents into
  a specific core?
 
  *Question 2*:  If SolrCloud don't support this yet, how could I extend it
  to support this feature (index document to particular core), where should i
  start, the hashing algorithm?
 
  *Question 3*:  Why the documents are also indexed into 'coreForCustomer1'
  and 'coreForCustomer5'?  The default replica for documents are 1, right?
 
  Then I try to index some document to 'coreForCustomer2':
 
  $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
  post.jar ipod_video.xml
 
  While 'coreForCustomer2' still have 0 documents and documents in ipod_video
  are indexed to core for customer 1/3/5.
 
  *Question 4*:  Why this happens?
 
  2) Search: I use
  http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to
  search against 'CoreForCustomer2', while it will return all documents in
  the whole collection even though this core has no documents at all.
 
  Then I use
  http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2,
  and it will return 0 documents.
 
  *Question 5*: So If want to search against a particular core, we need to
  use 'shards' parameter and use solrCore name as parameter value, right?
 
 
  Thanks very much in advance!
 
  Regards,
  Yandong


Re: SolrCloud: how to index documents into a specific core and how to search against that core?

2012-05-21 Thread Darren Govoni
Why do you want to control what gets indexed into a core and then
knowing what core to search? That's the kind of knowing that SolrCloud
solves. In SolrCloud, it handles the distribution of documents across
shards and retrieves them regardless of which node is searched from.
That is the point of cloud, you don't know the details of where
exactly documents are being managed (i.e. they are cloudy). It can
change and re-balance from time to time. SolrCloud performs the
distributed search for you, therefore when you try to search a node/core
with no documents, all the results from the cloud are retrieved
regardless. This is considered A Good Thing.

It requires a change in thinking about indexing and searching
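The placement behavior described here can be illustrated with a toy model of hash-based routing. This is only a simplified sketch — Solr's real implementation hashes the document's uniqueKey with its own hash function over shard hash ranges — and the shard names and document ids below are made up:

```python
import hashlib

def pick_shard(doc_id, shards):
    """Simplified stand-in for SolrCloud's routing: the uniqueKey is
    hashed and mapped onto the shard list, so the client never chooses
    the shard itself."""
    h = int(hashlib.md5(doc_id.encode("utf-8")).hexdigest(), 16)
    return shards[h % len(shards)]

shards = ["shard1", "shard2"]
placement = {doc: pick_shard(doc, shards)
             for doc in ["SP2514N", "F8V7067-APL-KIT", "MA147LL/A"]}
# Every document lands on exactly one shard, deterministically,
# regardless of which node received the update request.
assert all(s in shards for s in placement.values())
```

The same determinism is why sending a document to a particular core has no effect on where it ends up: the hash of its id, not the request URL, decides the shard.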

On Tue, 2012-05-22 at 08:43 +0800, Yandong Yao wrote:
 Hi Guys,
 
 I use the following commands to start Solr Cloud according to the Solr Cloud wiki.
 
 yydzero:example bjcoe$ java -Dbootstrap_confdir=./solr/conf
 -Dcollection.configName=myconf -DzkRun -DnumShards=2 -jar start.jar
 yydzero:example2 bjcoe$ java -Djetty.port=7574 -DzkHost=localhost:9983 -jar
 start.jar
 
 Then I have created several cores using the CoreAdmin API (
 http://localhost:8983/solr/admin/cores?action=CREATE&name=coreName&collection=collection1),
 and clusterstate.json shows the following
 topology:
 
 
 collection1:
 -- shard1:
   -- collection1
   -- CoreForCustomer1
   -- CoreForCustomer3
   -- CoreForCustomer5
 -- shard2:
   -- collection1
   -- CoreForCustomer2
   -- CoreForCustomer4
 
 
 1) Index:
 
 Using the following command to index the mem.xml file in the exampledocs directory.
 
 yydzero:exampledocs bjcoe$ java -Durl=
 http://localhost:8983/solr/coreForCustomer3/update -jar post.jar mem.xml
 SimplePostTool: version 1.4
 SimplePostTool: POSTing files to
 http://localhost:8983/solr/coreForCustomer3/update..
 SimplePostTool: POSTing file mem.xml
 SimplePostTool: COMMITting Solr index changes.
 
 And now the SolrAdmin UI shows that 'coreForCustomer1', 'coreForCustomer3',
 and 'coreForCustomer5' have 3 documents (mem.xml has 3 documents) and the other 2
 cores have 0 documents.
 
 *Question 1:*  Is this expected behavior? How do I index documents into
 a specific core?
 
 *Question 2*:  If SolrCloud doesn't support this yet, how could I extend it
 to support this feature (indexing documents to a particular core)? Where should I
 start, the hashing algorithm?
 
 *Question 3*:  Why are the documents also indexed into 'coreForCustomer1'
 and 'coreForCustomer5'?  The default replication factor for documents is 1, right?
 
 Then I try to index some documents to 'coreForCustomer2':
 
 $ java -Durl=http://localhost:8983/solr/coreForCustomer2/update -jar
 post.jar ipod_video.xml
 
 'coreForCustomer2' still has 0 documents, and the documents in ipod_video
 are indexed to the cores for customers 1/3/5.
 
 *Question 4*:  Why does this happen?
 
 2) Search: I use
 http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml to
 search against 'CoreForCustomer2', but it returns all documents in
 the whole collection even though this core has no documents at all.
 
 Then I use
 http://localhost:8983/solr/coreForCustomer2/select?q=*%3A*&wt=xml&shards=localhost:8983/solr/coreForCustomer2,
 and it returns 0 documents.
 
 *Question 5*: So if I want to search against a particular core, I need to
 use the 'shards' parameter with the Solr core name as the value, right?
 
 
 Thanks very much in advance!
 
 Regards,
 Yandong
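As a footnote to Question 5: the shards parameter simply rides along with q and wt on the query string. A sketch of building such a request URL with Python's standard library, using the host and core names from the example above:

```python
from urllib.parse import urlencode

base = "http://localhost:8983/solr/coreForCustomer2/select"
params = {
    "q": "*:*",
    "wt": "xml",
    # Restrict the distributed search to a single core instead of
    # letting SolrCloud fan the query out across the whole collection.
    "shards": "localhost:8983/solr/coreForCustomer2",
}
# urlencode percent-escapes the values and joins them with '&'.
url = base + "?" + urlencode(params)
assert "wt=xml" in url
```

Note this only constructs the URL; issuing the request against a live Solr instance is left out on purpose.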




Re: Distributed search between solrclouds?

2012-05-18 Thread Darren Govoni
The thought here is to distribute a search between two different
solrcloud clusters and get ordered, ranked results between them.
Is that possible?

On Tue, 2012-05-15 at 18:47 -0400, Darren Govoni wrote:
 Hi,
   Would distributed search (the old way where you provide the solr host
 IP's etc.) still work between different solrclouds?
 
 thanks,
 Darren
 




Distributed search between solrclouds?

2012-05-15 Thread Darren Govoni
Hi,
  Would distributed search (the old way where you provide the solr host
IP's etc.) still work between different solrclouds?

thanks,
Darren



Re: Documents With large number of fields

2012-05-13 Thread Darren Govoni
Was there a response to this? 

On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote:
 Hi,
 
 My data model consists of different types of data. Each data type has its own 
 characteristics.
 
 If I include the unique characteristics of each type of data, my single Solr 
 document could end up containing 300-400 fields.
 
 In order to drill down into this data set I would have to provide faceting on 
 most of these fields so that I can drill down to a very small set of
 documents.
 
 Here are some of the questions:
 
 1) What's the best approach when dealing with documents with a large number of 
 fields?
 Should I keep a single document with a large number of fields, or split my
 document into a number of smaller documents where each document would 
 consist of some of the fields?
 
 2) From an operational point of view, what's the drawback of having a single 
 document with a very large number of fields?
 Can Solr support documents with a large number of fields (say 300 to 400)?
 
 
 Thanks.
 
 Regards,
 
 Nitin Keswani
 




Re: Documents With large number of fields

2012-05-04 Thread Darren Govoni
I'm also interested in this. Same situation.

On Fri, 2012-05-04 at 10:27 -0400, Keswani, Nitin - BLS CTR wrote:
 Hi,
 
 My data model consists of different types of data. Each data type has its own 
 characteristics.
 
 If I include the unique characteristics of each type of data, my single Solr 
 document could end up containing 300-400 fields.
 
 In order to drill down into this data set I would have to provide faceting on 
 most of these fields so that I can drill down to a very small set of
 documents.
 
 Here are some of the questions:
 
 1) What's the best approach when dealing with documents with a large number of 
 fields?
 Should I keep a single document with a large number of fields, or split my
 document into a number of smaller documents where each document would 
 consist of some of the fields?
 
 2) From an operational point of view, what's the drawback of having a single 
 document with a very large number of fields?
 Can Solr support documents with a large number of fields (say 300 to 400)?
 
 
 Thanks.
 
 Regards,
 
 Nitin Keswani
 




SolrCloud indexing question

2012-04-20 Thread Darren Govoni
Hi,
  I just wanted to make sure I understand how distributed indexing works
in solrcloud.

Can I index locally at each shard to avoid throttling a central port? Or
all the indexing has to go through a single shard leader?

thanks




Re: SolrCloud indexing question

2012-04-20 Thread Darren Govoni
Gotcha.

Now does that mean if I have 5 threads all writing to a local shard,
will that shard piggyback those index requests onto a SINGLE connection
to the leader? Or will they spawn 5 connections from the shard to the
leader? I really hope the former; the latter won't scale well.

On Fri, 2012-04-20 at 10:28 -0400, Jamie Johnson wrote:
 my understanding is that you can send your updates/deletes to any
 shard and they will be forwarded to the leader automatically.  That
 being said your leader will always be the place where the index
 happens and then distributed to the other replicas.
 
 On Fri, Apr 20, 2012 at 7:54 AM, Darren Govoni dar...@ontrenet.com wrote:
  Hi,
   I just wanted to make sure I understand how distributed indexing works
  in solrcloud.
 
  Can I index locally at each shard to avoid throttling a central port? Or
  all the indexing has to go through a single shard leader?
 
  thanks
 
 
 




Re: Opposite to MoreLikeThis?

2012-04-20 Thread Darren Govoni
You could run the MLT for the document in question, then gather all
those doc ids in the MLT results and negate those in a subsequent
query. Not sure how well that would work with very large result sets,
but something to try.

Another approach would be to gather the interesting terms from the
document in question and then negate those terms in subsequent queries.
Perhaps with many negated terms, Solr will rank the results based on
most negated terms above less negated terms, simulating a ranked less
like effect.

On Fri, 2012-04-20 at 15:38 -0700, Charlie Maroto wrote:
 Hi all,
 
 Is there a way to implement the opposite to MoreLikeThis (LessLikeThis, I
 guess :).  The requirement we have is to remove all documents with content
 like that of a given document id or a text provided by the end-user.  In
 the current index implementation (not using Solr), the user can narrow
 results by indicating what document(s) are not relevant to him and then
 request to remove from the search results any document whose content is
 like that of the selected document(s)
 
 Our index has close to 100 million documents and they cover multiple topics
 that are not related to one another.  So, a search for some broad terms may
 retrieve documents about engineering, agriculture, communications, etc.  As
 the user is trying to discover the relevant documents, he may select an
 agriculture-related document to exclude it and those documents like it from
 the results set; same w/ engineering-like content, etc. until most of the
 documents are about communications.
 
 Of course, some exclusions may actually remove relevant content but those
 filters can be removed to go back to the previous set of results.
 
 Any ideas from similar implementations or suggestions are welcomed!
 Thanks,
 Carlos




Re: hierarchical faceting?

2012-04-18 Thread Darren Govoni
Put the parent term in all the child documents at index time,
and then re-issue the facet query when you expand the parent using the
parent's term. Works perfectly.
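Concretely, putting the parent term in all the child documents can mean expanding each path into its ancestor prefixes before indexing. A small sketch of that expansion, using the / delimiter from the examples in this thread:

```python
def expand_path(path, delimiter="/"):
    """Turn 'red/pink' into ['red', 'red/pink'] so a filter on the
    parent term also matches every child document."""
    parts = path.split(delimiter)
    return [delimiter.join(parts[:i + 1]) for i in range(len(parts))]

doc1_colors = expand_path("red")       # ['red']
doc2_colors = expand_path("red/pink")  # ['red', 'red/pink']
# fq=colors:red now matches both docs; fq=colors:red/pink matches only Doc2.
assert "red" in doc1_colors and "red" in doc2_colors
assert "red/pink" not in doc1_colors
```

The expanded values go into a multiValued string field, so no special tokenizer is needed at query time.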

On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote:
 I have hierarchical colors:
 field name=colors type=text_pathindexed=true
 stored=true multiValued=true/
 text_path is TextField with PathHierarchyTokenizerFactory as tokenizer.
 
 Given these two documents,
 Doc1: red
 Doc2: red/pink
 
 I want the result to be the following:
 ?fq=red
 == Doc1, Doc2
 
 ?fq=red/pink
 == Doc2
 
 But, with PathHierarchyTokenizer, Doc1 is included for the query:
 ?fq=red/pink
 == Doc1, Doc2
 
 How can I query for hierarchical facets?
 http://wiki.apache.org/solr/HierarchicalFaceting describes facet.prefix..
 But it looks too cumbersome to me.
 
 Is there a simpler way to implement hierarchical facets?




Re: hierarchical faceting?

2012-04-18 Thread Darren Govoni
I don't use any of that stuff in my app, so not sure how it works.

I just manage my taxonomy outside of solr at index time and don't need
any special fields or tokenizers. I use a string field type and insert
the proper field at index time and query it normally. Nothing special
required.

On Wed, 2012-04-18 at 13:00 -0400, sam ” wrote:
 It looks like TextField is the problem.
 
 This fixed:
 fieldType name=text_path class=solr.TextField
 positionIncrementGap=100
   analyzer type=index
   tokenizer class=solr.PathHierarchyTokenizerFactory
 delimiter=//
   /analyzer
   analyzer type=query
   tokenizer class=solr.WhitespaceTokenizerFactory/
   /analyzer
 /fieldType
 
 I am assuming the text_path fields won't include whitespace characters.
 
 ?q=colors:red/pink
 == Doc2   (Doc1, which has colors = red isn't included!)
 
 
 Is there a tokenizer that tokenizes the string as one token?
 I tried to extend Tokenizer myself  but it fails:
 public class AsIsTokenizer extends Tokenizer {
 @Override
 public boolean incrementToken() throws IOException {
 return true;//or false;
 }
 }
 
 
 On Wed, Apr 18, 2012 at 11:33 AM, sam ” skyn...@gmail.com wrote:
 
  Yah, that's exactly what PathHierarchyTokenizer does.
  fieldType name=text_path class=solr.TextField
  positionIncrementGap=100
analyzer type=index
  tokenizer class=solr.PathHierarchyTokenizerFactory/
/analyzer
  /fieldType
 
  I think I have a query time tokenizer that tokenizes at /
 
  ?q=colors:red
  == Doc1, Doc2
 
  ?q=colors:redfoobar
  ==
 
  ?q=colors:red/foobarasdfoaijao
  == Doc1, Doc2
 
 
 
 
  On Wed, Apr 18, 2012 at 11:10 AM, Darren Govoni dar...@ontrenet.com wrote:
 
  Put the parent term in all the child documents at index time
  and the re-issue the facet query when you expand the parent using the
  parent's term. works perfect.
 
  On Wed, 2012-04-18 at 10:56 -0400, sam ” wrote:
   I have hierarchical colors:
   field name=colors type=text_pathindexed=true
   stored=true multiValued=true/
   text_path is TextField with PathHierarchyTokenizerFactory as tokenizer.
  
   Given these two documents,
   Doc1: red
   Doc2: red/pink
  
   I want the result to be the following:
   ?fq=red
   == Doc1, Doc2
  
   ?fq=red/pink
   == Doc2
  
   But, with PathHierarchyTokenizer, Doc1 is included for the query:
   ?fq=red/pink
   == Doc1, Doc2
  
   How can I query for hierarchical facets?
   http://wiki.apache.org/solr/HierarchicalFaceting describes
  facet.prefix..
   But it looks too cumbersome to me.
  
   Is there a simpler way to implement hierarchical facets?
 
 
 
 




Re: Monitoring SolrCloud health

2012-04-14 Thread Darren Govoni
Can you be more specific about health?

On Sat, 2012-04-14 at 00:03 -0400, Jamie Johnson wrote:
 How do people currently monitor the health of a solr cluster?  Are
 there any good tools which can show the health across the entire
 cluster?  Is this something which is planned for the new admin user
 interface?
 




RE: Realtime /get versus SearchHandler

2012-04-13 Thread Darren Govoni

Yes

--- Original Message ---
On 4/13/2012 06:25 AM Benson Margulies wrote:
A discussion over on the dev list led me to expect that the by-id
field retrievals in a SolrCloud query would come through the get
handler. In fact, I've seen them turn up in my search component in the
search handler that is configured with my custom QT. (I have a
'prepare' method that sets ShardParams.QT to my QT to get my
processing involved in the first of the two queries.) Did I overthink
this?

Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Darren Govoni
You could use SolrCloud (for the automatic scaling) and just mount a
fuse[1] HDFS directory and configure solr to use that directory for its
data. 

[1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
 Hi,
 
 I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
 using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
 crawled + indexed every *4 weeks, *with a search latency of less than 0.5
 seconds.
 
 Needless to mention, the search index needs to scale to 5Billion pages. It
 is also possible that I might need to store multiple indexes -- one for
 crawled content, and one for ancillary data that is also very large. Each
 of these indices would likely require a logically distributed and
 replicated index.
 
 However, I would like for such a system to be homogenous with the Hadoop
 infrastructure that is already installed on the cluster (for the crawl). In
 other words, I would much prefer if the replication and distribution of the
 Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
 using another scalability framework (such as SolrCloud). In addition, it
 would be ideal if this environment was flexible enough to be dynamically
 scaled based on the size requirements of the index and the search traffic
 at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
 enough to automatically provision additional processing power into the
 cluster without requiring server re-starts).
 
 However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
 be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
 Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
 mature enough and would be the right architectural choice to go along with
 a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
 above.
 
 Lastly, how much hardware (assuming a medium sized EC2 instance) would you
 estimate my needing with this setup, for regular web-data (HTML text) at
 this scale?
 
 Any architectural guidance would be greatly appreciated. The more details
 provided, the wider my grin :).
 
 Many many thanks in advance.
 
 Thanks,
 Safdar




RE: Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment

2012-04-12 Thread Darren Govoni

Solrcloud or any other tech-specific replication isn't going to 'just work' with 
hadoop replication. But with some significant custom coding anything should be 
possible. Interesting idea.

--- Original Message ---
On 4/12/2012 09:21 AM Ali S Kureishy wrote:
Thanks Darren.

Actually, I would like the system to be homogenous - i.e., use Hadoop based
tools that already provide all the necessary scaling for the lucene index
(in terms of throughput, latency of writes/reads etc). Since SolrCloud adds
its own layer of sharding/replication that is outside Hadoop, I feel that
using SolrCloud would be redundant, and a step in the opposite
direction, which is what I'm trying to avoid in the first place. Or am I
mistaken?

Thanks,
Safdar


On Thu, Apr 12, 2012 at 4:27 PM, Darren Govoni dar...@ontrenet.com wrote:

 You could use SolrCloud (for the automatic scaling) and just mount a
 fuse[1] HDFS directory and configure solr to use that directory for its
 data.

 [1] https://ccp.cloudera.com/display/CDHDOC/Mountable+HDFS

 On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
  Hi,
 
  I'm trying to setup a large scale *Crawl + Index + Search *infrastructure
  using Nutch and Solr/Lucene. The targeted scale is *5 Billion web pages*,
  crawled + indexed every *4 weeks, *with a search latency of less than 0.5
  seconds.
 
  Needless to mention, the search index needs to scale to 5Billion pages. It
  is also possible that I might need to store multiple indexes -- one for
  crawled content, and one for ancillary data that is also very large. Each
  of these indices would likely require a logically distributed and
  replicated index.
 
  However, I would like for such a system to be homogenous with the Hadoop
  infrastructure that is already installed on the cluster (for the crawl). In
  other words, I would much prefer if the replication and distribution of the
  Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
  using another scalability framework (such as SolrCloud). In addition, it
  would be ideal if this environment was flexible enough to be dynamically
  scaled based on the size requirements of the index and the search traffic
  at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
  enough to automatically provision additional processing power into the
  cluster without requiring server re-starts).
 
  However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
  be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
  Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
  mature enough and would be the right architectural choice to go along with
  a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
  above.
 
  Lastly, how much hardware (assuming a medium sized EC2 instance) would you
  estimate my needing with this setup, for regular web-data (HTML text) at
  this scale?
 
  Any architectural guidance would be greatly appreciated. The more details
  provided, the wider my grin :).
 
  Many many thanks in advance.
 
  Thanks,
  Safdar


Re: I've broken delete in SolrCloud and I'm a bit clueless as to how

2012-04-11 Thread Darren Govoni
Hard to say why it's not working for you. Start with a fresh Solr and
work forward from there, or back out your configs and plugins until it
works again.

On Tue, 2012-04-10 at 17:15 -0400, Benson Margulies wrote:
 In my cloud configuration, if I push
 
 delete
   query*:*/query
 /delete
 
 followed by:
 
 commit/
 
 I get no errors, the log looks happy enough, but the documents remain
 in the index, visible to /query.
 
 Here's what seems my relevant bit of solrconfig.xml. My URP only
 implements processAdd.
 
updateRequestProcessorChain name=RNI
 !-- some day, add parameters when we have some --
 processor 
 class=com.basistech.rni.solr.NameIndexingUpdateRequestProcessorFactory/
 processor class=solr.LogUpdateProcessorFactory /
 processor class=solr.DistributedUpdateProcessorFactory/
 processor class=solr.RunUpdateProcessorFactory /
   /updateRequestProcessorChain
 
 !-- activate RNI processing by adding the RNI URP to the chain
 for xml updates --
   requestHandler name=/update
   class=solr.XmlUpdateRequestHandler
 lst name=defaults
   str name=update.chainRNI/str
 /lst
 /requestHandler
 




RE: SOLR issue - too many search queries

2012-04-10 Thread Darren Govoni

My first reaction to your question is why are you running thousands of queries 
in a loop? Immediately, I think this will not scale well and the design 
probably needs to be re-visited.

Second, if you need that many requests, then you need to seriously consider an 
architecture that supports it. This will require a complex design involving 
load balancers, multiple servers, replication, etc. People have achieved this 
with Solr, but it's beyond the scope of Solr itself to provide this, as it's a 
matter of system architecture.

Also, there are limits to the number of app server threads allowed, OS threads 
allowed, OS sockets, OS file descriptors, etc. etc. All of which need to be 
understood, designed for and configured properly.
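One way to cut the request count dramatically (a sketch; the field name and batch size are assumptions, and real code would also need URL-encoding plus HTTP connection reuse) is to OR groups of keywords into a single Solr query instead of sending one request per keyword:

```python
# Sketch: collapse thousands of per-keyword requests into a few
# batched boolean queries. Field name and batch size are illustrative.

def batch_queries(keywords, field="title", batch_size=50):
    """Yield one Solr q string per batch of keywords."""
    for i in range(0, len(keywords), batch_size):
        batch = keywords[i:i + batch_size]
        yield " OR ".join(f'{field}:"{kw}"' for kw in batch)

keywords = [f"keyword{n}" for n in range(1000)]
queries = list(batch_queries(keywords))
print(len(queries))   # 20 requests instead of 1000
```

Fewer, larger requests avoid exhausting ephemeral ports ("Cannot assign requested address" is usually the client running out of sockets, not Solr refusing work).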


--- Original Message ---
On 4/10/2012 07:51 AM arunssasidhar wrote:
We have a PHP web application which is using SOLR for searching. The app is
using cURL to connect to the SOLR server, and it runs in a loop with
thousands of predefined keywords. That creates thousands of different
search queries to SOLR at a given time.

My issue is that when a single user is logged into the app, everything
works as expected. When more than one user tries to run the app, we get
this response from the server:

Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address
Failed to connect to xxx.xxx.xxx.xxx: Cannot assign requested address
Failed ...

Our assumption is that the SOLR server is unable to handle this many search
queries at a given time. If so, what is the solution to overcome this? Is
there any setting like keep-alive in SOLR?

Any help would be highly appreciated.

Thanks,

Arun S

--
View this message in context:
http://lucene.472066.n3.nabble.com/SOLR-issue-too-many-search-queries-tp3899518p3899518.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Re: Cloud-aware request processing?

2012-04-09 Thread Darren Govoni

...it is a distributed real-time query scheme...

SolrCloud does this already. It treats all the shards like one big index, and you can 
query it normally to get subset results from each shard. Why do you have to 
rewrite the query for each shard? Seems unnecessary.

--- Original Message ---
On 4/9/2012 08:45 AM Benson Margulies wrote:
Jan Høydahl,

My problem is intimately connected to Solr. It is not a batch job for
Hadoop, it is a distributed real-time query scheme. I hate to add yet
another complex framework if a Solr RP can do the job simply.

For this problem, I can transform a Solr query into a subset query on
each shard, and then let the SolrCloud mechanism.

I am well aware of the 'zoo' of alternatives, and I will be evaluating
them if I can't get what I want from Solr.

On Mon, Apr 9, 2012 at 9:34 AM, Jan Høydahl jan@cominvent.com wrote:
 Hi,

 Instead of using Solr, you may want to have a look at Hadoop or another
 framework for distributed computation, see e.g.
 http://java.dzone.com/articles/comparison-gridcloud-computing

 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com

 On 9. apr. 2012, at 13:41, Benson Margulies wrote:

 I'm working on a prototype of a scheme that uses SolrCloud to, in
 effect, distribute a computation by running it inside of a request
 processor.

 If there are N shards and M operations, I want each node to perform
 M/N operations. That, of course, implies that I know N.

 Is that fact available anyplace inside Solr, or do I need to just
 configure it?
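Benson's M/N split can be sketched without any Solr machinery (a sketch; in a real request processor, N would come from the cluster state, which is exactly the value being asked about):

```python
# Sketch: node k of N takes every N-th operation, so M operations
# spread as evenly as possible across N shards.

def my_operations(operations, node_index, num_shards):
    """Return the slice of work assigned to one node."""
    return [op for i, op in enumerate(operations) if i % num_shards == node_index]

ops = list(range(10))          # M = 10 operations
shards = 4                     # N = 4 shards
assigned = [my_operations(ops, k, shards) for k in range(shards)]
print(assigned)                # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
# Every operation lands on exactly one node:
print(sorted(sum(assigned, [])) == ops)   # True
```

The modulo scheme guarantees each operation is handled exactly once, which is why knowing N inside the request processor matters.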


Re: How to facet data from a multivalued field?

2012-04-09 Thread Darren Govoni
The field type for that field should be looked at.
Try not using a field type that tokenizes or stems the field.
You want to leave the text as is (the 'string' field type does this);
it's documented in the schema somewhere.
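Concretely, a hedged sketch of what this looks like in schema.xml (the `series` field name is from the thread; the untokenized `series_exact` copy, its name, and the `text_general` type are illustrative): facet on a `string`-typed copy of the field rather than the analyzed one.

```xml
<!-- Keep the analyzed field for full-text search, and add an
     untokenized copy for faceting. "series_exact" is an
     illustrative name, not from the thread. -->
<field name="series" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="series_exact" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="series" dest="series_exact"/>
<!-- then facet with facet.field=series_exact -->
```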


On Mon, 2012-04-09 at 13:02 -0700, Thiago wrote:
 Hello everybody,
 
 I've already searched this topic in the forum, but I didn't find any
 case like this. I apologize if this topic has already been discussed.
 
 I'm having a problem faceting a multivalued field. My field is called
 series, and it has names of TV series like the big bang theory, two and a
 half men ...
 
 In this field I can have a lot of TV series names. For example:
 
 <arr name="series">
   <str>Two and a Half Men</str>
   <str>How I Met Your Mother</str>
   <str>The Big Bang Theory</str>
 </arr>
 
 What I want to do is search and count how many documents are related to each
 series. I'm doing it using facet search on this field. But it's returning
 each word separately. Like this:
 
 <lst name="facet_counts">
   <lst name="facet_queries"/>
   <lst name="facet_fields">
     <lst name="series">
       <int name="bang">91</int>
       <int name="big">91</int>
       <int name="half">21</int>
       <int name="how">45</int>
       <int name="i">45</int>
       <int name="men">21</int>
       <int name="met">45</int>
       <int name="mother">45</int>
       <int name="theori">91</int>
       <int name="two">21</int>
       <int name="your">45</int>
     </lst>
   </lst>
   <lst name="facet_dates"/>
   <lst name="facet_ranges"/>
 </lst>
 
 And what I want is something like:
 
 <lst name="facet_counts">
   <lst name="facet_queries"/>
   <lst name="facet_fields">
     <lst name="series">
       <int name="Two and a Half Men">21</int>
       <int name="How I Met Your Mother">45</int>
       <int name="The Big Bang Theory">91</int>
     </lst>
   </lst>
   <lst name="facet_dates"/>
   <lst name="facet_ranges"/>
 </lst>
 
 Is there any possible way to do it with facet search? I don't want the
 terms, I just want each string including the white spaces. Do I have to
 change my fieldtype to do this?
 
 Thanks to everybody.
 
 Thiago
 
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/How-to-facet-data-from-a-multivalued-field-tp3897853p3897853.html
 Sent from the Solr - User mailing list archive at Nabble.com.
 




No webadmin for trunk?

2012-04-07 Thread Darren Govoni
Hi,
  Just updated solr trunk and tried the java -jar start.jar and
localhost:8983/solr/admin ... not found.

Where did it go?

thanks.



Re: No webadmin for trunk?

2012-04-07 Thread Darren Govoni
HTTP ERROR: 404
Problem accessing /solr. Reason:

Not Found



Powered by Jetty://

On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote:
 just go to localhost:8983/solr and you'll see the updated interface.
 
 On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote:
  Hi,
   Just updated solr trunk and tried the java -jar start.jar and
  localhost:8983/solr/admin.not found.
 
  Where did it go?
 
  thanks.
 
 




Re: No webadmin for trunk?

2012-04-07 Thread Darren Govoni
start.jar has no apps in it at all.

On Sat, 2012-04-07 at 09:47 -0400, Darren Govoni wrote:
 HTTP ERROR: 404
 Problem accessing /solr. Reason:
 
 Not Found
 
 
 
 Powered by Jetty://
 
 On Sat, 2012-04-07 at 09:04 -0400, Jamie Johnson wrote:
  just go to localhost:8983/solr and you'll see the updated interface.
  
  On Sat, Apr 7, 2012 at 8:23 AM, Darren Govoni dar...@ontrenet.com wrote:
   Hi,
Just updated solr trunk and tried the java -jar start.jar and
   localhost:8983/solr/admin.not found.
  
   Where did it go?
  
   thanks.
  
  
 
 




Re: No webadmin for trunk?

2012-04-07 Thread Darren Govoni
Yep. I did all kinds of ant clean, ant dist, ant example, etc.

My trunk rev.

At revision 1310773.

Example start.jar is broken. No webapp inside. :(

On Sat, 2012-04-07 at 16:11 +0200, Rafał Kuć wrote:
 Hello!
 
 Did you run 'ant example' ?
 




Re: No webadmin for trunk?

2012-04-07 Thread Darren Govoni
K. There is a solr.war in the webapps directory. But still get the 404.

On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote:
 Hello!
 
 start.jar shouldn't contain any webapp. If you look at the 'example'
 directory, you'll notice that there is a 'webapps' directory which
 should contain solr.war file.
 
  Btw. revision 1307647 works without a problem. I'll check out trunk in
  a few and try with the newest revision.
 




Re: No webadmin for trunk?

2012-04-07 Thread Darren Govoni
Now, it comes up. Not sure why it's acting weird. Will continue to look
at it.

On Sat, 2012-04-07 at 10:23 -0400, Darren Govoni wrote:
 K. There is a solr.war in the webapps directory. But still get the 404.
 
 On Sat, 2012-04-07 at 16:19 +0200, Rafał Kuć wrote:
  Hello!
  
  start.jar shouldn't contain any webapp. If you look at the 'example'
  directory, you'll notice that there is a 'webapps' directory which
  should contain solr.war file.
  
   Btw. revision 1307647 works without a problem. I'll check out trunk in
   a few and try with the newest revision.
  
 
 




Re: upgrade 3.5 to 4.0

2012-04-07 Thread Darren Govoni
In my opinion, it's never a good idea to overwrite files of a previous
version with a new version. 

The easiest thing would be to just deploy the solr war file into tomcat
and let tomcat manage the webapp, files, etc.

On Sat, 2012-04-07 at 22:39 -0400, Dan Foley wrote:
 I have download the nightly snapshot of v 4.0 and would like to install it
 to my tomcat install of solr 3.5
 
 can i simply overwrite the current files or is there a correct method for
 doing so?
 
 please advise.. thanks
 




Re: Does any one know when Solr 4.0 will be released.

2012-04-04 Thread Darren Govoni
No one knows. But if you ask the devs, they will say 'when it's done'.

One clue might be to monitor the bugs/issues scheduled for 4.0. When
they are all resolved, then it's ready.

On Wed, 2012-04-04 at 09:41 -0700, srinivas konchada wrote:
 Hello every one
 Does any one know when Solr 4.0 will be released? there is a specific
 feature that exists in 4.0 which we want to take advantage off. Problem is
 we cannot deploy some thing into production from trunk. We need to use an
 official release.
 
 
 Thanks
 Srinivas Konchada




Re: Duplicates in Facets

2012-04-04 Thread Darren Govoni
Try using Luke to look at your index and see if there are multiple
similar TFV's. You can browse them easily in Luke.

On Wed, 2012-04-04 at 23:35 -0400, Jamie Johnson wrote:
 I am currently indexing some information and am wondering why I am
 getting duplicates in facets.  From what I can tell they are the same,
 but is there any case that could cause this that I may not be thinking
 of?  Could this be some non printable character making it's way into
 the index?
 
 
 Sample output from luke
 
 <lst name="fields">
   <lst name="organization_umvs">
     <str name="type">string</str>
     <str name="schema">I--M---OFl</str>
     <str name="dynamicBase">*_umvs</str>
     <str name="index">(unstored field)</str>
     <int name="docs">332</int>
     <int name="distinct">-1</int>
     <lst name="topTerms">
       <int name="ORGANIZATION 1">328</int>
       <int name="ORGANIZATION 2">124</int>
       <int name="ORGANIZATION 2">36</int>
       <int name="ORGANIZATION 2">20</int>
       <int name="ORGANIZATION 3">4</int>
     </lst>
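One quick way to test the nonprintable-character theory outside Luke (a sketch; the values are made up, not from Jamie's index) is to compare the repr of two facet keys that look identical:

```python
# Two facet values that render identically but differ by a trailing
# non-breaking space (illustrative values, not from the real index).
a = "ORGANIZATION 2"
b = "ORGANIZATION 2\u00a0"

print(a == b)      # False: Lucene indexes them as distinct terms
print(repr(b))     # the repr makes the hidden character visible
```

Running the stored field values through repr (or dumping their bytes) before indexing usually exposes exactly which invisible character is splitting the facet counts.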
 




Custom scoring question

2012-03-29 Thread Darren Govoni
Hi,
 I have a situation I want to re-score document relevance.

Let's say I have two fields:

text: The quick brown fox jumped over the white fence.
terms: fox fence

Now my queries come in as:

terms:[* TO *]

and Solr scores them on that field. 

What I want is to rank them according to the distribution of field
terms within field text. Which is a per document calculation.

Can this be done with any kind of dismax? I'm not searching for known
terms at query time.

If not, what is the best way to implement a custom scoring handler to
perform this calculation and re-score/sort the results?

thanks for any tips!!!
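The per-document calculation described above can be done at index time and stored in a field to sort or boost by (a sketch; tokenization here is a naive split, not a Lucene analyzer, and the coverage metric itself is an illustrative choice): score each document by the fraction of its text tokens that appear in its terms field.

```python
# Sketch: score a document by how much of its text is covered by
# its "terms" field. Naive whitespace tokenization for illustration.

def term_coverage(text: str, terms: str) -> float:
    text_tokens = text.lower().replace(".", "").split()
    term_set = set(terms.lower().split())
    if not text_tokens:
        return 0.0
    hits = sum(1 for tok in text_tokens if tok in term_set)
    return hits / len(text_tokens)

score = term_coverage(
    "The quick brown fox jumped over the white fence.",
    "fox fence",
)
print(round(score, 3))   # 2 matching tokens out of 9 -> 0.222
```

The resulting number can be written into a numeric field (or used as an index-time boost) so queries like `terms:[* TO *]` can simply sort on it, sidestepping custom query-time scoring.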



Re: Custom scoring question

2012-03-29 Thread Darren Govoni
I'm going to try index time per-field boosting and do the boost
computation at index time and see if that helps.

On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote:
 Hi,
  I have a situation I want to re-score document relevance.
 
 Let's say I have two fields:
 
 text: The quick brown fox jumped over the white fence.
 terms: fox fence
 
 Now my queries come in as:
 
 terms:[* TO *]
 
 and Solr scores them on that field. 
 
 What I want is to rank them according to the distribution of field
 terms within field text. Which is a per document calculation.
 
 Can this be done with any kind of dismax? I'm not searching for known
 terms at query time.
 
 If not, what is the best way to implement a custom scoring handler to
 perform this calculation and re-score/sort the results?
 
 thanks for any tips!!!
 




Re: Custom scoring question

2012-03-29 Thread Darren Govoni
Yeah, I guess that would work. I wasn't sure if it would change relative
to other documents. But if it were to be combined with other fields,
that approach may not work because the calculation wouldn't include the
scoring for other parts of the query. So then you have the dynamic score
and what to do with it.

On Thu, 2012-03-29 at 16:29 -0300, Tomás Fernández Löbbe wrote:
 Can't you simply calculate that at index time and assign the result to a
 field, then sort by that field.
 
 On Thu, Mar 29, 2012 at 12:07 PM, Darren Govoni dar...@ontrenet.com wrote:
 
  I'm going to try index time per-field boosting and do the boost
  computation at index time and see if that helps.
 
  On Thu, 2012-03-29 at 10:08 -0400, Darren Govoni wrote:
   Hi,
I have a situation I want to re-score document relevance.
  
   Let's say I have two fields:
  
   text: The quick brown fox jumped over the white fence.
   terms: fox fence
  
   Now my queries come in as:
  
   terms:[* TO *]
  
   and Solr scores them on that field.
  
   What I want is to rank them according to the distribution of field
   terms within field text. Which is a per document calculation.
  
   Can this be done with any kind of dismax? I'm not searching for known
   terms at query time.
  
   If not, what is the best way to implement a custom scoring handler to
   perform this calculation and re-score/sort the results?
  
   thanks for any tips!!!
  
 
 
 




MLT and solrcloud?

2012-03-22 Thread Darren Govoni
Hi,
  It was mentioned before that SolrCloud has all the capability of
regular solr (including handlers) with the exception of the MLT handler.
As this is a key capability for Solr, is there work planned to include
the MLT in SolrCloud? If so when? Our efforts greatly depend on it. As
such, I'm happy to help anyway possible.

thanks,
Darren



Re: MLT and solrcloud?

2012-03-22 Thread Darren Govoni
Ok, I'll do what I can to help!

As always, appreciate the hard work Mark.


On Thu, 2012-03-22 at 17:31 -0400, Mark Miller wrote:
 On Mar 22, 2012, at 5:22 PM, Darren Govoni wrote:
 
  Hi,
   It was mentioned before that SolrCloud has all the capability of
  regular solr (including handlers) with the exception of the MLT handler.
  As this is a key capability for Solr, is there work planned to include
  the MLT in SolrCloud? If so when? Our efforts greatly depend on it. As
  such, I'm happy to help anyway possible.
  
  thanks,
  Darren
  
 
 Usually no real time tables here :) Depends on who jumps in when.
 
 Some work has already gone on for this here: 
 https://issues.apache.org/jira/browse/SOLR-788
 
 You might just try and jump start that issue again? As I get a free moment or 
 two, I'm happy to help commit a solution.
 
 - Mark Miller
 lucidimagination.com
 
 
 
 
 
 
 
 
 
 
 




RE: Re: maxClauseCount Exception

2012-03-19 Thread Darren Govoni

True, but how can you find documents containing that field without expanding
1000 clauses?

--- Original Message ---
On 3/19/2012 07:24 AM Erick Erickson wrote:
bq: So all I want to do is a simple all docs with something in this field,
and to highlight the field

But that doesn't really make sense to do at the Solr/Lucene level. All
you're saying is that you want that field highlighted. Wouldn't it be much
easier to just do this at the app level whenever your field had anything
returned in it?

Best
Erick

On Sat, Mar 17, 2012 at 8:07 PM, Darren Govoni dar...@ontrenet.com wrote:
 Thanks for the tip Hoss.

 I notice that it appears sometimes and was varying because my index runs
 would sometimes have different amounts of docs, etc.

 So all I want to do is a simple all docs with something in this field,
 and to highlight the field.

 Is the query expansion to all possible terms in the index really
 necessary? I could have 100's of thousands of possible terms. Why should
 they all become explicit query elements? Seems like overkill and
 underperformant.

 Is there another way with Lucene, or not really?

 On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote:
 :   I am suddenly getting a maxClauseCount exception for no reason. I am
 : using Solr 3.5. I have only 206 documents in my index.

 Unless things have changed, the reason you are seeing this is because
 _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting
 it into a giant boolean query of all the terms in that field -- so even if
 you only have 206 docs, if you have more than 206 values in that field in
 your index, you're going to go over 1024 terms.

 (you don't get this problem in a basic query, because it doesn't need to
 enumerate all the terms; it rewrites it to a ConstantScoreQuery)

 What you most likely want to do is move some of those clauses, like
 type_s:[*+TO+*] and usergroup_sm:admin, out of your main q query and
 into fq filters ... so they can be cached independently, won't
 contribute to scoring (just matching) and won't be used in highlighting.

 :
 params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2}
 hits=204 status=500 QTime=166 |#]

 : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1|
 : org.apache.solr.servlet.SolrDispatchFilter|
 : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery
 : $TooManyClauses: maxClauseCount is set to 1024
 :     at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
       ...
 :     at
 : org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304)
 :     at
 : org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)

 -Hoss
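Hoss's suggestion from the quoted reply, written out as request parameters (a sketch reconstructed from the logged query; the highlighting parameters are otherwise unchanged): the range and group clauses move from q into cacheable fq filters, which highlighting ignores.

```
q={!lucene q.op=OR df=text_t}(kind_s:doc OR kind_s:xml)
fq=type_s:[* TO *]
fq=usergroup_sm:admin
hl=true
hl.fl=text_t
rows=20
start=0
```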


Re: Inconsistent Results with ZooKeeper Ensemble and Four SOLR Cloud Nodes

2012-03-18 Thread Darren Govoni
I think he's asking if all the nodes (same machine or not) return a
response. Presumably you have different ports for each node since they
are on the same machine.

On Sun, 2012-03-18 at 14:44 -0400, Matthew Parker wrote:
 The cluster is running on one machine.
 
 On Sun, Mar 18, 2012 at 2:07 PM, Mark Miller markrmil...@gmail.com wrote:
 
  From every node in your cluster you can hit http://MACHINE1:8084/solr in
  your browser and get a response?
 
  On Mar 18, 2012, at 1:46 PM, Matthew Parker wrote:
 
   My cloud instance finally tried to sync. It looks like it's having
  connection issues, but I can bring the SOLR instance up in the browser so
  I'm not sure why it cannot connect to it. I got the following condensed log
  output:
  
   org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
   I/O exception (java.net.ConnectException) caught when processing
  request: Connection refused: connect
  
   org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
   I/O exception (java.net.ConnectException) caught when processing
  request: Connection refused: connect
  
   org.apache.commons.httpclient.HttpMethodDirector executeWithRetry
   I/O exception (java.net.ConnectException) caught when processing
  request: Connection refused: connect
  
   Retrying request
  
   shard update error StdNode:
  http://MACHINE1:8084/solr/:org.apache.solr.client.solrj.SolrServerException:
  http://MACHINE1:8084/solr
  at
  org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:
  483)
   ..
   ..
   ..
Caused by: java.net.ConnectException: Connection refused: connect
  at java.net.DualStackPlainSocketImpl.connect0(Native Method)
   ..
   ..
   ..
  
   try and ask http://MACHINE1:8084/solr to recover
  
   Could not tell a replica to recover
  
   org.apache.solr.client.solrj.SolrServerException:
  http://MACHINE1:8084/solr
 at
  org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
 ...
 ...
 ...
   Caused by: java.net.ConnectException: Connection refused: connect
  at java.net.DualStackPlainSocketImpl.waitForConnect(Native method)
  ..
  ..
  ..
  
   On Sat, Mar 17, 2012 at 10:10 PM, Mark Miller markrmil...@gmail.com
  wrote:
   Nodes talk to ZooKeeper as well as to each other. You can see the
  addresses they are trying to use to communicate with each other in the
  'cloud' view of the Solr Admin UI. Sometimes you have to override these, as
  the detected default may not be an address that other nodes can reach. As a
  limited example: for some reason my mac cannot talk to my linux box with
  its default detected host address of halfmetal:8983/solr - but the mac can
  reach my linux box if I use halfmetal.Local - so I have to override the
  published address of my linux box using the host attribute if I want to
  setup a cluster between my macbook and linux box.
  
    Each node talks to ZooKeeper to learn about the other nodes, including
   their addresses. Recovery is then done node to node using the appropriate
  addresses.
  
  
   - Mark Miller
   lucidimagination.com
  
   On Mar 16, 2012, at 3:00 PM, Matthew Parker wrote:
  
I'm still having issues replicating in my work environment. Can anyone
explain how the replication mechanism works? Is it communicating across
ports or through zookeeper to manager the process?
   
   
   
   
On Thu, Mar 8, 2012 at 10:57 PM, Matthew Parker 
mpar...@apogeeintegration.com wrote:
   
All,
   
I recreated the cluster on my machine at home (Windows 7, Java
  1.6.0.23,
apache-solr-4.0-2012-02-29_09-07-30) , sent some document through
  Manifold
using its crawler, and it looks like it's replicating fine once the
documents are committed.
   
This must be related to my environment somehow. Thanks for your help.
   
Regards,
   
Matt
   
On Fri, Mar 2, 2012 at 9:06 AM, Erick Erickson 
  erickerick...@gmail.comwrote:
   
Matt:
   
Just for paranoia's sake, when I was playing around with this (the
_version_ thing was one of my problems too) I removed the entire data
directory as well as the zoo_data directory between experiments (and
recreated just the data dir). This included various index.2012
files and the tlog directory on the theory that *maybe* there was
  some
confusion happening on startup with an already-wonky index.
   
If you have the energy and tried that it might be helpful
  information,
but it may also be a total red-herring
   
FWIW
Erick
   
On Thu, Mar 1, 2012 at 8:28 PM, Mark Miller markrmil...@gmail.com
wrote:
I assuming the windows configuration looked correct?
   
Yeah, so far I can not spot any smoking gun...I'm confounded at the
moment. I'll re read through everything once more...
   
- Mark
   
   
   
  
  
  
  
  
  
  
  
  
  
  
  
  
  
   

Re: maxClauseCount Exception

2012-03-17 Thread Darren Govoni
Thanks for the tip Hoss.

I notice that it appears sometimes and was varying because my index runs
would sometimes have different amounts of docs, etc.

So all I want to do is a simple all docs with something in this field,
and to highlight the field. 

Is the query expansion to all possible terms in the index really
necessary? I could have 100's of thousands of possible terms. Why should
they all become explicit query elements? Seems like overkill and
underperformant.

Is there another way with Lucene, or not really?

On Thu, 2012-03-08 at 16:18 -0800, Chris Hostetter wrote:
 :   I am suddenly getting a maxClauseCount exception for no reason. I am
 : using Solr 3.5. I have only 206 documents in my index.
 
 Unless things have changed, the reason you are seeing this is because 
 _highlighting_ a query (clause) like type_s:[*+TO+*] requires rewriting 
 it into a giant boolean query of all the terms in that field -- so even if 
 you only have 206 docs, if you have more than 206 values in that field in 
 your index, you're going to go over 1024 terms.
 
 (you don't get this problem in a basic query, because it doesn't need to 
 enumerate all the terms; it rewrites it to a ConstantScoreQuery)
 
 What you most likely want to do is move some of those clauses, like 
 type_s:[*+TO+*] and usergroup_sm:admin, out of your main q query and 
 into fq filters ... so they can be cached independently, won't 
 contribute to scoring (just matching) and won't be used in highlighting.
 
 : 
 params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2}
  hits=204 status=500 QTime=166 |#]
 
 : [#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1|
 : org.apache.solr.servlet.SolrDispatchFilter|
 : _ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery
 : $TooManyClauses: maxClauseCount is set to 1024
 : at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
   ...
 : at
 : org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304)
 : at
 : org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
 
 -Hoss




RE: Solr 4.0 and production environments

2012-03-07 Thread Darren Govoni

As a rule of thumb, many will say not to go to production with a pre-release baseline. So until 
Solr4 goes final and stable, it's best not to assume too much about it.

Second suggestion is to properly stage new technologies in your product such 
that they go through their own validation. And so to that end, jump right in 
and start using Solr4 and see for yourself! It's a great technology.

--- Original Message ---
On 3/7/2012 11:47 AM Dirceu Vieira wrote:
Hi All,

Has anybody started using Solr 4.0 in production environments? Is it stable
enough?
I'm planning to create a proof of concept using Solr 4.0; we have some
projects that will gain a lot from features such as near real time search,
joins and others, that are available only in version 4.

Is it too risky to think of using it right now?
What are your thoughts and experiences with that?

Best regards,

-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


Re: Building a resilient cluster

2012-03-06 Thread Darren Govoni
What I think was mentioned on this a bit ago is that the index stops
working if one of the nodes goes down, unless it's a replica.

You have 2 nodes running with numShards=2? Thus if one goes down, the
entire index is inoperable. In the future I'm hoping this changes such
that the index cluster continues to operate but will lack results from
the downed node. Maybe this has changed in recent trunk updates, though.
Not sure.

On Mon, 2012-03-05 at 20:49 -0800, Ranjan Bagchi wrote:
 Hi Mark,
 
 So I tried this: started up one instance w/ zookeeper, and started a second
 instance defining a shard name in solr.xml -- it worked, searching would
 search both indices, and looking at the zookeeper ui, I'd see the second
 shard.  However, when I brought the second server down -- the first one
 stopped working:  it didn't kick the second shard out of the cluster.
 
 Any way to do this?
 
 Thanks,
 
 Ranjan
 
 
  From: Mark Miller markrmil...@gmail.com
  To: solr-user@lucene.apache.org
  Cc:
  Date: Wed, 29 Feb 2012 22:57:26 -0500
  Subject: Re: Building a resilient cluster
  Doh! Sorry - this was broken - I need to fix the doc or add it back.
 
  The shard id is actually set in solr.xml since its per core - the sys prop
  was a sugar option we had setup. So either add 'shard' to the core in
  solr.xml, or to make it work like it does in the doc, do:
 
  <core name="collection1" shard="${shard:}" instanceDir="." />
 
  That sets shard to the 'shard' system property if its set, or as a default,
  act as if it wasn't set.
 
  I've been working with custom shard ids mainly through solrj, so I hadn't
  noticed this.
 
  - Mark
 
  On Wed, Feb 29, 2012 at 10:36 AM, Ranjan Bagchi ranjan.bag...@gmail.com
  wrote:
 
   Hi,
  
   At this point I'm ok with one zk instance being a point of failure, I
  just
   want to create sharded solr instances, bring them into the cluster, and
  be
   able to shut them down without bringing down the whole cluster.
  
   According to the wiki page, I should be able to bring up new shard by
  using
   shardId [-D shardId], but when I did that, the logs showed it replicating
   an existing shard.
  
   Ranjan
   Andre Bois-Crettez wrote:
  
You have to run ZK on a at least 3 different machines for fault
tolerance (a ZK ensemble).
   
   
  
   http://wiki.apache.org/solr/SolrCloud#Example_C:_Two_shard_cluster_with_shard_replicas_and_zookeeper_ensemble
   
Ranjan Bagchi wrote:
 Hi,

 I'm interested in setting up a solr cluster where each machine [at
   least
 initially] hosts a separate shard of a big index [too big to sit on
  the
 machine].  I'm able to put a cloud together by telling it that I have
   (to
 start out with) 4 nodes, and then starting up nodes on 3 machines pointing
 at the zkInstance.  I'm able to load my sharded data onto each
  machine
 individually and it seems to work.

 My concern is that it's not fault tolerant:  if one of the
   non-zookeeper
 machines falls over, the whole cluster won't work.  Also, I can't create
 a shard with more data, and have it work within the existing cloud.

 I tried using -DshardId=shard5 [on an existing 4-shard cluster], but
 it just started replicating, which doesn't seem right.

 Are there ways around this?

 Thanks,
 Ranjan Bagchi


  
 
 
 
  --
  - Mark
 
  http://www.lucidimagination.com
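Mark's fix above, in context (a hedged sketch of solr.xml for that era of trunk; the surrounding attributes are illustrative and vary by revision):

```xml
<!-- Sketch: per-core shard id read from the 'shard' system property,
     falling back to unset so automatic assignment still works. -->
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="collection1" instanceDir="." shard="${shard:}" />
  </cores>
</solr>
```

With this in place, starting a node with `-Dshard=shard5` pins the core to that shard id instead of letting it join as a replica of an existing shard.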
 
 




Re: [SoldCloud] Slow indexing

2012-03-05 Thread darren
A question relating to this.

If you are running a single ZK node, but say 10 other nodes and then
parallel index on each of those nodes, will the ZK be hit by all 10
indexing nodes constantly? i.e. very chatty?

If one of those 10 indexing nodes goes down or falls out of sync and comes
back, does ZK block the state of indexing until that single node catches
back up?


 On Mar 4, 2012, at 5:43 PM, Markus Jelsma wrote:

 everything stalls after it lists all segment files and that a ZK state
 change has occurred.

 Can you get a stack trace here? I'll try to respond to more tomorrow. What
 version of trunk are you using? We have been making fixes and improvements
 all the time, so need to get a frame of reference.

 When a client node cannot talk to zookeeper, because it may not know
 certain things it should (what if a leader changes?), it must reject
 updates (searches will still work). Why can't the node talk to zookeeper?
 Perhaps the load is so high on the server, it cannot respond to zk within
 the session timeout? I really don't know yet. When this happens though, it
 forces a recovery when/if the node can reconnect to zookeeper.

 We have not yet started on optimizing bulk indexing - currently an update
 is added locally *before* sending updates in parallel to each replica.
 Then we wait for each response before responding to the client. We plan to
 offer more optimizations and options around this.

 Feed back will be useful in making some of these improvements.


 - Mark Miller
 lucidimagination.com















Re: Trunk build errors

2012-02-23 Thread darren
I updated yesterday and did an ant clean, ant test.

I will try a clean pull next.

I'm on linux. Perhaps an ant version issue?

 There was recently some work done to get better about checking
 on licenses, when did you last get trunk? About 9 days ago was
 the last go-round.

 And did you do an 'ant clean'?

 It works on my machine with a fresh pull this morning.

 Best
 Erick

 On Wed, Feb 22, 2012 at 5:27 PM, Darren Govoni dar...@ontrenet.com
 wrote:
 Hi,
  I am getting numerous errors preventing a build of solrcloud trunk.

  [licenses] MISSING LICENSE for the following file:
 

 Any tips to get a clean build working?

 thanks






maxClauseCount error

2012-02-22 Thread Darren Govoni
Hi,
  I am suddenly getting a maxClauseCount error and don't know why. I am
using Solr 3.5





maxClauseCount Exception

2012-02-22 Thread Darren Govoni
Hi,
  I am suddenly getting a maxClauseCount exception for no reason. I am
using Solr 3.5. I have only 206 documents in my index.

Any ideas? This is weird.

QUERY PARAMS: [hl, hl.snippets, hl.simple.pre, hl.simple.post, fl,
hl.mergeContiguous, hl.usePhraseHighlighter, hl.requireFieldMatch,
echoParams, hl.fl, q, rows, start]|#]


[#|2012-02-22T13:40:13.129-0500|INFO|glassfish3.1.1|
org.apache.solr.core.SolrCore|_ThreadID=22;_ThreadName=Thread-2;|[]
webapp=/solr3 path=/select
params={hl=true&hl.snippets=4&hl.simple.pre=<b>&hl.simple.post=</b>&fl=*,score&hl.mergeContiguous=true&hl.usePhraseHighlighter=true&hl.requireFieldMatch=true&echoParams=all&hl.fl=text_t&q={!lucene+q.op%3DOR+df%3Dtext_t}+(+kind_s:doc+OR+kind_s:xml)+AND+(type_s:[*+TO+*])+AND+(usergroup_sm:admin)&rows=20&start=0&wt=javabin&version=2}
 hits=204 status=500 QTime=166 |#]


[#|2012-02-22T13:40:13.131-0500|SEVERE|glassfish3.1.1|
org.apache.solr.servlet.SolrDispatchFilter|
_ThreadID=22;_ThreadName=Thread-2;|org.apache.lucene.search.BooleanQuery$TooManyClauses: maxClauseCount is set to 1024
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:136)
at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:127)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:51)
at org.apache.lucene.search.ScoringRewrite$1.addClause(ScoringRewrite.java:41)
at org.apache.lucene.search.ScoringRewrite$3.collect(ScoringRewrite.java:95)
at org.apache.lucene.search.TermCollectingRewrite.collectTerms(TermCollectingRewrite.java:38)
at org.apache.lucene.search.ScoringRewrite.rewrite(ScoringRewrite.java:93)
at org.apache.lucene.search.MultiTermQuery.rewrite(MultiTermQuery.java:304)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:158)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:98)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:385)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:217)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:185)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:205)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
at org.apache.so
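
The trace shows why a 206-document index can still hit the 1024-clause cap:
the highlighter (WeightedSpanTermExtractor) forces multi-term queries such as
prefixes, wildcards, and open ranges like type_s:[* TO *] to be rewritten into
one boolean clause per matching indexed term. A toy model of that rewrite
(illustrative names, not Lucene's API):

```python
MAX_CLAUSE_COUNT = 1024  # Lucene's default BooleanQuery limit

class TooManyClauses(Exception):
    pass

def rewrite_prefix_query(prefix, index_terms, max_clauses=MAX_CLAUSE_COUNT):
    """Toy model of Lucene's scoring rewrite: a prefix/wildcard/range
    query expands into one boolean clause per matching *unique indexed
    term*, and the expansion is capped at max_clauses."""
    clauses = []
    for term in index_terms:
        if term.startswith(prefix):
            if len(clauses) >= max_clauses:
                raise TooManyClauses(f"maxClauseCount is set to {max_clauses}")
            clauses.append(term)
    return clauses
```

The clause count depends on the number of unique matching terms, not the
document count, so a small index with many distinct terms can still overflow.
The usual fixes in the 3.x era were raising maxBooleanClauses in
solrconfig.xml or avoiding open-ended ranges in highlighted queries.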



Trunk build errors

2012-02-22 Thread Darren Govoni
Hi,
  I am getting numerous errors preventing a build of solrcloud trunk.

 [licenses] MISSING LICENSE for the following file:


Any tips to get a clean build working?

thanks

filter query or boolean?

2012-02-21 Thread darren

Hi,
  Which is faster for boolean compound expressions. filter queries or a
single query with boolean expressions?
For that matter, is there any difference other than maybe speed?

thanks
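
One practical difference beyond speed, as commonly described for Solr of this
era (a hedged sketch, not a benchmark): a clause placed in fq is excluded from
scoring and its document set is cached independently in the filterCache, while
the same clause folded into q contributes to the score and is re-evaluated per
query. A toy model of the cache behaviour:

```python
filter_cache = {}

def filter_docset(fq, match_fn, all_docs):
    """Toy model of Solr's filterCache (illustrative, not Solr's API):
    each fq string is evaluated once into a document-id set and reused
    across queries; real Solr caches bit sets keyed by the filter query."""
    if fq not in filter_cache:
        filter_cache[fq] = {d["id"] for d in all_docs if match_fn(d)}
    return filter_cache[fq]

# Hypothetical corpus: odd ids are "doc", even ids are "xml".
docs = [{"id": i, "kind": "doc" if i % 2 else "xml"} for i in range(10)]
ids = filter_docset("kind:doc", lambda d: d["kind"] == "doc", docs)
```

Because the set is cached and score-neutral, fq tends to win for restrictions
that repeat across queries; a boolean clause in q is the right choice when the
condition should influence ranking.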


Re: SolrJ + SolrCloud

2012-02-12 Thread Darren Govoni
Thanks Mark. Is there any plan to make all the Solr search handlers work
with SolrCloud, like MLT? That missing feature would prohibit us from
using SolrCloud at the moment. :(

On Sat, 2012-02-11 at 18:24 -0500, Mark Miller wrote:
 On Feb 11, 2012, at 6:02 PM, Darren Govoni wrote:
 
  Hi,
   Do all the normal facilities of Solr work with SolrCloud from SolrJ?
  Things like /mlt, /cluster, facets , tvf's, etc.
  
  Darren
  
 
 
 SolrJ works the same in SolrCloud mode as it does in non SolrCloud mode - 
 it's fully supported. There is even a new SolrJ client called CloudSolrServer 
 that has built in cluster awareness and load balancing.
 
 In terms of what is supported - anything that is supported with distributed 
 search - that is most things, but there is the odd man out - like MLT - looks 
 like an issue is open here: https://issues.apache.org/jira/browse/SOLR-788 
 but it's not resolved yet.
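
 The cluster awareness and load balancing mentioned above can be sketched
 like this (an illustrative Python model, not SolrJ's actual API): keep the
 list of live nodes published in ZooKeeper and round-robin requests across
 them, skipping nodes marked down.

```python
import itertools

class ToyLoadBalancedClient:
    """Hypothetical sketch of what a cluster-aware client such as
    CloudSolrServer does: rotate requests over live nodes."""

    def __init__(self, live_nodes):
        self.live_nodes = list(live_nodes)
        self._cycle = itertools.cycle(self.live_nodes)

    def next_node(self):
        # Try each node at most once per call, skipping downed nodes.
        for _ in range(len(self.live_nodes)):
            node = next(self._cycle)
            if node.get("up", True):
                return node
        raise RuntimeError("no live nodes")
```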
 
 - Mark Miller
 lucidimagination.com


SolrJ + SolrCloud

2012-02-11 Thread Darren Govoni
Hi,
  Do all the normal facilities of Solr work with SolrCloud from SolrJ?
Things like /mlt, /cluster, facets , tvf's, etc.

Darren



Re: Range facet - Count in facet menu != Count in search results

2012-02-10 Thread Darren Govoni
Double-check your default operator for a faceted search vs. a regular
search. I caught that difference in my own work, and it explained a
similar discrepancy.

On Fri, 2012-02-10 at 07:45 -0800, Yuhao wrote:
 Jay,
 
 Was the curly closing bracket } intentional?  I'm using 3.4, which also 
 supports fq=price:[10 TO 20].  The problem is the results are not working 
 properly.
 
 
 
 
 
  From: Jan Høydahl jan@cominvent.com
 To: solr-user@lucene.apache.org; Yuhao nfsvi...@yahoo.com 
 Sent: Thursday, February 9, 2012 7:45 PM
 Subject: Re: Range facet - Count in facet menu != Count in search results
  
 Hi,
 
 If you use trunk (4.0) version, you can say fq=price:[10 TO 20} and have the 
 upper bound be exclusive.
 
 --
 Jan Høydahl, search solution architect
 Cominvent AS - www.cominvent.com
 Solr Training - www.solrtraining.com
 
 On 10. feb. 2012, at 00:58, Yuhao wrote:
 
  I've changed the facet.range.include option to every possible value 
  (lower, upper, edge, outer, all)**.  It only changes the count shown in the 
  Ranges facet menu on the left.  It has no effect on the count and results 
  shown in search results, which ALWAYS is inclusive of both the lower AND 
  upper bounds (which is equivalent to include = all).  Is this by design?  
  I would like to make the search results include the lower bound, but not 
  the upper bound.  Can I do that?
  
  My range field is multi-valued, but I don't think that should be the 
  problem.
  
  ** Actually, it doesn't like outer for some reason, which leaves the 
  facet completely empty.
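
 The bound semantics being asked for map onto a simple interval test. In the
 [10 TO 20} syntax Jan mentions, '[' and ']' are inclusive bounds while '{'
 and '}' are exclusive; a sketch of the desired include-lower/exclude-upper
 behaviour:

```python
def in_range(value, lower, upper, include_lower=True, include_upper=False):
    """Interval test matching the [10 TO 20} syntax discussed above.
    Defaults model the behaviour Yuhao wants: lower bound included,
    upper bound excluded."""
    above = value >= lower if include_lower else value > lower
    below = value <= upper if include_upper else value < upper
    return above and below
```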




Re: SolrCloud is in trunk.

2012-02-08 Thread darren

Good job on this work. A monumental effort.

On Wed, 8 Feb 2012 16:41:13 -0500, Mark Miller markrmil...@gmail.com
wrote:
 For those that are interested and have not noticed, the latest work on
 SolrCloud and distributed indexing is now in trunk.
 
 SolrCloud is our name for a new set of distributed capabilities that
 improve upon the old-style distributed search and index-based
replication.
 
 It provides for high availability and fault tolerance while allowing for
 near realtime search and an interface that matches what you are used to
 with previous versions of Solr.
 
 We are looking to release this in the next 4.0 release, and any feedback
 early users can provide will be very useful. So if you have an interest
in
 these types of features, please take the latest trunk build for a spin
and
 provide some feedback. 
 
 There is still a lot more planned, so feel free to chime in on what you
 would like to see - this is essentially the end of stage one. 
 
 You can read more about what we have done on the wiki:
 http://wiki.apache.org/solr/SolrCloud
 
 Also, a couple blog posts I recently saw pop up:
 

http://blog.sematext.com/2012/02/01/solrcloud-distributed-realtime-search
 http://outerthought.org/blog/491-ot.html
 
 I'll contribute my own blog post as well when I get a chance, but there
 should be a fair amount of info there to get you started if you are
 interested. 
 
 Thanks,
 
 - Mark Miller
 lucidimagination.com


Re: SolrCloud war?

2012-02-03 Thread Darren Govoni

UPDATE:

I set the jetty.port system property on my app server[1] to the app 
server's open port and was able to get two Solr shards to talk.


The overall properties I set are:

App server domain 1:

bootstrap_confdir
collection.configName
jetty.port
solr.solr.home
zkRun

App server domain 2:

bootstrap_confdir
collection.configName
jetty.port
solr.solr.home
zkHost

I deployed each war app into the /solr context. I presume it's needed 
for remote URL addressing.

I checked the zookeeper config page and it shows both shards.

Awesome.

[1] Glassfish 3.1.1

On 02/01/2012 08:50 PM, Mark Miller wrote:

I have not yet tried to run SolrCloud in another app server, but it shouldn't 
be a problem.

One issue you might have is that we count on hostPort coming from the 
system property jetty.port. This is set in the default solr.xml - the hostPort 
defaults to jetty.port. You probably want to explicitly pass -DhostPort= if you 
are not going to use jetty.port.
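
The fallback Mark describes amounts to a two-step property lookup, sketched
here (illustrative model of solr.xml's ${hostPort:${jetty.port}}-style
substitution, with system properties represented as a plain dict):

```python
def resolve_host_port(props):
    """Sketch of the fallback described above: hostPort is used if set,
    otherwise the value of jetty.port; outside Jetty you must supply
    one of the two yourself (e.g. via -DhostPort=)."""
    return props.get("hostPort") or props.get("jetty.port")
```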


- Mark Miller
lucidimagination.com
On Feb 1, 2012, at 2:44 PM, Darren Govoni wrote:


Hi,
  I'm trying to get the SolrCloud2 examples to work using a war-deployed Solr 
in Glassfish.
The startup properties must be different in this case, because it has 
trouble connecting to ZooKeeper when I deploy the solr war file.

Perhaps the embedded zookeeper has trouble running in an app server?

Any tips appreciated!

Darren

On 01/30/2012 06:58 PM, Darren Govoni wrote:

Hi,
  Is there any issue with running the new SolrCloud deployed as a war in 
another app server?
Has anyone tried this yet?

thanks.