Re: svn commit: r577427 - in /lucene/solr/trunk/client/java/solrj/test/org/apache/solr/client/solrj: LargeVolumeTestBase.java embedded/LargeVolumeEmbeddedTest.java embedded/LargeVolumeJettyTest.java

2007-09-21 Thread Thorsten Scherler
On Thu, 2007-09-20 at 11:11 -0700, Chris Hostetter wrote:
> : You can put it in the lib home and ant will find junit. I have it in my
> : project. 
> 
> : I can submit a patch tomorrow if you want.
> 
> I'm not sure which "lib home" you are talking about, but I'm certainly 
> open to a patch that allows us to bundle junit if we want so new developers 
> don't have to get it themselves ... i tried doing this not too long ago 
> when i read that <taskdef> could specify classpaths for finding the task 
> -- it worked great for some things (like PMD) but i couldn't get it to work 
> with ant to save my life.

done

https://issues.apache.org/jira/browse/SOLR-362

salu2

> 
> 
> : I am importing solr build scripts in my project and build it from my
> : project without problem because the junit.jar is in my classpath. As
> : soon as I want to build solr directly I can't because of the junit
> : check.
> 
> (Note: even if we can't get a patch working that does this, this specific 
> problem is easy to deal with: even if junit isn't in your ANT_LIB, you can 
> always use ant's -lib option to do this too.)
> 
> 
> -Hoss
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



[jira] Updated: (SOLR-362) bundle junit with solr

2007-09-21 Thread Thorsten Scherler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thorsten Scherler updated SOLR-362:
---

Attachment: junit.include.diff
junit-4.3.jar

Patch to bundle junit with solr.

> bundle junit with solr
> --
>
> Key: SOLR-362
> URL: https://issues.apache.org/jira/browse/SOLR-362
> Project: Solr
>  Issue Type: Improvement
>Reporter: Thorsten Scherler
> Attachments: junit-4.3.jar, junit.include.diff
>
>
> http://marc.info/?t=11902336334&r=1&w=2
> ant -version
> Apache Ant version 1.7.0 compiled on December 13 2006
> Trivial modification to the build.xml to include junit with solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-362) bundle junit with solr

2007-09-21 Thread Thorsten Scherler (JIRA)
bundle junit with solr
--

 Key: SOLR-362
 URL: https://issues.apache.org/jira/browse/SOLR-362
 Project: Solr
  Issue Type: Improvement
Reporter: Thorsten Scherler


http://marc.info/?t=11902336334&r=1&w=2

ant -version
Apache Ant version 1.7.0 compiled on December 13 2006

Trivial modification to the build.xml to include junit with solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (SOLR-363) Use ivy for dependency resolving

2007-09-21 Thread Thorsten Scherler (JIRA)
Use ivy for dependency resolving


 Key: SOLR-363
 URL: https://issues.apache.org/jira/browse/SOLR-363
 Project: Solr
  Issue Type: Improvement
Reporter: Thorsten Scherler


First cut on ivy support.

Removed all libs in lib/ but not yet for the web app.

Patch includes a basic ivy repo (tar) for all jars that are ATM in neither ivy 
nor maven.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: svn commit: r577427 - in /lucene/solr/trunk/client/java/solrj/test/org/apache/solr/client/solrj: LargeVolumeTestBase.java embedded/LargeVolumeEmbeddedTest.java embedded/LargeVolumeJettyTest.java

2007-09-21 Thread Thorsten Scherler
On Thu, 2007-09-20 at 13:51 -0400, Erik Hatcher wrote:
...
> I'm happy to see contributions both for Ivy and Maven2 based builds  
> of Solr.  I'm sure we can do it in a non-intrusive way to the current  
> Ant build so that folks can try it out.  With a patch and a wiki page  
> with instructions that'd be good enough to get the ball rolling.

done

https://issues.apache.org/jira/browse/SOLR-363

salu2

-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



[jira] Updated: (SOLR-363) Use ivy for dependency resolving

2007-09-21 Thread Thorsten Scherler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thorsten Scherler updated SOLR-363:
---

Attachment: ivy.support.diff
ivy.tar

> Use ivy for dependency resolving
> 
>
> Key: SOLR-363
> URL: https://issues.apache.org/jira/browse/SOLR-363
> Project: Solr
>  Issue Type: Improvement
>Reporter: Thorsten Scherler
> Attachments: ivy.support.diff, ivy.tar
>
>
> First cut on ivy support.
> Removed all libs in lib/ but not yet for the web app.
> Patch includes a basic ivy repo (tar) for all jars that are ATM in neither 
> ivy nor maven.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-363) Use ivy for dependency resolving

2007-09-21 Thread Thorsten Scherler (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529361
 ] 

Thorsten Scherler commented on SOLR-363:


After applying the patch do:

ant ivy.resolve

If we want to keep the ivy support we should add a dependency to the compile 
target.

> Use ivy for dependency resolving
> 
>
> Key: SOLR-363
> URL: https://issues.apache.org/jira/browse/SOLR-363
> Project: Solr
>  Issue Type: Improvement
>Reporter: Thorsten Scherler
> Attachments: ivy.support.diff, ivy.tar
>
>
> First cut on ivy support.
> Removed all libs in lib/ but not yet for the web app.
> Patch includes a basic ivy repo (tar) for all jars that are ATM in neither 
> ivy nor maven.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-359) LukeResponse.FieldInfo should expose the selected flags info and the indexInfo

2007-09-21 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529506
 ] 

Grant Ingersoll commented on SOLR-359:
--

I am working on extending the LukeRequestHandler (LRH) and the LukeResponse to 
give more info about the FieldTypes.  Right now, the LRH returns the name of 
the FieldType, but does not return the class name.  Also, I would like to 
return more structured access to the FieldType information by extracting 
it from the IndexInfo in the same way the FieldInfo does now.

Currently, I am mirroring what is done for the FieldInfo to add FieldTypeInfo.

One question, though: LRH currently writes out the instance of the Analyzer 
(around line 371 of LRH); would it be better to just return the class name?  Or 
do people actually need the value of Analyzer.toString()?

Also, do I need to worry about back-compatibility of the message?  That is, if 
I add in the class name of the FieldType instance (i.e. solr.IntField, etc.) 
does that break the compatibility of the message?

I will post a new patch shortly which has more access to the FieldType info.

> LukeResponse.FieldInfo should expose the selected flags info and the indexInfo
> --
>
> Key: SOLR-359
> URL: https://issues.apache.org/jira/browse/SOLR-359
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - java
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-359.patch
>
>
> The LukeResponse has access, when reading in the NamedList, to more 
> information than is currently being shared, like the "flags" for each 
> FieldInfo and/or the indexInfo object, so that more info about the fields may 
> be used by the client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-215) Multiple Solr Cores - remove static singleton

2007-09-21 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529509
 ] 

Yonik Seeley commented on SOLR-215:
---

FYI, firstSearcher/newSearcher hooks are now broken because the constructor of 
AbstractSolrEventListener was changed to take a SolrCore, and the code in 
SolrCore that creates event listeners does a simple newInstance().

> Multiple Solr Cores - remove static singleton
> -
>
> Key: SOLR-215
> URL: https://issues.apache.org/jira/browse/SOLR-215
> Project: Solr
>  Issue Type: Improvement
>Reporter: Henri Biestro
>Priority: Minor
> Fix For: 1.3
>
> Attachments: solr-215.patch, solr-215.patch, solr-215.patch, 
> solr-215.patch, solr-215.patch.zip, solr-215.patch.zip, solr-215.patch.zip, 
> solr-215.patch.zip, solr-215.patch.zip, solr-215.patch.zip, 
> solr-215.patch.zip, solr-trunk-533775.patch, solr-trunk-538091.patch, 
> solr-trunk-542847-1.patch, solr-trunk-542847.patch, solr-trunk-src.patch
>
>
> WHAT:
> As of 1.2, Solr only instantiates one SolrCore which handles one Lucene index.
> This patch is intended to allow multiple cores in Solr which also brings 
> multiple indexes capability.
> The patch file to grab is solr-215.patch.zip (see MISC section below).
> WHY:
> The current Solr practical wisdom is that one schema - thus one index - is 
> most likely to accommodate your indexing needs, using a filter to segregate 
> documents if needed. If you really need multiple indexes, deploy multiple web 
> applications.
> There are some use cases however where having multiple indexes or multiple 
> cores through Solr itself may make sense.
> Multiple cores:
> Deployment issues within some organizations where IT will resist deploying 
> multiple web applications.
> Seamless schema update where you can create a new core and switch to it 
> without starting/stopping servers.
> Embedding Solr in your own application (instead of 'raw' Lucene) and 
> functionally needing to segregate schemas & collections.
> Multiple indexes:
> Multiple language collections where each document exists in different 
> languages, analysis being language dependent.
> Having document types that have nothing (or very little) in common with 
> respect to their schema, their lifetime/update frequencies or even collection 
> sizes.
> HOW:
> The best analogy is to consider that instead of deploying multiple 
> web-application, you can have one web-application that hosts more than one 
> Solr core. The patch does not change any of the core logic (nor the core 
> code); each core is configured & behaves exactly as the one core in 1.2; the 
> various caches are per-core & so is the info-bean-registry.
> What the patch does is replace the SolrCore singleton by a collection of 
> cores; all the code modifications are driven by the removal of the different 
> singletons (the config, the schema & the core).
> Each core is 'named' and a static map (keyed by name) allows them to be 
> managed easily.
> You declare one servlet filter mapping per core you want to expose in the 
> web.xml; this allows easy access to each core through a different url. 
> USAGE (example web deployment, patch installed):
> Step0
> java -Durl='http://localhost:8983/solr/core0/update' -jar post.jar solr.xml 
> monitor.xml
> Will index the 2 documents in solr.xml & monitor.xml
> Step1:
> http://localhost:8983/solr/core0/admin/stats.jsp
> Will produce the statistics page from the admin servlet on core0 index; 2 
> documents
> Step2:
> http://localhost:8983/solr/core1/admin/stats.jsp
> Will produce the statistics page from the admin servlet on core1 index; no 
> documents
> Step3:
> java -Durl='http://localhost:8983/solr/core0/update' -jar post.jar ipod*.xml
> java -Durl='http://localhost:8983/solr/core1/update' -jar post.jar mon*.xml
> Adds the ipod*.xml to index of core0 and the mon*.xml to the index of core1;
> running queries from the admin interface, you can verify indexes have 
> different content. 
> USAGE (Java code):
> //create a configuration
> SolrConfig config = new SolrConfig("solrconfig.xml");
> //create a schema
> IndexSchema schema = new IndexSchema(config, "schema0.xml");
> //create a core from the two others.
> SolrCore core = new SolrCore("core0", "/path/to/index", config, schema);
> //Accessing a core:
> SolrCore core = SolrCore.getCore("core0"); 
> PATCH MODIFICATIONS DETAILS (per package):
> org.apache.solr.core:
> The heaviest modifications are in SolrCore & SolrConfig.
> SolrCore is the most obvious modification; instead of a singleton, there is a 
> static map of cores keyed by names and assorted methods. To retain some 
> compatibility, the 'null' named core replaces the singleton for the relevant 
> methods, for instance SolrCore.getCore(). One small constraint on the core 
> name is they can't contain '/' or '

HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
I wanted to take a step back for a second and think about if HTTP was
really the right choice for the transport for distributed search.

I think the high-level approach in SOLR-303 is the right way to go
about it, but I'm unsure if HTTP is the right transport.

Pro HTTP:
  - using HTTP allows one to use an http load-balancer to distribute
load across multiple copies of the same shard by assigning a VIP
(virtual IP) to each shard.
  - because you do pretty much everything by hand, you know that there
isn't some hidden limitation that will jump out and bite you later.

Cons HTTP:
 - you end up doing everything by hand... connection handling, request
serialization, response parsing, etc...
 - goes through normal servlet channels... every sub-request will be
logged to the access logs, slowing things down.
- more network bandwidth used unless we come up with a new
BinaryResponseWriter and Parser

Currently, SOLR-303 uses and parses the XML response format, which has
some serious downsides:
- response size limits scalability and how deep in responses you can go...
  If you want to retrieve documents 5000 through 5009, even though the
user only requested 10 documents, the top-level searcher needs to get
the top 5009 documents from *each* shard... and that can quickly
exhaust the network bandwidth of the NIC.  XML parsing on the order of
nShards*5009 entries won't be any picnic either.
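
As a toy sketch of that bandwidth argument (a hypothetical fixed-width
doc-id/score framing -- nothing like this exists in Solr today), a binary
writer could frame each hit in 8 bytes:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical compact framing for a doc list: a count header followed by
// fixed-width (docId, score) pairs -- no markup, no text encoding.
public class BinaryDocListWriter {
    public static byte[] write(int[] docIds, float[] scores) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(docIds.length);        // count header
        for (int i = 0; i < docIds.length; i++) {
            out.writeInt(docIds[i]);        // 4 bytes vs dozens of bytes of XML
            out.writeFloat(scores[i]);
        }
        out.flush();
        return buf.toByteArray();
    }
}

Parsing on the receiving side is then just a readInt/readFloat loop instead
of a DOM or SAX pass over nShards*5009 XML entries.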

I'm thinking the load-balancing of HTTP is overrated also, because
it's inflexible.  Adding another shard requires adding another VIP in
the load-balancer, and changing which servers have which shards or
adding new copies of a shard also requires load-balancer
configuration.  Everything points to Solr being able to do the
load-balancing itself in the future, and there wouldn't seem to be
much benefit to using a load-balancer w/ VIPS for each shard vs having
Solr do it.

So even if we stuck with HTTP, Solr would need
 - a binary protocol to minimize network bandwidth use
 - load balancing across shard copies itself

Given that, would it make sense to just go with RMI instead?
And perhaps leverage some other higher level services (Jini? JavaSpaces?)

I'd like to hear from people with more experience with RMI & friends,
and what the potential downsides are to using these technologies.

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Walter Underwood
Please don't switch to RMI. We've spent the past year converting
our entire middle tier from RMI to HTTP. We are so glad that we
no longer have any RMI servers.

The big advantage of HTTP is that there are hundreds, maybe
thousands, of engineers working on making it fast, on tools for it,
on caches, etc.

If you really need more compact responses, I would recommend
coding the JSON output in Python marshal format. That is compact,
fast, and easy to parse. We used that for a Java client in Ultraseek.

wunder




Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
> Please don't switch to RMI. We've spent the past year converting
> our entire middle tier from RMI to HTTP. We are so glad that we
> no longer have any RMI servers.

Just to be clear for everyone, this wouldn't be a front-end change...
HTTP load balancer over top-level searches would still be the normal
way to do HA / query-load scaling.

This is more about traffic between Solr servers themselves for
distributed search (something that doesn't even exist yet).

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev
just a small notice regarding load balancing: we have used for a long time one 
simple scheme where all shards were identical.  With MMapDirectory, our 
"load balancer" had rather simple "clustering" logic to direct queries to 
preferred Index Searcher Units (when available; when not, to the first free 
neighbor). This simple clustering helped the OS cache the index much, much 
better ... super simple setup (identical shards), but a big win ...

the point I am trying to make here: with an http load balancer, something like 
this is rather difficult (I am not too familiar with http load balancers, 
but...)
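
A minimal sketch of that affinity idea (my reading of it -- the Searcher type
and the isBusy() check are illustrative assumptions, not our actual balancer):

import java.util.List;

// Route a query to a preferred replica (all replicas identical) so similar
// queries keep hitting the same node and its OS page cache stays warm;
// fall back to the first free neighbor when the preferred unit is busy.
class AffinityBalancer {
    private final List<Searcher> replicas;   // identical copies of the index

    AffinityBalancer(List<Searcher> replicas) {
        this.replicas = replicas;
    }

    Searcher pick(String query) {
        int preferred = (query.hashCode() & 0x7fffffff) % replicas.size();
        for (int i = 0; i < replicas.size(); i++) {
            Searcher s = replicas.get((preferred + i) % replicas.size());
            if (!s.isBusy()) return s;       // first free unit from the preferred one
        }
        return replicas.get(preferred);      // all busy: queue on the preferred unit
    }

    interface Searcher {
        boolean isBusy();
    }
}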




[jira] Resolved: (SOLR-362) bundle junit with solr

2007-09-21 Thread Erik Hatcher (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Hatcher resolved SOLR-362.
---

   Resolution: Fixed
Fix Version/s: 1.3
 Assignee: Erik Hatcher

Thanks Thorsten!

> bundle junit with solr
> --
>
> Key: SOLR-362
> URL: https://issues.apache.org/jira/browse/SOLR-362
> Project: Solr
>  Issue Type: Improvement
>Reporter: Thorsten Scherler
>Assignee: Erik Hatcher
> Fix For: 1.3
>
> Attachments: junit-4.3.jar, junit.include.diff
>
>
> http://marc.info/?t=11902336334&r=1&w=2
> ant -version
> Apache Ant version 1.7.0 compiled on December 13 2006
> Trivial modification to the build.xml to include junit with solr.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (SOLR-359) LukeResponse.FieldInfo should expose the selected flags info and the indexInfo

2007-09-21 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated SOLR-359:
-

Attachment: SOLR-359-v2.patch

This patch builds on the last one to provide map backed access to FieldType 
metadata in a similar vein to the current FieldInfo.  It adds the className of 
the FieldType into the LukeRequestHandler types NamedList.

> LukeResponse.FieldInfo should expose the selected flags info and the indexInfo
> --
>
> Key: SOLR-359
> URL: https://issues.apache.org/jira/browse/SOLR-359
> Project: Solr
>  Issue Type: Improvement
>  Components: clients - java
>Reporter: Grant Ingersoll
>Assignee: Grant Ingersoll
>Priority: Minor
> Attachments: SOLR-359-v2.patch, SOLR-359.patch
>
>
> The LukeResponse has access, when reading in the NamedList to more 
> information that is not currently being shared, like the "flags" for each 
> FieldInfo and/or the indexInfo object so that more info about the fields may 
> be used by the client.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Stu Hood
There is an additional limitation to the current HTTP implementation, in that 
it does a boolean OR of all the uniq keys it needs to fetch on remote shards. 
Once you query for ~100 or so documents, the URL can grow so long that Java 
refuses to open it. To make it work we'd need to be able to POST to Solr for 
results, which I don't think is possible yet...
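
If POST did work, the client side would be simple enough -- a minimal sketch,
assuming a /select endpoint that accepted form-encoded bodies:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Move the long OR'd key list out of the query string and into the POST
// body, which has no practical length limit.
public class PostQuery {
    public static int send(String solrUrl, String q) throws Exception {
        String body = "q=" + URLEncoder.encode(q, "UTF-8") + "&wt=xml";
        HttpURLConnection con =
            (HttpURLConnection) new URL(solrUrl + "/select").openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-Type",
            "application/x-www-form-urlencoded");
        OutputStream out = con.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();
        return con.getResponseCode();
    }
}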

I completely agree that using some kind of remote procedure call for queries to 
other shards makes sense.

Perhaps we should attempt to "merge" the RMI backend of SOLR-255 with the high 
level approach taken in SOLR-303? Or do you have more experience with other 
brands of RPC?

Thanks a lot,
Stu





Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Mike Klaas

On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:


> I wanted to take a step back for a second and think about if HTTP was
> really the right choice for the transport for distributed search.
>
> I think the high-level approach in SOLR-303 is the right way to go
> about it, but I'm unsure if HTTP is the right transport.


I don't know anything about RMI, but is it possible to do 100's of  
simultaneous asynchronous requests cheaply?  (It is with http, as you  
can use a select() loop on the sockets).
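
Something like this rough NIO sketch -- the port, the /solr/select URL, and
the HTTP/1.0 handling are illustrative assumptions, and the single write
assumes the small request fits in one go:

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// One thread multiplexing many non-blocking HTTP requests via a selector,
// instead of one thread per outstanding connection.
public class SelectLoop {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        for (String host : args) {                        // one channel per shard
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);
            ch.connect(new InetSocketAddress(host, 8983));
            ch.register(selector, SelectionKey.OP_CONNECT);
        }
        ByteBuffer buf = ByteBuffer.allocate(8192);
        while (!selector.keys().isEmpty()) {
            selector.select();                            // block until any socket is ready
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                SocketChannel ch = (SocketChannel) key.channel();
                if (key.isConnectable() && ch.finishConnect()) {
                    ch.write(ByteBuffer.wrap(
                        "GET /solr/select?q=*:* HTTP/1.0\r\n\r\n".getBytes("US-ASCII")));
                    key.interestOps(SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    buf.clear();
                    if (ch.read(buf) < 0) ch.close();     // EOF: done, key is cancelled
                    // a real client would accumulate and parse the bytes here
                }
            }
        }
    }
}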


FWIW, our distributed search uses http over 120+ shards... and is  
written in python.  We will likely end up using a java solution  
eventually, though--despite great optimization efforts, python is  
having trouble keeping up.


-Mike


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, Stu Hood <[EMAIL PROTECTED]> wrote:
> Perhaps we should attempt to "merge" the RMI backend of SOLR-255 with the 
> high level approach taken in SOLR-303? Or do you have more experience with 
> other brands of RPC?

I think any approach is best done at that high level... so an RMI
interface would share some of the same logic with the dispatch filter.
It's the high-level coarse-grained approach that will get us good
performance... trying to pretend that remote objects are local and
executing the same logic one would in a single server isn't a good
idea.

I have very little RMI experience... which is why I initially thought
of HTTP too, but that's not the best excuse :-)

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev
why not take a look at hadoop's IPC/RPC? it is small, simple and elegant, 
with none of the latency we saw with RMI (we could not do better than 30-50ms 
per hop). Nutch uses it 







Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:
>
> > I wanted to take a step back for a second and think about if HTTP was
> > really the right choice for the transport for distributed search.
> >
> > I think the high-level approach in SOLR-303 is the right way to go
> > about it, but I'm unsure if HTTP is the right transport.
>
> I don't know anything about RMI, but is it possible to do 100's of
> simultaneous asynchronous requests cheaply?

Good question... probably only important for really big clusters (like
yours), but it would be nice.

Even if we go HTTP, I'm not sure it will be async at first - does
HTTPClient even support async?

I assume when you say async that you mean getting rid of the
thread-per-connection via NIO.  Some protocols do "async" by handing
off the request to another thread to wait on the response and then do
a callback to the original thread - this is async with respect to the
original calling thread, but still requires a thread-per-connection.
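
A toy illustration of that handoff pattern (names hypothetical, not Solr
code): the caller gets a callback, but one pool thread still blocks per
outstanding request.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HandoffAsync {
    interface Callback { void onResponse(String response); }

    private final ExecutorService pool = Executors.newCachedThreadPool();

    void request(final String shardUrl, final Callback cb) {
        pool.submit(new Runnable() {
            public void run() {
                // blockingFetch stands in for a plain blocking HTTP call;
                // this worker thread is tied up for the life of the request.
                cb.onResponse(blockingFetch(shardUrl));
            }
        });
    }

    private static String blockingFetch(String url) {
        return "response from " + url;    // placeholder for the real exchange
    }
}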

Of course HTTP has some issues too - you effectively need a separate
connection per outstanding request.  Pipelining won't work well
because things need to come back in-order.  I'm not sure if RMI has
this limitation as well.

> FWIW, our distributed search uses http over 120+ shards... and is
> written in python.

That would be an awesome test case if you were able to use what Solr
is going to provide out-of-the-box.  Any unusual requirements?

-Yonik


[jira] Updated: (SOLR-303) Federated Search over HTTP

2007-09-21 Thread Stu Hood (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stu Hood updated SOLR-303:
--

Attachment: fedsearch.stu.patch

Here is another revision of the latest patch (I've still only tried it with 
r574785: I'm a bit crunched for time).

*Resolved issues:*
* We were forgetting to increment a counter during the last step in 
SecondQPhaseComponent.process, and so we weren't getting results from all 
shards.
* SecondQPhaseComponent.merge was throwing away any fields that already existed 
in a document, and so it was throwing away parts of multi-value fields. Fixing 
this exposed the first issue listed below.
* MultiSearchRequestHandler was creating non-daemon threads (the default) for 
the thread pool. This meant that when the JVM died, the threads were sticking 
around. I added a ThreadFactory that creates daemonized threads.
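
A minimal sketch of that kind of factory (illustrative; the actual patch code
may differ):

import java.util.concurrent.ThreadFactory;

// Mark pool threads as daemons so they cannot keep the JVM alive on shutdown.
class DaemonThreadFactory implements ThreadFactory {
    public Thread newThread(Runnable r) {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    }
}

It would be handed to something like
Executors.newFixedThreadPool(THREAD_POOL_SIZE, new DaemonThreadFactory()).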

*Open issues:*
* The 'local' shard is ignoring the 'FL' parameter during the FirstQueryPhase, 
and returning the entire document. We then try and merge the document into 
itself in SecondQPhaseComponent.merge, causing a 
ConcurrentModificationException. For now, I put a check for "newDoc != oldDoc", 
but I think we need to figure out why the local query is returning full 
documents.
* Range queries are broken (probably due to the extract terms phase failing)
* 'start' and 'numfound' are incorrect when returned to the user
** start is getting wiped out somewhere
** numfound is counting all copies of matches for a uniqKey towards the total
* MultiSearchRequestHandler.THREAD_POOL_SIZE and 
MultiSearchRequestHandler.REQUEST_TIME_OUT_IN_MS should be configuration 
parameters in solrconfig.xml.

Thanks a lot!

> Federated Search over HTTP
> --
>
> Key: SOLR-303
> URL: https://issues.apache.org/jira/browse/SOLR-303
> Project: Solr
>  Issue Type: New Feature
>  Components: search
>Reporter: Sharad Agarwal
>Priority: Minor
> Attachments: fedsearch.patch, fedsearch.patch, fedsearch.patch, 
> fedsearch.stu.patch, fedsearch.stu.patch
>
>
> Motivated by http://wiki.apache.org/solr/FederatedSearch
> "Index view consistency between multiple requests" requirement is relaxed in 
> this implementation.
> Does the federated search query side; update is not yet done.
> Tries to achieve:-
> 
> - The client applications are totally agnostic to federated search. The 
> federated search and merging of results happen totally behind the scenes in 
> Solr, in the request handler. The response format remains the same after 
> merging of results.
> The response from each individual shard is deserialized into a 
> SolrQueryResponse object. The collection of SolrQueryResponse objects is 
> merged to produce a single SolrQueryResponse object. This enables use of the 
> Response writers as they are, or with minimal change.
> - Efficient query processing, with highlighting and fields getting generated 
> only for merged documents. The query is executed in 2 phases. The first phase 
> gets the doc unique keys with sort criteria. The second phase brings all 
> requested fields and highlighting information. This saves a lot of CPU when 
> there are a good number of shards and highlighting info is requested.
> Should be easy to customize the query execution. For example: a user can 
> specify executing the query in just 1 phase. (For some queries, when 
> highlighting info is not required and the number of fields requested is 
> small, this can be more efficient.)
> - Ability to easily override the default federated capability by appropriate 
> plugins and request parameters. As federated search is performed by the 
> RequestHandler itself, multiple request handlers can easily be pre-configured 
> with different federated search settings in solrconfig.xml
> - Global weight calculation is done by querying the terms' doc frequencies 
> from all shards.
> - Federated search works over Http transport, so an individual shard's VIP 
> can be queried. Load-balancing and fail-over are taken care of by the VIP as 
> usual.
> - Sub-searcher response parsing is a plugin interface. Different 
> implementations could be written based on JSON, XML SAX, etc. The current one 
> is based on XML DOM.
> HOW:
> ---
> A new RequestHandler called MultiSearchRequestHandler does the federated 
> search on multiple sub-searchers (referred to as "shards" going forward). It 
> extends RequestHandlerBase. The handleRequestBody method in 
> RequestHandlerBase has been divided into query building and execute methods. 
> This has been done to calculate global numDocs and docFreqs, and to execute 
> the query efficiently on multiple shards.
> All the "search" request handlers are expected to extend the 
> MultiSearchRequestHandler class in order to enable federated capability for 
> the handler. StandardRequestHandler and DisMaxRequestHandler have been 
> changed to extend this class.
>  
> The fed

Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Stu Hood
Oddly enough, the system we are integrating Solr into already uses Hadoop on 
every node. It is definitely worth checking out their IPC/RPC.

I wonder how easy it is to use Hadoop RPC code without running any of the 
porcelain around it...?

Thanks,
Stu





Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Yonik Seeley
On 9/21/07, eks dev <[EMAIL PROTECTED]> wrote:
> why not take a look at hadoop's IPC/RPC? it is small, simple and elegant 
> with none of the latency we saw with RMI (we could not do better than 30-50ms per hop)

Interesting... I wonder if that 30-50ms is due to Nagle (which always
seemed to cause a 40ms delay w/ python's http lib on Linux when doing
a POST since the headers were sent separately from the body).
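
For reference, disabling Nagle on a JDK socket is a one-liner (generic JDK
usage, not code from either project):

import java.io.IOException;
import java.net.Socket;

public class NoNagle {
    public static Socket connect(String host, int port) throws IOException {
        Socket s = new Socket(host, port);
        s.setTcpNoDelay(true);   // send small writes immediately, no Nagle batching
        return s;
    }
}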

I've seen reports that RMI was faster:
http://mail-archives.apache.org/mod_mbox/lucene-solr-dev/200609.mbox/[EMAIL 
PROTECTED]

-Yonik


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread Mike Klaas

On 21-Sep-07, at 2:34 PM, Yonik Seeley wrote:


> On 9/21/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> > On 21-Sep-07, at 11:08 AM, Yonik Seeley wrote:
> > >
> > > I wanted to take a step back for a second and think about if HTTP was
> > > really the right choice for the transport for distributed search.
> > >
> > > I think the high-level approach in SOLR-303 is the right way to go
> > > about it, but I'm unsure if HTTP is the right transport.
> >
> > I don't know anything about RMI, but is it possible to do 100's of
> > simultaneous asynchronous requests cheaply?
>
> Good question... probably only important for really big clusters (like
> yours), but it would be nice.
>
> Even if we go HTTP, I'm not sure it will be async at first - does
> HTTPClient even support async?


I don't think so.  In fact, I need to make a small amendment to my 
original claim: the distribution code actually uses our internal rpc 
(which is pure python), but the other end is a python client that 
connects with solr via http (persistent, localhost connection).  I 
wrote it this way because it was easier, as our internal rpc library 
already has functionality for spitting out requests to 100's of 
clients and collecting the results asynchronously.  I figured that 
directly connecting to Solr via http would be cheaper, but perhaps it 
wouldn't be.


Both the rpc and http levels use connection-pooled persistent  
connections.



> I assume when you say async that you mean getting rid of the
> thread-per-connection via NIO.  Some protocols do "async" by handing
> off the request to another thread to wait on the response and then do
> a callback to the original thread - this is async with respect to the
> original calling thread, but still requires a thread-per-connection.


Right; this helps but doesn't scale too far.


> Of course HTTP has some issues too - you effectively need a separate
> connection per outstanding request.  Pipelining won't work well
> because things need to come back in-order.  I'm not sure if RMI has
> this limitation as well.



> > FWIW, our distributed search uses http over 120+ shards... and is
> > written in python.
>
> That would be an awesome test case if you were able to use what Solr
> is going to provide out-of-the-box.  Any unusual requirements?


The biggest point of customization is that we run two Solrs in a  
single webapp, one for querying and one for highlighting.  The  
highlighter Solr uses a set of custom parameters to determine the  
docs to use (I imagine the current patch does something like this as  
well).  Splitting the content from the rest of the stored fields is a  
huge win.  There is also lots of custom deduplication and caching  
logic, but this could be done as a post-processing step.


In case anyone is thinking of building something this huge, I'll 
mention that it is a bad idea to try to have a single point try to 
manage so many shards.  It is preferable to go hierarchical (could be 
accomplished relatively easily if the query distributor could easily 
query other query distributor nodes).


-Mike


Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev


>I wonder how easy it is to use Hadoop RPC code without running any of the 
>porcelain around it...?

very easy!








Re: HTTP or RMI, Jini, JavaSpaces for distributed search

2007-09-21 Thread eks dev

>Interesting... I wonder if that 30-50ms is due to Nagle (which always
>seemed to cause a 40ms delay w/ python's http lib on Linux when doing
>a POST since the headers were sent separately from the body).

it's been a while since we played with it, a year or two ago; maybe something 
changed in the meantime, or we did it wrong then. But I am sure we tried to 
disable the Nagle optimization (tcp_no_delay)... anyhow, hadoop's code did the 
trick, I guess it is even better now












[jira] Commented: (SOLR-215) Multiple Solr Cores - remove static singleton

2007-09-21 Thread Henri Biestro (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529625
 ] 

Henri Biestro commented on SOLR-215:


Replacing the line

  SolrEventListener listener = (SolrEventListener) solrConfig.newInstance(className);

with

  SolrEventListener listener = createEventListener(className);

should fix it.
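
A plausible shape for that helper as a SolrCore method (a sketch only; the
actual patch may differ):

import java.lang.reflect.Constructor;

// Look up the (SolrCore) constructor instead of relying on a no-arg
// newInstance(), and pass in the owning core.
public SolrEventListener createEventListener(String className) {
    try {
        Class<?> clazz = Class.forName(className);
        Constructor<?> ctor = clazz.getConstructor(SolrCore.class);
        return (SolrEventListener) ctor.newInstance(this);
    } catch (Exception e) {
        throw new RuntimeException("could not create listener " + className, e);
    }
}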

> Multiple Solr Cores - remove static singleton
> -
>
> Key: SOLR-215
> URL: https://issues.apache.org/jira/browse/SOLR-215
> Project: Solr
>  Issue Type: Improvement
>Reporter: Henri Biestro
>Priority: Minor
> Fix For: 1.3
>
> Attachments: solr-215.patch, solr-215.patch, solr-215.patch, 
> solr-215.patch, solr-215.patch.zip, solr-215.patch.zip, solr-215.patch.zip, 
> solr-215.patch.zip, solr-215.patch.zip, solr-215.patch.zip, 
> solr-215.patch.zip, solr-trunk-533775.patch, solr-trunk-538091.patch, 
> solr-trunk-542847-1.patch, solr-trunk-542847.patch, solr-trunk-src.patch
>
>
> WHAT:
> As of 1.2, Solr only instantiates one SolrCore which handles one Lucene index.
> This patch is intended to allow multiple cores in Solr which also brings 
> multiple indexes capability.
> The patch file to grab is solr-215.patch.zip (see MISC section below).
> WHY:
> The current Solr practical wisdom is that one schema - thus one index - is 
> most likely to accommodate your indexing needs, using a filter to segregate 
> documents if needed. If you really need multiple indexes, deploy multiple web 
> applications.
> There are some use cases however where having multiple indexes or multiple 
> cores through Solr itself may make sense.
> Multiple cores:
> Deployment issues within some organizations where IT will resist deploying 
> multiple web applications.
> Seamless schema update where you can create a new core and switch to it 
> without starting/stopping servers.
> Embedding Solr in your own application (instead of 'raw' Lucene) and 
> functionally needing to segregate schemas & collections.
> Multiple indexes:
> Multiple language collections where each document exists in different 
> languages, analysis being language dependent.
> Having document types that have nothing (or very little) in common with 
> respect to their schema, their lifetime/update frequencies or even collection 
> sizes.
> HOW:
> The best analogy is to consider that instead of deploying multiple 
> web-application, you can have one web-application that hosts more than one 
> Solr core. The patch does not change any of the core logic (nor the core 
> code); each core is configured & behaves exactly as the one core in 1.2; the 
> various caches are per-core & so is the info-bean-registry.
> What the patch does is replace the SolrCore singleton by a collection of 
> cores; all the code modifications are driven by the removal of the different 
> singletons (the config, the schema & the core).
> Each core is 'named' and a static map (keyed by name) allows them to be 
> managed easily.
> You declare one servlet filter mapping per core you want to expose in the 
> web.xml; this allows easy access to each core through a different url. 
> USAGE (example web deployment, patch installed):
> Step0
> java -Durl='http://localhost:8983/solr/core0/update' -jar post.jar solr.xml 
> monitor.xml
> Will index the 2 documents in solr.xml & monitor.xml
> Step1:
> http://localhost:8983/solr/core0/admin/stats.jsp
> Will produce the statistics page from the admin servlet on core0 index; 2 
> documents
> Step2:
> http://localhost:8983/solr/core1/admin/stats.jsp
> Will produce the statistics page from the admin servlet on core1 index; no 
> documents
> Step3:
> java -Durl='http://localhost:8983/solr/core0/update' -jar post.jar ipod*.xml
> java -Durl='http://localhost:8983/solr/core1/update' -jar post.jar mon*.xml
> Adds the ipod*.xml to index of core0 and the mon*.xml to the index of core1;
> running queries from the admin interface, you can verify indexes have 
> different content. 
> USAGE (Java code):
> //create a configuration
> SolrConfig config = new SolrConfig("solrconfig.xml");
> //create a schema
> IndexSchema schema = new IndexSchema(config, "schema0.xml");
> //create a core from the two others.
> SolrCore core = new SolrCore("core0", "/path/to/index", config, schema);
> //Accessing a core:
> SolrCore core = SolrCore.getCore("core0"); 
> PATCH MODIFICATIONS DETAILS (per package):
> org.apache.solr.core:
> The heaviest modifications are in SolrCore & SolrConfig.
> SolrCore is the most obvious modification; instead of a singleton, there is a 
> static map of cores keyed by names and assorted methods. To retain some 
> compatibility, the 'null' named core replaces the singleton for the relevant 
> methods, for instance SolrCore.getCore(). One small constraint on the core 
> name is they can't contain '/' or '\' avoiding pote