How to stop solr/jetty

2007-03-05 Thread Jack L
Hello,

I guess this is more of a (naive) jetty question - I'm running
a modified instance of the "example" project with jetty
on a linux server. How should I stop this jetty instance?
Note that I may have multiple jetty instances and other
java processes running and I'd like to stop a particular
instance of solr.

-- 
Thanks,
Jack



Re: How to stop solr/jetty

2007-03-05 Thread Otis Gospodnetic
Ctrl-c if you still have it on the command line, or kill PID (which you can see 
in either top or by doing a ps auxwww | grep java).
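For example, if you started the example with "java -jar start.jar", something
like this should work (just a sketch -- adjust the grep pattern to whatever
distinguishes the particular instance you want to stop):

  # find the jetty/solr process you care about
  ps auxwww | grep start.jar | grep -v grep

  # stop it by PID, e.g.
  kill 12345

  # and only if it refuses to exit:
  kill -9 12345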

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: Jack L <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, March 5, 2007 5:21:56 AM
Subject: How to stop solr/jetty

Hello,

I guess this is more of a (naive) jetty question - I'm running
a modified instance of the "example" project with jetty
on a linux server. How should I stop this jetty instance?
Note that I may have multiple jetty instances and other
java processes running and I'd like to stop a particular
instance of solr.

-- 
Thanks,
Jack






solr data serialization (python,json,ruby) as a standalone module

2007-03-05 Thread Xiaoming Liu

hi,

I am trying to use Solr's data serialization module in a separate project,
such as serializing my own Java objects to xml, python, json, or ruby
formats.

I had a look at org.apache.solr.request and figured out what I have to do to
use Solr's serialization in a standalone way:

-- Turn the Java object into an org.apache.solr.util.NamedList object

-- Extend SolrQueryRequestBase to create an empty SolrQueryRequest object
class EmptySolrQueryRequest extends SolrQueryRequestBase {
    public EmptySolrQueryRequest() {
        super(null, null);    // no core, no params -- only used for serialization
    }

    public IndexSchema getSchema() {
        return null;          // not needed when only serializing NamedLists
    }

    public SolrIndexSearcher getSearcher() {
        return null;          // likewise, there is no index behind this request
    }

    public String getParam(String input) {
        // the only param the writers care about here is "indent"
        if ("indent".equals(input))
            return "true";
        else
            return null;
    }
}

-- create a SolrQueryResponse object with my NamedList object and serialize
it
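
Roughly, that last step looks like this (a rough, untested sketch -- the class
locations and the writer init are from memory and may be slightly off):

import java.io.StringWriter;
import java.io.Writer;

import org.apache.solr.util.NamedList;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.request.QueryResponseWriter;
import org.apache.solr.request.JSONResponseWriter;

public class StandaloneSerializer {
    public static void main(String[] args) throws Exception {
        // step 1: put my data into a NamedList
        NamedList data = new NamedList();
        data.add("id", "doc-1");
        data.add("title", "hello world");

        // step 3: wrap it in a SolrQueryResponse and hand it to a writer
        SolrQueryResponse rsp = new SolrQueryResponse();
        rsp.add("mydata", data);

        QueryResponseWriter writer = new JSONResponseWriter();
        writer.init(new NamedList());            // no writer-specific config

        Writer out = new StringWriter();
        writer.write(out, new EmptySolrQueryRequest(), rsp);
        System.out.println(out.toString());      // the json-serialized NamedList
    }
}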



The whole process is pretty straightforward and doesn't need much code, as
Solr is clean in handling serialization, though I still need to poke inside
Solr to write the code. So I wonder: am I doing this the right way? And is
there a plan to pull that part of the code out and make it more
straightforward for other applications to use?

thanks,
xiaoming


Re: Federated Search

2007-03-05 Thread Tim Patton



Venkatesh Seetharam wrote:

Hi Tim,

Howdy. I saw your post on the Solr newsgroup and it caught my attention. I'm 
working on a similar problem for searching a vault of over 100 million 
XML documents. I already have the encoding part done using Hadoop and 
Lucene. It works like a charm. I create N index partitions and have 
been trying to wrap Solr to search each partition, with a search broker 
that merges the results and returns them.


I'm curious about how you have solved the distribution of additions, 
deletions and updates to each of the indexing servers. I use a 
partitioner based on a hash of the document id. Do you broadcast to the 
slaves as to who owns a document?


Also, I'm looking at Hadoop RPC and ICE ( www.zeroc.com 
) for distributing the search across these Solr 
servers. I'm not using HTTP.


Any ideas are greatly appreciated.

PS: I did subscribe to the solr newsgroup now but did not receive a 
confirmation, and hence am sending it to you directly.


--
Thanks,
Venkatesh

"Perfection (in design) is achieved not when there is nothing more to 
add, but rather when there is nothing more to take away."

- Antoine de Saint-Exupéry



I used a SQL database to keep track of which server had which document. 
Then I originally used JMS and would use a selector for which server 
number the document should go to.  I switched over to a home-grown, 
lightweight message server since JMS behaves really badly when it backs 
up and I couldn't find a server that would simply pause the producers if 
there was a problem with the consumers.  Additions are pretty much 
assigned randomly to whichever server gets them first.  At this point I 
am up to around 20 million documents.


The hash idea sounds really interesting and if I had a fixed number of 
indexes it would be perfect.  But I don't know how big the index will 
grow and I wanted to be able to add servers at any point.  I would like 
to eliminate any outside dependencies (SQL, JMS), which is why a 
distributed Solr would let me focus on other areas.
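
To spell out the scheme (just a sketch, not code from either of our systems):
documents are routed by a hash of their id across a fixed number of partitions,
which is also why adding a server forces a re-shuffle.

// Sketch only: route a document to one of N index partitions by hashing its id.
public class HashPartitioner {
    private final int numPartitions;

    public HashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // deterministic: the same docId always lands on the same partition,
    // so updates/deletes for that id can go straight to its owner
    public int partitionFor(String docId) {
        int h = docId.hashCode();
        return (h & Integer.MAX_VALUE) % numPartitions;   // non-negative index
    }
}

// e.g. new HashPartitioner(4).partitionFor("doc-123")
// Note: if numPartitions changes (a server is added), most ids map to a
// different partition, which is why this scheme implies re-indexing.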


How did you work around not being able to update a lucene index that is 
stored in Hadoop?  I know there were changes in Lucene 2.1 to support 
this but I haven't looked that far into it yet, I've just been testing 
the new IndexWriter.  As an aside, I hope those features can be used by 
Solr soon (if they aren't already in the nightlies).


Tim



Re: JVM random crashes

2007-03-05 Thread Bill Au

Seems like this may be a JVM bug:

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6500147

http://forum.java.sun.com/thread.jspa?threadID=659990&messageID=3876052

Have you tried using a different garbage collector?

Bill

On 3/3/07, Jed Reynolds <[EMAIL PROTECTED]> wrote:


Yonik Seeley wrote:
> On 3/3/07, Dimitar Ouzounov <[EMAIL PROTECTED]> wrote:
>> But what hardware problem could it be? Tomorrow I'll make sure that the
>> memory is fine, but nothing
>> else comes to my mind.
>
> Memory, motherboard, etc.
> Try http://www.memtest86.com/ to test this.
>
>> It may be OS-related - probably a buggy version of
>> some library. But which library?
>
> Yep, we've seen that in the past.
> I'd recommend going with OS versions that vendors test with.
> The commercial RHEL or the free clone of it http://www.centos.org/,
> would be my recommendation.
>

I'm running a lot of CentOS 4.4 myself, on i686 and x86_64 processors.
I'm testing out Solr on an i686 with JDK 1.5 and I'm running a
production copy of Nutch on x86_64 JDK 1.5, Tomcat 1.5. It's been rock
solid.

From trying to install Java in the past on FC5, I read a lot about how
you had to be rather careful to make absolutely certain that you had no
conflicting gcj libs in your path. If this is a production box, I'd go
with a longer-supported OS than FC6. If the server is only for searching
and apache, I don't think FC6 will give you any noticeable performance
boost over CentOS 4.4. FC6's performance enhancements with
glibc-hash-binding won't affect a JVM.


Jed



Re[2]: How to stop solr/jetty

2007-03-05 Thread Jack L
Hello Otis,

Thanks. The ps command works well.

-- 
Best regards,
Jack

Monday, March 5, 2007, 2:26:24 AM, you wrote:

> Ctrl-c if you still have it on the command line, or kill PID
> (which you can see in either top of by doing a ps auxwww| grep java).

> Otis
>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

> - Original Message 
> From: Jack L <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Monday, March 5, 2007 5:21:56 AM
> Subject: How to stop solr/jetty

> Hello,

> I guess this is more of a (naive) jetty question - I'm running
> a modified instance of the "example" project with jetty
> on a linux server. How should I stop this jetty instance?
> Note that I may have multiple jetty instances and other
> java processes running and I'd like to stop a particular
> instance of solr.




Re[2]: Federated Search

2007-03-05 Thread Jack L
This is a very interesting discussion. I have a few questions while
reading Tim's and Venkatesh's emails:

To Tim:
1. is there any reason you don't want to use HTTP? Since solr has
   an HTTP interface already, I suppose using HTTP is the simplest
   way to communicate with the solr servers from the merger/search broker.
   hadoop and ice would both require some additional work - this is
   if you are using solr and not Lucene directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?
   
To Venkatesh:
1. I suppose solr is ok to handle 20 million documents - I hope I'm
   right because that's what I'm planning on doing :) Is it because
   of storage capacity that you chose to use multiple solr
   servers?

An open question: what's the best way to manage server addition?
- If a hash value-based partitioning is used, re-indexing all
  the documents will be needed.
- Otherwise, a database seems to be required to track the documents.

-- 
Best regards,
Jack

Monday, March 5, 2007, 7:47:36 AM, you wrote:



> Venkatesh Seetharam wrote:
>> Hi Tim,
>> 
>> Howdy. I saw your post on Solr newsgroup and caught my attention. I'm
>> working on a similar problem for searching a vault of over 100 million
>> XML documents. I already have the encoding part done using Hadoop and
>> Lucene. It works like a  charm. I create N index partitions and have
>> been trying to wrap Solr to search each partition, have a Search broker
>> that merges the results and returns.
>> 
>> I'm curious about how have you solved the distribution of additions,
>> deletions and updates to each of the indexing servers.I use a 
>> partitioner based on a hash of the document id. Do you broadcast to the
>> slaves as to who owns a document?
>> 
>> Also, I'm looking at Hadoop RPC and ICE ( www.zeroc.com 
>> ) for distributing the search across these Solr
>> servers. I'm not using HTTP.
>> 
>> Any ideas are greatly appreciated.
>> 
>> PS: I did subscribe to solr newsgroup now but  did not receive a 
>> confirmation and hence sending it to you directly.
>> 
>> -- 
>> Thanks,
>> Venkatesh
>> 
>> "Perfection (in design) is achieved not when there is nothing more to
>> add, but rather when there is nothing more to take away."
>> - Antoine de Saint-Exupéry


> I used a SQL database to keep track of which server had which document.
> Then I originally used JMS and would use a selector for which server
> number the document should go to.  I switched over to a home grown, 
> lightweight message server since JMS behaves really badly when it backs
> up and I couldn't find a server that would simply pause the producers if
> there was a problem with the consumers.  Additions are pretty much 
> assigned randomly to whichever server gets them first.  At this point I
> am up to around 20 million documents.

> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.  But I don't know how big the index will
> grow and I wanted to be able to add servers at any point.  I would like
> to eliminate any outside dependencies (SQL, JMS), which is why a 
> distributed Solr would let me focus on other areas.

> How did you work around not being able to update a lucene index that is
> stored in Hadoop?  I know there were changes in Lucene 2.1 to support
> this but I haven't looked that far into it yet, I've just been testing
> the new IndexWriter.  As an aside, I hope those features can be used by
> Solr soon (if they aren't already in the nightlys).

> Tim



Re: logging off

2007-03-05 Thread Bill Au

FYI, the admin page has a link, [LOGGING], that can be used to change Solr's
logging on the fly.

Bill

On 3/4/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: Hi Brian, all you have to do is create a logging.properties file and
: call this before starting up solr:
:
: System.setProperty("java.util.logging.config.file", home+"/conf/
: logging.properties");

it's not necessary to execute any java code to configure
java.util.logging ... that property can be set on the command line before
executing java.
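
for example (a sketch -- adjust the path to wherever your logging.properties
actually lives):

  java -Djava.util.logging.config.file=/path/to/conf/logging.properties -jar start.jar

and a minimal logging.properties that sends everything to a file instead of
the console:

  # use only a FileHandler, so nothing goes to the console
  handlers = java.util.logging.FileHandler
  java.util.logging.FileHandler.pattern = /var/log/solr/solr.log
  java.util.logging.FileHandler.level = INFO
  java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter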

: And that will disable console logging. For jetty logging, you need to
: create a custom Log class that looks like this

i'm no jetty expert, but this should also be controllable via jetty
configs, without writing any custom code .. grepping the xml files in the
example/etc for "log" should turn up a lot of pointers ... much of what is
there is commented out by default -- and i believe that's why it goes to
STDOUT instead of specific files.


-Hoss




Re[2]: Federated Search

2007-03-05 Thread Tim Patton



Jack L wrote:

This is very interesting discussion. I have a few question while
reading Tim and Venkatesh's email:

To Tim:
1. is there any reason you don't want to use HTTP? Since solr has
   an HTTP interface already, I suppose using HTTP is the simplest
   way to communicate the solr servers from the merger/search broker.
   hadoop and ice would both require some additional work - this is
   if you are using solr and not lucent directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?
   
To Venkatesh:

1. I suppose solr is ok to handle 20 million document - I hope I'm
   right because that's what I'm planning on doing :) Is it because
   of storage capacity why you you choose to use multiple solr
   servers?

An open question: what's the best way to manage server addition?
- If a hash value-based partitioning is used, re-indexing all
  the document will be needed.
- Otherwise, a database seems to be required to track the documents.



Jack,

My big stumbling blocks were with indexing more so than searching.  I 
did put together an RMI-based system to search multiple lucene servers, 
and the searchers don't need to know where everything is.  However, 
with indexing, at some point something needs to know where to send the 
documents for updating, or who to tell to delete a document, whether it 
is the server that does the processing or some sort of broker.  The 
processing machines could do the DB lookup and talk to Solr over HTTP 
no problem, and this is part of what I am considering doing.  However, I 
have some extra code on the indexing machines to handle DB updates 
etc., though I might find a way to move this elsewhere in the system 
so I can have pretty much a pure solr server with just a few custom 
items (like my own Similarity or QueryParser).


I suppose the DB could be moved to lucene from SQL in the future as well.



Dynamic RequestHandler loading

2007-03-05 Thread Ryan McKinley

I know I'm pushing solr to do things it was never designed to do, so
shut me up quick if this is not where you want things to go - I could
quietly implement this with quick hacks, but i'd rather not...

Currently SolrCore loads all the request handlers in a final variable
as the instance is initialized:

private final RequestHandlers reqHandlers = new
RequestHandlers(SolrConfig.config);

This is a little strange because it forces the request handlers to be
loaded as the instance is initialized - not in a more appropriate
place like after initWriters() in the constructor.  The bad side
effects of this are that handlers cannot know about the schema or
config directory in the init() method.  They are forced to do some
sort of lazy loading.

The other thing I would like is to be able to dynamically register
handlers.  For example, i have one handler that wants to register 6
related handlers (each one maps a different action - it only makes
sense to have them as a collection).  While i could put six entries in
solrconfig, this will quickly become unruly.

How do you feel about:

1. move the request handler initialization into the constructor after
initWriters()?

2. Exposing the following functions in SolrCore:

// this already exists
 SolrRequestHandler getRequestHandler(String handlerName);

// this will register the handler and return whatever used to be at
// that path (or null)
 SolrRequestHandler registerRequestHandler(String handlerName,
     SolrRequestHandler handler);

// get all the registered handlers by class
 Collection<SolrRequestHandler> getRequestHandlers( Class<? extends SolrRequestHandler> clazz );

3. Give request handlers a standard way to know what path/name they
are registered to.  The simplest way i can think to do this is to add
parameters to the NamedList passed to init()
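
To make (2) a little more concrete, the kind of thing i want to be able to
write looks roughly like this (names made up; it assumes the old
SolrCore.getSolrCore() singleton plus the proposed methods above):

  // hypothetical usage of the proposed API -- none of this exists yet
  SolrCore core = SolrCore.getSolrCore();

  // only register the related handlers if nothing of that type is there yet
  if (core.getRequestHandlers(ActionRequestHandler.class).isEmpty()) {
    core.registerRequestHandler("/action/create", new ActionRequestHandler("create"));
    core.registerRequestHandler("/action/update", new ActionRequestHandler("update"));
    core.registerRequestHandler("/action/delete", new ActionRequestHandler("delete"));
  }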


thoughts?

ryan


Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-05 Thread Chris Hostetter

: I had the mm (minimum match) param blow up at query time with a number
: format exception (this is from memory).

That's a RequestHandler specific request param that can also be specified
as a default/invariant/appended init param ... i'm not sure that SolrCore
could do much to validate that when parsing the solrconfig.xml.
DisMaxRequestHandler could possibly throw an exception from its init
method if it sees a param it recognizes but can't parse ... but that's a
dangerous road to go down ... what if i want to subclass
DisMaxRequestHandler and change the format of the "mm" param?

One thing you could do to ensure that your RequestHandler configuration
makes sense without waiting for an error generated by a request, is to put
in some explicit cache warming as part of the firstSearcher listener that
hits each configured requestHandler with the minimal amount of input you
expect ...  then you'll see an error in your log immediately
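
e.g. something along these lines in solrconfig.xml (just a sketch: one <lst>
per handler, with whatever q/qt your handlers actually expect -- i believe the
qt param is what picks the handler for these local warming requests):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst> <str name="qt">dismax</str>   <str name="q">solr</str> </lst>
      <lst> <str name="qt">standard</str> <str name="q">solr</str> </lst>
    </arr>
  </listener>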


: I had a silent error that I can't remember the details of, but it
: was something like putting the <str> for boost functions outside
: the <lst name="defaults">. It didn't blow up, but it was a nonsense
: config that was accepted.

again, there's nothing erroneous about having a <str> outside of a <lst>
when specifying the init params of a RequestHandler as far as SolrCore is
concerned ... it has no idea what types of init params the RequestHandler
wants ... and the StandardRequestHandler could say that if it sees
any top level init params which aren't "defaults", "invariants" or
"appended" then it could complain ... but again: what if i subclass
StandardRequestHandler and i want to add some custom init param to
determine behavior in my subclass?


-Hoss



Re: 'accumulate' copyField for faceting

2007-03-05 Thread Mike Klaas

A patch is up at SOLR-176

On 3/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:

On 3/2/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> Yes, this would be really helpfull.  It would be nice to be able to
> put this in in the response output too.

Two votes is enough for me.  I'll see if I can get to it this weekend.

-Mike



Re: merely a suggestion: schema.xml validator or better schema validation logging

2007-03-05 Thread Ryan McKinley

: I had a silent error that I can't remember the details of, but it
: was something like putting the <str> for boost functions outside
: the <lst name="defaults">. It didn't blow up, but it was a nonsense
: config that was accepted.

again, there's nothing erroneous about having a <str> outside of a <lst>
when specifying the init params of a RequestHandler as far as SolrCore is
concerned ... it has no idea what types of init params the RequestHandler
wants ... and the StandardRequestHandler could say that if it sees
any top level init params which aren't "defaults", "invariants" or
"appended" then it could complain ... but again: what if i subclass
StandardRequestHandler and i want to add some custom init param to
determine behavior in my subclass?



One trick i have used elsewhere is to output the loaded config and
compare it to the initialization config - if they are different, there
may be a problem.

We could pretty easily add a utility method like this to
RequestHandlerBase and let RequestHandlers 'validate' their config in
init() - It would not be an automatic thing that applies to every
request handler, but adding some validation to DisMaxRequestHandler
and StandardRequestHandler would take care of most problems
(especially for beginners)
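
Something like this is what i have in mind -- shown here as a standalone
sketch (names invented), though it would really live in RequestHandlerBase:

import java.util.Set;
import java.util.logging.Logger;
import org.apache.solr.util.NamedList;

// sketch: a helper so handlers can 'validate' their config in init()
public class InitParamChecker {
  private static final Logger log = Logger.getLogger(InitParamChecker.class.getName());

  public static void checkInitParams(NamedList args, Set<String> knownParams) {
    for (int i = 0; i < args.size(); i++) {
      String name = args.getName(i);
      if (name != null && !knownParams.contains(name)) {
        // a top-level param we don't recognize -- probably a typo or misplaced element
        log.warning("unexpected init param: " + name);
      }
    }
  }
}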

ryan


Re: Dynamic RequestHandler loading

2007-03-05 Thread Yonik Seeley

On 3/5/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:

I know I'm pushing solr to do things it was never designed to do, so
shut me up quick if this is not where you want things to go - I could
quietly implement this with quick hacks, but i'd rather not...

Currently SolrCore loads all the request handlers in a final variable
as the instance is initialized:

private final RequestHandlers reqHandlers = new
RequestHandlers(SolrConfig.config);

This is a little strange because it forces the request handlers to be
loaded as the instance is initialized - not in a more appropriate
place like after initWriters() in the constructor.  The bad side
effects of this are that handlers cannot know about the schema or
config directory in the init() method.  They are forced to do some
sort of lazy loading.

The other thing I would like is to be able to dynamically register
handlers.  For example, i have one handler that wants to register 6
related handlers (each one maps a different action - it only makes
sense to have them as a collection).  While i could put six entries in
solrconfig, this will quickly become unruly.

How do you feel about:

1. move the request handler initialization into the constructor after
initWriters()?

2. Exposing the following functions in SolrCore:

 // this already exists
  SolrRequestHandler getRequestHandler(String handlerName);

 // this will register the handler and return whatever used to be at
 // that path (or null)
  SolrRequestHandler registerRequestHandler(String handlerName,
      SolrRequestHandler handler);

 // get all the registered handlers by class
  Collection<SolrRequestHandler> getRequestHandlers( Class<? extends SolrRequestHandler> clazz );


By class?  What's that for?

Everything else seems to make sense... it would mean an extra
synchronization per Solr request, but that shouldn't be measurable
given everything else we are doing.

One thing we lost going from individual servlets to a filter was the
potential for load-on-demand handlers.  If/when CSV and SQL update
handlers get put in the core, it might be nice if they weren't loaded
at startup.

-Yonik


Re: Dynamic RequestHandler loading

2007-03-05 Thread Ryan McKinley

>
>  // get all the registered handlers by class
>   Collection<SolrRequestHandler> getRequestHandlers( Class<? extends SolrRequestHandler> clazz );

By class?  What's that for?



It was useful to check what else is configured.  The alternative is to have a
Collection<SolrRequestHandler> getRequestHandlers() and let the
client sort out what is what.  In the case I am looking at, I want to
check if a handler of a given type is already registered - if not, i
register one.

getRequestHandlers() would be equivalent to:
getRequestHandlers( SolrRequestHandler.class )

We will need some way to ask what is registered without knowing the
path it is registered to.


Everything else seems to make sense... it would mean an extra
synchronization per Solr request, but that shouldn't be measurable
given everything else we are doing.



would access to getRequestHandler( name ) need to be synchronized?

Are there any real problems if not?  I don't imagine much would change
after startup.



One thing we lost going from individual servlets to a filter was the
potential for load-on-demand handlers.  If/when CSV and SQL update
handlers get put in the core, it might be nice if they weren't loaded
at startup.



perhaps we could add:


Then we could have a LazyRequestHandlerWrapper that only hangs on to
the configuration parameters, but does not initialize the delegate handler
until its first request.  This would be transparent to the
Filter/Servlet framework.

But I'll save that for another day :)
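
A rough sketch of what i mean (assuming SolrRequestHandler is just init() +
handleRequest(), and ignoring synchronization for now):

import org.apache.solr.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.request.SolrRequestHandler;

// sketch of a lazy wrapper: hold the class name and init args, and only
// instantiate the real handler on the first request
public class LazyRequestHandlerWrapper implements SolrRequestHandler {
  private final String className;
  private NamedList args;
  private SolrRequestHandler delegate;

  public LazyRequestHandlerWrapper(String className) {
    this.className = className;
  }

  public void init(NamedList args) {
    this.args = args;          // just remember the config for later
  }

  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
    if (delegate == null) {
      try {
        delegate = (SolrRequestHandler) Class.forName(className).newInstance();
        delegate.init(args);   // real initialization happens on first use
      } catch (Exception e) {
        throw new RuntimeException("could not lazily load " + className, e);
      }
    }
    delegate.handleRequest(req, rsp);
  }
}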


Error with bin/optimize and multiple solr webapps

2007-03-05 Thread Jeff Rodenburg

I noticed an issue with the optimize bash script in /bin.  Per the line:

rs=`curl http://${solr_hostname}:${solr_port}/solr/update -s -d "<optimize/>"`

This line assumes a single solr installation under Tomcat, whereas the
multiple webapp scenario runs from a different location (the "/solr" part).
I'm sure this applies elsewhere.
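
A possible fix (just a sketch -- the variable name webapp_name is made up,
and the default keeps the current single-webapp behaviour) would be to
parameterize the webapp part of the path:

  # default to "solr" so existing single-webapp setups keep working
  webapp_name=${webapp_name:-solr}

  rs=`curl http://${solr_hostname}:${solr_port}/${webapp_name}/update -s -d "<optimize/>"`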

I would submit a patch for JIRA, but couldn't find these files under version
control.  Any recommendations?

-- j


Re: Federated Search

2007-03-05 Thread Venkatesh Seetharam

Hi Tim,

Thanks for your response. Interesting idea. Does the DB scale?  Do you have
one single index which you plan to use Solr for, or do you have multiple
indexes?


> But I don't know how big the index will grow and I wanted to be able to
> add servers at any point.

I'm thinking of having N partitions with a max of 10 million documents per
partition. Adding a server should not be a problem, but the newly added
server would take time to grow so that the distribution of documents is equal
across the cluster. I've tested with 50 million documents of 10 size each and
it looks very promising.


> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.

I'm in fact looking around for a reverse-hash algorithm wherein, given a
docId, I should be able to find which partition contains the document, so I
can save cycles on broadcasting to the slaves.

I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix.

We have the same problem since we get daily updates to documents and
document metadata.


> How did you work around not being able to update a lucene index that is
> stored in Hadoop?

I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster
and hence did not need any change to Lucene.

I plan to index using Lucene/Hadoop and use Solr as the partition searcher
and a broker which would merge the results and return 'em.

Thanks,
Venkatesh

On 3/5/07, Tim Patton <[EMAIL PROTECTED]> wrote:




Venkatesh Seetharam wrote:
> Hi Tim,
>
> Howdy. I saw your post on Solr newsgroup and caught my attention. I'm
> working on a similar problem for searching a vault of over 100 million
> XML documents. I already have the encoding part done using Hadoop and
> Lucene. It works like a  charm. I create N index partitions and have
> been trying to wrap Solr to search each partition, have a Search broker
> that merges the results and returns.
>
> I'm curious about how have you solved the distribution of additions,
> deletions and updates to each of the indexing servers.I use a
> partitioner based on a hash of the document id. Do you broadcast to the
> slaves as to who owns a document?
>
> Also, I'm looking at Hadoop RPC and ICE ( www.zeroc.com
> ) for distributing the search across these Solr
> servers. I'm not using HTTP.
>
> Any ideas are greatly appreciated.
>
> PS: I did subscribe to solr newsgroup now but  did not receive a
> confirmation and hence sending it to you directly.
>
> --
> Thanks,
> Venkatesh
>
> "Perfection (in design) is achieved not when there is nothing more to
> add, but rather when there is nothing more to take away."
> - Antoine de Saint-Exupéry


I used a SQL database to keep track of which server had which document.
Then I originally used JMS and would use a selector for which server
number the document should go to.  I switched over to a home grown,
lightweight message server since JMS behaves really badly when it backs
up and I couldn't find a server that would simply pause the producers if
there was a problem with the consumers.  Additions are pretty much
assigned randomly to whichever server gets them first.  At this point I
am up to around 20 million documents.

The hash idea sounds really interesting and if I had a fixed number of
indexes it would be perfect.  But I don't know how big the index will
grow and I wanted to be able to add servers at any point.  I would like
to eliminate any outside dependencies (SQL, JMS), which is why a
distributed Solr would let me focus on other areas.

How did you work around not being able to update a lucene index that is
stored in Hadoop?  I know there were changes in Lucene 2.1 to support
this but I haven't looked that far into it yet, I've just been testing
the new IndexWriter.  As an aside, I hope those features can be used by
Solr soon (if they aren't already in the nightlys).

Tim




Re: Error with bin/optimize and multiple solr webapps

2007-03-05 Thread Chris Hostetter

: This line assumes a single solr installation under Tomcat, whereas the
: multiple webapp scenario runs from a different location (the "/solr" part).
: I'm sure this applies elsewhere.

good catch ... it looks like all of our scripts assume "/solr/update" is
the correct path to POST commit/optimize messages to.

: I would submit a patch for JIRA, but couldn't find these files under version
: control.  Any recommendations?

They live in src/scripts ... a patch would certainly be appreciated.

FYI: there is an evolution underway to allow XML based update messages to
be sent to any path (and the fixed path "/update" is being deprecated)
so it would be handy if the entire URL path was configurable (not just the
webapp name)


-Hoss



Re: Error with bin/optimize and multiple solr webapps

2007-03-05 Thread Jeff Rodenburg

Thanks Hoss.  I'll add an issue in JIRA and attach the patch.



On 3/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:



: This line assumes a single solr installation under Tomcat, whereas the
: multiple webapp scenario runs from a different location (the "/solr"
part).
: I'm sure this applies elsewhere.

good catch ... it looks like all of our scripts assume "/solr/update" is
the correct path to POST commit/optimize messages to.

: I would submit a patch for JIRA, but couldn't find these files under
version
: control.  Any recommendations?

They live in src/scripts ... a patch would certainly be appreciated.

FYI: there is an evolution underway to allow XML based update messages to
be sent to any path (and the fixed path "/update" is being deprecated)
so it would be handy if the entire URL path was configurable (not just the
webapp name)


-Hoss




Re: Re[2]: Federated Search

2007-03-05 Thread Venkatesh Seetharam

Hi Jack,

Howdy. Comments are inline.


> is there any reason you don't want to use HTTP?

I've seen that Hadoop RPC is faster than HTTP. Also, since Solr returns an XML
response, you incur overhead in parsing that and then merging. I haven't done
scale testing with HTTP and XML responses.


> Do the searchers need to know who has what document?

This is necessary if you are doing updates to the documents in the index.


> I suppose solr is ok to handle 20 million document

Storage is not an issue. If the size of the index is huge, then it will take
time, and when you want 100 searches/second, it's really hard. I've read in the
Lucene newsgroup that Lucene works well with an index around 8-10GB. It
slows down when it's bigger than that. Since my index can run into many GB,
I'd partition that.


> - If a hash value-based partitioning is used, re-indexing all the
>   documents will be needed.

Why is that necessary? If a document has to be updated, you can broadcast to
the slaves as to who owns it and then send an update to that slave.

Venkatesh

On 3/5/07, Jack L <[EMAIL PROTECTED]> wrote:


This is very interesting discussion. I have a few question while
reading Tim and Venkatesh's email:

To Tim:
1. is there any reason you don't want to use HTTP? Since solr has
   an HTTP interface already, I suppose using HTTP is the simplest
   way to communicate the solr servers from the merger/search broker.
   hadoop and ice would both require some additional work - this is
   if you are using solr and not lucent directly.

2. "Do you broadcast to the slaves as to who owns a document?"
   Do the searchers need to know who has what document?

To Venkatesh:
1. I suppose solr is ok to handle 20 million document - I hope I'm
   right because that's what I'm planning on doing :) Is it because
   of storage capacity why you you choose to use multiple solr
   servers?

An open question: what's the best way to manage server addition?
- If a hash value-based partitioning is used, re-indexing all
  the document will be needed.
- Otherwise, a database seems to be required to track the documents.

--
Best regards,
Jack

Monday, March 5, 2007, 7:47:36 AM, you wrote:



> Venkatesh Seetharam wrote:
>> Hi Tim,
>>
>> Howdy. I saw your post on Solr newsgroup and caught my attention. I'm
>> working on a similar problem for searching a vault of over 100 million
>> XML documents. I already have the encoding part done using Hadoop and
>> Lucene. It works like a  charm. I create N index partitions and have
>> been trying to wrap Solr to search each partition, have a Search broker
>> that merges the results and returns.
>>
>> I'm curious about how have you solved the distribution of additions,
>> deletions and updates to each of the indexing servers.I use a
>> partitioner based on a hash of the document id. Do you broadcast to the

>> slaves as to who owns a document?
>>
>> Also, I'm looking at Hadoop RPC and ICE ( www.zeroc.com
>> ) for distributing the search across these Solr
>> servers. I'm not using HTTP.
>>
>> Any ideas are greatly appreciated.
>>
>> PS: I did subscribe to solr newsgroup now but  did not receive a
>> confirmation and hence sending it to you directly.
>>
>> --
>> Thanks,
>> Venkatesh
>>
>> "Perfection (in design) is achieved not when there is nothing more to
>> add, but rather when there is nothing more to take away."
>> - Antoine de Saint-Exupéry


> I used a SQL database to keep track of which server had which document.
> Then I originally used JMS and would use a selector for which server

> number the document should go to.  I switched over to a home grown,
> lightweight message server since JMS behaves really badly when it backs
> up and I couldn't find a server that would simply pause the producers if

> there was a problem with the consumers.  Additions are pretty much
> assigned randomly to whichever server gets them first.  At this point I
> am up to around 20 million documents.

> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.  But I don't know how big the index will
> grow and I wanted to be able to add servers at any point.  I would like
> to eliminate any outside dependencies (SQL, JMS), which is why a
> distributed Solr would let me focus on other areas.

> How did you work around not being able to update a lucene index that is
> stored in Hadoop?  I know there were changes in Lucene 2.1 to support
> this but I haven't looked that far into it yet, I've just been testing
> the new IndexWriter.  As an aside, I hope those features can be used by
> Solr soon (if they aren't already in the nightlys).

> Tim
