Re: Embedded about 50% faster for indexing

Walter Underwood Tue, 28 Aug 2007 09:16:22 -0700

No need to run a separate web server. I actually do HTTP updates from
an extra servlet configured into the Solr webserver. It might
seem a little odd, but same-system TCP sockets are extremely fast
and low overhead.


The additional flexibility is nice, too. If I find a bug in the
indexing code in production, I can fix it locally and update
from the fixed copy over HTTP while I wait for a push of code
to production.

Modern HTTP and TCP are very fast and very reliable, so don't
count out the HTTP/XML interface before trying it.
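
For anyone who wants to try that before ruling it out, here is a
minimal sketch of batched updates over one persistent connection. It
is not code from this thread. The host, port, the /solr/update path,
the batch size, and the helper names (build_add_xml, batches,
post_batches) are all assumptions to adjust for your own install.

```python
import http.client
from itertools import islice
from xml.sax.saxutils import escape, quoteattr


def build_add_xml(docs):
    """Render a list of {field: value} dicts as one Solr <add> message."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append("<field name=%s>%s</field>"
                         % (quoteattr(name), escape(str(value))))
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batches(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def _post(conn, path, body, headers):
    """POST one body and drain the response so the socket can be reused."""
    conn.request("POST", path, body=body, headers=headers)
    response = conn.getresponse()
    response.read()
    if response.status != 200:
        raise RuntimeError("Solr returned HTTP %d" % response.status)


def post_batches(conn, docs, batch_size=100, path="/solr/update"):
    """Send docs in batches over one connection, then commit once.

    `path` and `batch_size` are assumed defaults, not values from the thread.
    Returns the number of <add> requests sent.
    """
    headers = {"Content-Type": "text/xml; charset=utf-8"}
    sent = 0
    for chunk in batches(docs, batch_size):
        _post(conn, path, build_add_xml(chunk).encode("utf-8"), headers)
        sent += 1
    _post(conn, path, b"<commit/>", headers)
    return sent


if __name__ == "__main__":
    # Hypothetical host and port; reuse this one connection for every batch.
    connection = http.client.HTTPConnection("localhost", 8983)
    post_batches(connection, [{"id": i, "title": "doc %d" % i} for i in range(1000)])
    connection.close()
```

Each request carries many documents and the same socket is reused, so
connection setup is paid once rather than per document. Whether that
closes the gap with embedded depends on your analysis and field
storage workload, so measure it.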

wunder
==
Search Guy
Netflix

On 8/27/07 9:18 PM, "climbingrose" <[EMAIL PROTECTED]> wrote:

> Agree. I was actually thinking of developing the embedded version early this
> year for one of my projects. I'm sure it will be needed in cases where
> running another web server is an overkill.
> 
> On 8/28/07, Jonathan Woods <[EMAIL PROTECTED]> wrote:
>> 
>> I don't think you should apologise for highlighting embedded usage.  For
>> circumstances in which you're at liberty to run a Solr instance in the
>> same JVM as an app which uses it, I find it very strange that you should
>> have to use anything _other_ than embedded, and jump through all the
>> unnecessary hoops (XML conversion, HTTP transport) that this implies.
>> It's a bit like suggesting you should throw away Java method invocations
>> altogether, and write everything in XML-RPC.
>> 
>> Bit of a pet issue of mine!  I'll be creating a JIRA issue on the subject
>> soon.
>> 
>> Jon
>> 
>>> -----Original Message-----
>>> From: Sundling, Paul [mailto:[EMAIL PROTECTED]
>>> Sent: 28 August 2007 03:24
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: Embedded about 50% faster for indexing
>>> 
>>> At this point I think I'm going to recommend against embedded,
>>> regardless of any performance advantage.  The level of
>>> documentation is just too low, while the XML API is clearly
>>> documented.  It's clear that XML is preferred.
>>> 
>>> The embedded example on the wiki is pretty good, but until
>>> multiple core support comes out in the next version, you have
>>> to use multiple SolrCore instances.  If they are accessed in the
>>> same webapp, then you can't just set JNDI (since you can only have
>>> one value).  So you have to use a Config object as alluded to
>>> in the example.  However, you look at the code and there is
>>> no javadoc for the constructor.  The constructor args are
>>> (String name, InputStream is, String prefix).  I think name
>>> is a unique name for the solr core, but that is a guess.
>>> The InputStream may be a stream to the solr home, but it could be
>>> anything.  Prefix may be a URI prefix.  These are all guesses
>>> without trying to read through the code.
>>> 
>>> When I look at SolrCore, it looks like it's a singleton, so
>>> maybe I can't even access more than one SolrCore using
>>> embedded anyway.  :(  So I apologize for highlighting Embedded.
>>> 
>>> Anyway it's clear how to do multiple solr cores using XML.
>>> You just have a different POST URI for each of the cores.
>>> You can easily inject that with Spring and externalize the
>>> config.  Simple and easy.  So I concede XML is the way to go. :)
>>> 
>>> Paul Sundling
>>> 
>>> -----Original Message-----
>>> From: Mike Klaas [mailto:[EMAIL PROTECTED]
>>> Sent: Monday, August 27, 2007 5:50 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Embedded about 50% faster for indexing
>>> 
>>> 
>>> On 27-Aug-07, at 12:44 PM, Sundling, Paul wrote:
>>> 
>>>> Whether embedded solr should give me a performance boost or not, it
>>>> did. :)  I'm not surprised, since it skips XML parsing.  Although you
>>>> never know where cycles are used for sure until you profile.
>>> 
>>> It certainly is possible that XML parsing dwarfs indexing, but I'd
>>> expect that only to occur under very light analysis and field
>>> storage workloads.
>>> 
>>>> I tried doing more records per post (200) and it was actually slightly
>>>> slower and seemed to require more memory.  This makes sense because
>>>> you have to take up more memory for the StringBuilder to store the much
>>>> larger XML.  For 10,000 it was much slower.  For that size I would
>>>> need to use XML streaming or something to make it work.
>>>> 
>>>> The solr war was on the same machine, so network overhead was only
>>>> from using loopback.
>>> 
>>> The big question is still your connection handling strategy: are you
>>> using persistent HTTP connections?  Are you indexing from multiple
>>> threads?
>>> 
>>> cheers,
>>> -Mike
>>> 
>>>> Paul Sundling
>>>> 
>>>> -----Original Message-----
>>>> From: climbingrose [mailto:[EMAIL PROTECTED]
>>>> Sent: Monday, August 27, 2007 12:22 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: Embedded about 50% faster for indexing
>>>> 
>>>> 
>>>> Haven't tried the embedded server but I think I have to agree with
>>>> Mike.  We're currently sending 2000 job batches to the SOLR server and
>>>> the amount of time required to transfer documents over HTTP is
>>>> insignificant compared with the time required to index them.  So I do
>>>> think that unless you are sending documents one by one, embedded SOLR
>>>> shouldn't give you much more of a performance boost.
>>>> 
>>>> On 8/25/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
>>>>> 
>>>>> On 24-Aug-07, at 2:29 PM, Wu, Daniel wrote:
>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
>>>>>>> Yonik Seeley
>>>>>>> Sent: Friday, August 24, 2007 2:07 PM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: Re: Embedded about 50% faster for indexing
>>>>>>> 
>>>>>>> One thing I'd like to avoid is everyone trying to embed just for
>>>>>>> performance gains. If there is really that much difference, then we
>>>>>>> need a better way for people to get that without resorting to Java
>>>>>>> code.
>>>>>>> 
>>>>>>> -Yonik
>>>>>>> 
>>>>>> 
>>>>>> Theoretically and practically, an embedded solution will be faster
>>>>>> than going through HTTP/XML.
>>>>> 
>>>>> This is only true if the HTTP interface adds significant overhead to
>>>>> the cost of indexing a document, and I don't see why this should be
>>>>> so, as indexing is relatively heavyweight.  Setting up the connection
>>>>> could be expensive, but this can be greatly mitigated by sending more
>>>>> than one doc per HTTP request, using persistent connections, and
>>>>> threading.
>>>>> 
>>>>> -Mike
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Regards,
>>>> 
>>>> Cuong Hoang
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>
