Re: Retrieving large num of docs

Andrey Klochkov Sat, 28 Nov 2009 09:48:11 -0800

Hi Raghu

Let me describe our use case in more detail; that will probably clarify
things.

The usual use case for Lucene/Solr is retrieving a small portion of the
result set (10-20 documents). In our case we need to read the whole result
set, and this creates a huge load on the Lucene index, meaning a lot of IO.
Keep in mind that we have a large number of stored fields in the index.

In our case there's one thing that makes things simpler: our index is so
small that we can fit every document in the cache. This means that even if we
retrieve all documents for every result set, we don't read them from the
Lucene index, so the performance should be OK. But here we've got 2
problems:

1. Solr caches Lucene's Document instances, so when retrieving the whole
result set it recreates SolrDocument instances every time. This creates a
load on the CPU and in particular on the Java GC.
2. EmbeddedSolrServer converts the whole response into a byte array and then
restores it, converting Lucene's documents and DocLists to Solr's
SolrDocument and SolrDocumentList instances. This creates additional load on
the CPU and GC.
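
To show problem 1 in isolation, here is a minimal sketch. It is written in
Python rather than Solr's Java, it is not the SOLR-797 patch, and every name
in it (RawDoc, convert, ConvertedDocCache) is hypothetical rather than a Solr
or Lucene API. It only counts how many response objects get built when the
raw form is cached versus when the converted form is cached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawDoc:
    """Stand-in for a cached low-level document (hypothetical, not a Lucene class)."""
    doc_id: int
    fields: tuple  # tuple of (name, value) pairs


conversions = 0  # counts how many response-level objects get built


def convert(raw):
    """Build the response-level object; this allocation is the repeated cost."""
    global conversions
    conversions += 1
    return dict(raw.fields, id=raw.doc_id)


class ConvertedDocCache:
    """Caches the converted object so each document is converted at most once."""

    def __init__(self):
        self._converted = {}

    def get(self, raw):
        doc = self._converted.get(raw.doc_id)
        if doc is None:
            doc = convert(raw)
            self._converted[raw.doc_id] = doc
        return doc


raw_docs = [RawDoc(i, (("title", f"doc {i}"),)) for i in range(1000)]

# Only the raw form is cached: every full read reconverts every document.
for _ in range(3):
    uncached_results = [convert(r) for r in raw_docs]
uncached_conversions = conversions

# The converted form is cached: only the first full read pays for conversion.
conversions = 0
cache = ConvertedDocCache()
for _ in range(3):
    cached_results = [cache.get(r) for r in raw_docs]
cached_conversions = conversions
```

One caveat of this sketch: the cached objects are shared between requests,
so callers would have to treat them as read-only.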

We patched Solr to eliminate those things and that fixed our performance
problems.

I think that if you don't hold all your documents in caches and/or you
don't use stored fields (retrieving the ID field only), then those
improvements probably won't help you.

I suggest you first find your bottlenecks. Look at IO, memory usage, etc.
Using a profiler is the best approach. You can probably use some tools from
Lucid Imagination for profiling.
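
As a starting point before a real profiler, a small client-side timing
harness can at least give comparable numbers before and after a change. This
is only a sketch: time_calls and fake_fetch are hypothetical names, and
fake_fetch is a placeholder to be replaced with the real call that retrieves
a full result set. It measures wall-clock time seen by the client, so it does
not say whether the time goes to IO, CPU or GC on the server.

```python
import statistics
import time


def time_calls(fetch, runs=20, warmup=3):
    """Call fetch() repeatedly and summarize wall-clock latency in seconds."""
    # Warm-up calls let caches fill so they do not skew the samples.
    for _ in range(warmup):
        fetch()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fetch()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_fetch():
    """Placeholder workload; swap in the real query that reads all documents."""
    return sum(range(10000))


stats = time_calls(fake_fetch)
```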

On Sat, Nov 28, 2009 at 4:47 PM, Raghuveer Kancherla <
raghuveer.kanche...@aplopio.com> wrote:

> Hi Andrew,
> I applied the patch you suggested. I am not finding any significant changes
> in the response times.
> I am wondering if I forgot some important configuration setting etc.
> Here is what I did:
>
>   1. Wrote a small program using solrj to use EmbeddedSolrServer (most of
>   the code is from the solr wiki), ran the server on an index of ~700k
>   docs, and noted down the avg response time
>   2. Applied the SOLR-797.patch to the source code of Solr 1.4
>   3. Compiled the source code and rebuilt the jar files.
>   4. Reran step 1 using the new jar files.
>
> Am I supposed to make any other config changes in order to see the
> performance jump that you were able to achieve?
>
> Thanks a lot,
> Raghu
>
>
> On Fri, Nov 27, 2009 at 3:16 PM, AHMET ARSLAN <iori...@yahoo.com> wrote:
>
> > > Hi Andrew,
> > > We are running solr using its http interface from python.
> > > From the resources
> > > I could find, EmbeddedSolrServer is possible only if I am
> > > using solr from a
> > > java program.  It will be useful to understand if a
> > > significant part of the
> > > performance increase is due to bypassing HTTP before going
> > > down this path.
> > >
> > > In the mean time I am trying my luck with the other
> > > suggestions. Can you
> > > share the patch that helps cache solr documents instead of
> > > lucene documents?
> >
> > Maybe these links can help:
> > http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
> > http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
> > http://www.lucidimagination.com/Downloads/LucidGaze-for-Solr
> >
> > How often do you update your index?
> > Is your index optimized?
> > Configuring caching can also help:
> >
> > http://wiki.apache.org/solr/SolrCaching
> > http://wiki.apache.org/solr/SolrPerformanceFactors

-- 
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics
