On Mon, Sep 13, 2010 at 8:02 AM, Dennis Gearon <gear...@sbcglobal.net> wrote:
> BTW, what is a segment?

On the Lucene level an index is composed of one or more index
segments. Each segment is an index by itself and consists of several
files: doc stores, proximity data, term dictionaries, etc. During
indexing, Lucene / Solr creates those segments depending on the RAM
buffer / document buffer settings and flushes them to disk (if you
index to disk). Once a segment has been flushed, Lucene will never
change it (well, up to a certain level - let's keep this simple) but
will write new segments for newly added documents. Since segments
have a write-once policy, Lucene merges multiple segments into a new
segment from time to time (how and when this happens is a different
story) to get rid of deleted documents and to reduce the overall
number of segments in the index.
Generally, a higher number of segments will also influence your
search performance, since Lucene performs almost all operations on a
per-segment level. If you want to reduce the number of segments to
one, you need to call optimize and Lucene will merge all existing
segments into a single one.
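To make that concrete: with Solr's standard XML update handler you can
trigger that merge-down-to-one yourself by POSTing an optimize message
(a sketch - the core URL depends on your setup):

```xml
<!-- POST this to your core's /update handler to ask Lucene to merge
     all segments down to a single one; note this is an expensive
     operation on large indexes -->
<optimize/>
```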

hope that answers your question

simon
>
> I've only heard about them in the last 2 weeks here on the list.
> Dennis Gearon
>
> Signature Warning
> ----------------
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Sun, 9/12/10, Jason Rutherglen <jason.rutherg...@gmail.com> wrote:
>
>> From: Jason Rutherglen <jason.rutherg...@gmail.com>
>> Subject: Re: Tuning Solr caches with high commit rates (NRT)
>> To: solr-user@lucene.apache.org
>> Date: Sunday, September 12, 2010, 7:52 PM
>> Yeah there's no patch... I think
>> Yonik can write it. :-)  Yah... The
>> Lucene version shouldn't matter.  The distributed
>> faceting
>> theoretically can easily be applied to multiple segments,
>> however the
>> way it's written for me is a challenge to untangle and
>> apply
>> successfully to a working patch.  Also I don't have
>> this as an itch to
>> scratch at the moment.
>>
>> On Sun, Sep 12, 2010 at 7:18 PM, Peter Sturge <peter.stu...@gmail.com>
>> wrote:
>> > Hi Jason,
>> >
>> > I've tried some limited testing with the 4.x trunk
>> using fcs, and I
>> > must say, I really like the idea of per-segment
>> faceting.
>> > I was hoping to see it in 3.x, but I don't see this
>> option in the
>> > branch_3x trunk. Is your SOLR-1606 patch referred to
>> in SOLR-1617 the
>> > one to use with 3.1?
>> > There seems to be a number of Solr issues tied to this
>> - one of them
>> > being Lucene-1785. Can the per-segment faceting patch
>> work with Lucene
>> > 2.9/branch_3x?
>> >
>> > Thanks,
>> > Peter
>> >
>> >
>> >
>> > On Mon, Sep 13, 2010 at 12:05 AM, Jason Rutherglen
>> > <jason.rutherg...@gmail.com>
>> wrote:
>> >> Peter,
>> >>
>> >> Are you using per-segment faceting, eg, SOLR-1617?
>>  That could help
>> >> your situation.
>> >>
>> >> On Sun, Sep 12, 2010 at 12:26 PM, Peter Sturge
>> <peter.stu...@gmail.com>
>> wrote:
>> >>> Hi,
>> >>>
>> >>> Below are some notes regarding Solr cache
>> tuning that should prove
>> >>> useful for anyone who uses Solr with frequent
>> commits (e.g. <5min).
>> >>>
>> >>> Environment:
>> >>> Solr 1.4.1 or branch_3x trunk.
>> >>> Note the 4.x trunk has lots of neat new
>> features, so the notes here
>> >>> are likely less relevant to the 4.x
>> environment.
>> >>>
>> >>> Overview:
>> >>> Our Solr environment makes extensive use of
>> faceting; we perform
>> >>> commits every 30secs, and the indexes tend to be
>> on the large-ish side
>> >>> (>20million docs).
>> >>> Note: For our data, when we commit, we are
>> always adding new data,
>> >>> never changing existing data.
>> >>> This type of environment can be tricky to
>> tune, as Solr is more geared
>> >>> toward fast reads than frequent writes.
>> >>>
>> >>> Symptoms:
>> >>> If anyone has used faceting in searches where
>> you are also performing
>> >>> frequent commits, you've likely encountered
>> the dreaded OutOfMemory or
>> GC Overhead Exceeded errors.
>> >>> In high commit rate environments, this is
>> almost always due to
>> >>> multiple 'onDeck' searchers and autowarming -
>> i.e. new searchers don't
>> >>> finish autowarming their caches before the
>> next commit()
>> >>> comes along and invalidates them.
>> >>> Once this starts happening on a regular basis,
>> it is likely your
>> >>> Solr's JVM will run out of memory eventually,
>> as the number of
>> >>> searchers (and their cache arrays) will keep
>> growing until the JVM
>> >>> dies of thirst.
>> >>> To check if your Solr environment is suffering
>> from this, turn on INFO
>> >>> level logging, and look for: 'PERFORMANCE
>> WARNING: Overlapping
>> >>> onDeckSearchers=x'.
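A related solrconfig.xml knob worth knowing about (the value here is
illustrative - pick one that suits your commit rate):

```xml
<!-- solrconfig.xml: cap the number of concurrent warming searchers;
     once the cap is hit, a commit fails fast instead of stacking up
     more autowarming searchers and exhausting the heap -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```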
>> >>>
>> >>> In tests, we've only ever seen this problem
>> when using faceting, and
>> >>> facet.method=fc.
>> >>>
>> >>> Some solutions to this are:
>> >>>    Reduce the commit rate to allow searchers
>> to fully warm before the
>> >>> next commit
>> >>>    Reduce or eliminate the autowarming in
>> caches
>> >>>    Both of the above
>> >>>
>> >>> The trouble is, if you're doing NRT commits,
>> you likely have a good
>> reason for it, and reducing/eliminating
>> autowarming will very
>> >>> significantly impact search performance in
>> high commit rate
>> >>> environments.
>> >>>
>> >>> Solution:
>> >>> Here are some setup steps we've used that
>> allow lots of faceting (we
>> >>> typically search with at least 20-35 different
>> facet fields, and date
>> >>> faceting/sorting) on large indexes, and still
>> keep decent search
>> >>> performance:
>> >>>
>> >>> 1. Firstly, you should consider using the enum
>> method for facet
>> >>> searches (facet.method=enum) unless you've got
>> A LOT of memory on your
>> >>> machine. In our tests, this method uses a lot
>> less memory and
>> >>> autowarms more quickly than fc. (Note, I've
>> not tried the new
>> segment-based 'fcs' option, as I can't find
>> support for it in
>> >>> branch_3x - looks nice for 4.x though)
>> >>> Admittedly, for our data, enum is not quite as
>> fast for searching as
>> fc, but short of purchasing a Taiwanese RAM
>> factory, it's a worthwhile
>> >>> tradeoff.
>> >>> If you do have access to LOTS of memory, AND
>> you can guarantee that
>> >>> the index won't grow beyond the memory
>> capacity (i.e. you have some
>> >>> sort of deletion policy in place), fc can be a
>> lot faster than enum
>> >>> when searching with lots of facets across many
>> terms.
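One way to apply this across the board is to make enum the default in
your search handler (a sketch - handler name and other defaults are
just examples; individual requests can still override with
facet.method=fc):

```xml
<!-- solrconfig.xml: default every request on this handler to the
     enum facet method -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="facet">true</str>
    <str name="facet.method">enum</str>
  </lst>
</requestHandler>
```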
>> >>>
>> >>> 2. Secondly, we've found that LRUCache is
>> faster at autowarming than
>> >>> FastLRUCache - in our tests, about 20% faster.
>> Maybe this is just our
>> >>> environment - your mileage may vary.
>> >>>
>> >>> So, our filterCache section in solrconfig.xml
>> looks like this:
>> >>>    <filterCache
>> >>>      class="solr.LRUCache"
>> >>>      size="3600"
>> >>>      initialSize="1400"
>> >>>      autowarmCount="3600"/>
>> >>>
>> >>> For a 28GB index, running in a quad-core x64
>> VMWare instance, 30
>> >>> warmed facet fields, Solr is running at ~4GB.
>> Stats filterCache size
>> >>> shows usually in the region of ~2400.
>> >>>
>> >>> 3. It's also a good idea to have some sort of
>> >>> firstSearcher/newSearcher event listener
>> queries to allow new data to
>> >>> populate the caches.
>> >>> Of course, what you put in these is dependent
>> on the facets you need/use.
>> >>> We've found a good combination is a
>> firstSearcher with as many facets
>> >>> in the search as your environment can handle,
>> then a subset of the
>> >>> most common facets for the newSearcher.
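A minimal sketch of such a listener in solrconfig.xml (the query and
facet field are placeholders - substitute your own common facets, and
add more <lst> entries as needed):

```xml
<!-- solrconfig.xml: run warming queries whenever a new searcher is
     opened, so its caches are populated before it serves traffic -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">category</str>
    </lst>
  </arr>
</listener>
```

The same listener can be declared with event="firstSearcher" for the
heavier warm-up query set mentioned above.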
>> >>>
>> >>> 4. We also set:
>> >>>
>> <useColdSearcher>true</useColdSearcher>
>> >>> just in case.
>> >>>
>> >>> 5. Another key area for search performance
>> with high commits is to use
>> >>> 2 Solr instances - one for the high commit
>> rate indexing, and one for
>> >>> searching.
>> >>> The read-only searching instance can be a
>> remote replica, or a local
>> >>> read-only instance that reads the same core as
>> the indexing instance
>> >>> (for the latter, you'll need something that
>> periodically refreshes -
>> >>> i.e. runs commit()).
>> >>> This way, you can tune the indexing instance
>> for writing performance
>> >>> and the searching instance as above for max
>> read performance.
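For the local read-only instance, the periodic refresh can be as
simple as POSTing an empty commit message to its update handler (e.g.
from cron - the URL is illustrative):

```xml
<!-- POST to the read-only instance's /update handler to reopen its
     searcher and make newly indexed data visible -->
<commit/>
```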
>> >>>
>> >>> Using the setup above, we get fantastic
>> searching speed for small
>> >>> facet sets (well under 1sec), and really good
>> searching for large
>> >>> facet sets (a couple of secs depending on
>> index size, number of
>> >>> facets, unique terms etc. etc.),
>> >>> even when searching against largeish indexes
>> (>20million docs).
>> >>> We have yet to see any OOM or GC errors using
>> the techniques above,
>> >>> even in low memory conditions.
>> >>>
>> >>> I hope there are people who find this useful.
>> I know I've spent a lot
>> >>> of time looking for stuff like this, so
>> hopefully, this will save
>> >>> someone some time.
>> >>>
>> >>>
>> >>> Peter
>> >>>
>> >>
>> >
>>
>
