> deletions made by readers merely mark it for
> deletion, and once a doc has been marked for deletions it is deleted for all
> intents and purposes, right?
There's the point-in-timeness of a reader to consider.

> Does the N in NRT represent only the cost of reopening a searcher?

Aptly put, and yes, basically.

> the only thing that comes to mind is the IW unflushed buffer

This is LUCENE-2312.

On Mon, Jun 13, 2011 at 3:19 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
> Since there should only be one writer, I'm not sure why you'd need
> transactional storage for that? Deletions made by readers merely mark a doc
> for deletion, and once a doc has been marked for deletion it is deleted for
> all intents and purposes, right? But perhaps I need to refresh my memory on
> the internals, it has been a while.
>
> Does the N in NRT represent only the cost of reopening a searcher? Meaning,
> if I could ensure reopening always happens fast and returns a searcher for
> the correct index revision, would it guarantee real real-time search? Or is
> there anything else standing in between? The only thing that comes to mind
> is the IW unflushed buffer - which only Twitter's approach seems to handle
> (not even Zoie).
>
> Itamar.
>
> On 14/06/2011 01:00, Michael McCandless wrote:
>>
>> Yes, adding deletes to Twitter's approach will be a challenge!
>>
>> I don't think we'd do the post-filtering solution, but instead maybe
>> resolve the deletes "live" and store them in a transactional data
>> structure of some kind... but even then we will pay a perf hit to
>> look up deleted docs against it.
>>
>> So, yeah, there will presumably be a tradeoff with this approach too.
>> However, turning around changes from the adds should be faster (no
>> segment gets flushed).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Jun 13, 2011 at 5:06 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>
>>> Thanks Mike, much appreciated.
>>>
>>> Wouldn't Twitter's approach fall into the exact same pitfall you
>>> described Zoie does (or did) once it handles deletes too?
>>> I don't think there is any
>>> other way of handling deletes other than post-filtering results. But
>>> perhaps the IW cache would be smaller than Zoie's RAMDirectory(ies)?
>>>
>>> I'll give all that a serious dive and report back with results, or if
>>> more input is required...
>>>
>>> Itamar.
>>>
>>> On 13/06/2011 19:01, Michael McCandless wrote:
>>>> Here's a blog post describing some details of Twitter's approach:
>>>>
>>>> http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html
>>>>
>>>> And here's a talk Michael did last October (Lucene Revolution):
>>>>
>>>> http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter
>>>>
>>>> Twitter's case is simpler since they never delete ;)  So we have to
>>>> fix that to do it in Lucene... there are also various open issues that
>>>> begin to explore some of the ideas here.
>>>>
>>>> But this ("immediate consistency") would be a deep and complex change,
>>>> and I don't see many apps that actually require it.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>
>>>>> Thanks for your detailed answer. We'll have to tackle this and see
>>>>> what's more important to us then. I'd definitely love to hear Zoie
>>>>> has overcome all that...
>>>>>
>>>>> Any pointers to Michael Busch's approach? I take it this has
>>>>> something to do with the core itself or the index format, probably
>>>>> using the Flex version?
>>>>>
>>>>> Itamar.
>>>>>
>>>>> On 12/06/2011 23:12, Michael McCandless wrote:
>>>>>> From what I understand of Zoie (and it's been some time since I last
>>>>>> looked...
>>>>>> so this could be wrong now), the biggest difference vs NRT is that
>>>>>> Zoie aims for "immediate consistency", ie index changes are always
>>>>>> made visible to the very next query, vs NRT which is "controlled
>>>>>> consistency", a blend between immediate and eventual consistency
>>>>>> where your app decides when the changes must become visible.
>>>>>>
>>>>>> But in exchange for that, Zoie pays a price: each search has a
>>>>>> higher cost per collected hit, since it must post-filter for
>>>>>> deleted docs. And since Zoie necessarily adds complexity, there's
>>>>>> more risk; eg there were some nasty Zoie bugs that took quite some
>>>>>> time to track down (under
>>>>>> https://issues.apache.org/jira/browse/LUCENE-2729).
>>>>>>
>>>>>> Anyway, I don't think that's a good tradeoff, in general, for our
>>>>>> users, because very few apps truly require immediate consistency
>>>>>> from Lucene (can anyone give an example where their app depends on
>>>>>> immediate consistency...?). I think it's better to spend time during
>>>>>> reopen so that searches aren't slower.
>>>>>>
>>>>>> That said, Lucene has already incorporated one big part of Zoie
>>>>>> (caching small segments in RAM) via the new NRTCachingDirectory (in
>>>>>> contrib/misc). Also, the upcoming NRTManager
>>>>>> (https://issues.apache.org/jira/browse/LUCENE-2955) adds control
>>>>>> over visibility of specific indexing changes to queries that need
>>>>>> to see the changes.
>>>>>>
>>>>>> Finally, even better would be to not have to make any tradeoff
>>>>>> whatsoever ;)  Twitter's approach (created by Michael Busch) seems
>>>>>> to bring immediate consistency with no search performance hit, so
>>>>>> if we do anything here likely it'll be similar to what Michael has
>>>>>> done (though, those changes are not simple either!).
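The per-hit cost of post-filtering that Mike describes can be sketched in plain Java. This is a hypothetical illustration only, not Zoie's or Lucene's actual API: `postFilter`, the doc-id list, and the `BitSet` of deletes are all made up for the example.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class PostFilterDemo {

    // Raw hits come from a point-in-time reader that does not yet know
    // about recent deletes, so every candidate doc must be checked
    // against the delete set before it may be returned.
    static List<Integer> postFilter(List<Integer> rawHits, BitSet deleted) {
        List<Integer> visible = new ArrayList<>();
        for (int docId : rawHits) {
            // This membership test is the per-collected-hit overhead the
            // thread discusses; it runs once for every raw hit.
            if (!deleted.get(docId)) {
                visible.add(docId);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(3);
        deleted.set(7);
        List<Integer> raw = List.of(1, 3, 5, 7, 9);
        System.out.println(postFilter(raw, deleted)); // prints [1, 5, 9]
    }
}
```

The point of the sketch is that the filter cost scales with the number of hits collected, whereas resolving deletes at reopen time pays once per reopen instead.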
>>>>>>
>>>>>> Mike McCandless
>>>>>>
>>>>>> http://blog.mikemccandless.com
>>>>>>
>>>>>> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>
>>>>>>> Mike,
>>>>>>>
>>>>>>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT
>>>>>>> apparently isn't fast enough if Zoie was needed, and now that Zoie
>>>>>>> is around, are there any plans to make it Lucene's default? Or: why
>>>>>>> would one still use NRT when Zoie seems to work much better?
>>>>>>>
>>>>>>> Itamar.
>>>>>>>
>>>>>>> On 12/06/2011 13:16, Michael McCandless wrote:
>>>>>>>> Remember that memory-mapping is not a panacea: at the end of the
>>>>>>>> day, if there just isn't enough RAM on the machine to keep your
>>>>>>>> full "working set" hot, then the OS will have to hit the disk,
>>>>>>>> regardless of whether the access is through MMap or a
>>>>>>>> "traditional" IO request.
>>>>>>>>
>>>>>>>> That said, on Fedora Linux anyway, I generally see better
>>>>>>>> performance from MMap than from NIOFSDir; eg see the 2nd chart
>>>>>>>> here:
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>>>>>>>>
>>>>>>>> Mike McCandless
>>>>>>>>
>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>
>>>>>>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> The whole point of my question was to find out if and how to do
>>>>>>>>> balancing on the SAME machine. Apparently that's not going to
>>>>>>>>> help, and at a certain point we will just have to prompt the
>>>>>>>>> user to buy more hardware...
>>>>>>>>>
>>>>>>>>> Out of curiosity, isn't there anything we can do to avoid that?
>>>>>>>>> For instance, using memory-mapped files for the indexes?
>>>>>>>>> Anything that would help us overcome OS limitations of that
>>>>>>>>> sort...
>>>>>>>>>
>>>>>>>>> Also, you mention a scheduled job to check for performance
>>>>>>>>> degradation; any idea how serious such a drop should be for
>>>>>>>>> sharding to be really beneficial? Or is it application-specific
>>>>>>>>> too?
>>>>>>>>>
>>>>>>>>> Itamar.
>>>>>>>>>
>>>>>>>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>>>>>>>> I agree w/ Erick, there is no cutoff point (index size, for
>>>>>>>>>> that matter) above which you start sharding.
>>>>>>>>>>
>>>>>>>>>> What you can do is create a scheduled job in your system that
>>>>>>>>>> runs a select list of queries and monitors their performance.
>>>>>>>>>> Once it degrades, it shards the index by either splitting it
>>>>>>>>>> (you can use IndexSplitter under contrib) or creating a new
>>>>>>>>>> shard, and directs new documents to it.
>>>>>>>>>>
>>>>>>>>>> I think I read somewhere, not sure if it was in Solr or
>>>>>>>>>> ElasticSearch documentation, about a Balancer object, which
>>>>>>>>>> moves shards around in order to balance the load on the
>>>>>>>>>> cluster. You can implement something similar which tries to
>>>>>>>>>> balance the index sizes, creates new shards on-the-fly, and
>>>>>>>>>> even merges shards if suddenly a whole source is removed from
>>>>>>>>>> the system, etc.
>>>>>>>>>>
>>>>>>>>>> Also, note that the 'largest index size' threshold is really a
>>>>>>>>>> machine constraint and not Lucene's. So if you decide that
>>>>>>>>>> 10 GB is your cutoff, it is pointless to create 10x10GB shards
>>>>>>>>>> on the same machine -- searching them is just like searching a
>>>>>>>>>> 100GB index w/ 10x10GB segments.
>>>>>>>>>> Perhaps it's even worse, because you consume more RAM when the
>>>>>>>>>> indexes are split (e.g., terms index, field infos, etc.).
>>>>>>>>>>
>>>>>>>>>> Shai
>>>>>>>>>>
>>>>>>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>>>>>> <<<We can't assume anything about the machine running it,
>>>>>>>>>>> so testing won't really tell us much>>>
>>>>>>>>>>>
>>>>>>>>>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>>>>>>>>>> anything you say about running on a machine with 2G available
>>>>>>>>>>> memory on a single processor is completely incomparable to
>>>>>>>>>>> running on a machine with 64G of memory available for Lucene
>>>>>>>>>>> and 16 processors.
>>>>>>>>>>>
>>>>>>>>>>> There's really no such thing as an "optimum" Lucene index
>>>>>>>>>>> size; it always relates to the characteristics of the
>>>>>>>>>>> underlying hardware.
>>>>>>>>>>>
>>>>>>>>>>> I think the best you can do is actually test on various
>>>>>>>>>>> configurations, then at least you can say "on configuration X
>>>>>>>>>>> this is the tipping point".
>>>>>>>>>>>
>>>>>>>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> Erick
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko <ita...@code972.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> I know Lucene indexes to be at their optimum up to a certain
>>>>>>>>>>>> size - said to be around several GBs. I haven't found a good
>>>>>>>>>>>> discussion of this, but it's my understanding that at some
>>>>>>>>>>>> point it's better to split an index into parts (a la
>>>>>>>>>>>> sharding) than to continue searching on a huge-size index.
>>>>>>>>>>>> I assume this has to do with OS and IO configurations. Can
>>>>>>>>>>>> anyone point me to more info on this?
>>>>>>>>>>>>
>>>>>>>>>>>> We have a product that is using Lucene for various searches,
>>>>>>>>>>>> and at the moment each type of search is using its own Lucene
>>>>>>>>>>>> index. We plan on refactoring the way it works and combining
>>>>>>>>>>>> all indexes into one - making the whole system more robust
>>>>>>>>>>>> and with a smaller memory footprint, among other things.
>>>>>>>>>>>>
>>>>>>>>>>>> Assuming the above is true, we are interested in knowing how
>>>>>>>>>>>> to do this correctly. Initially all our indexes will be run
>>>>>>>>>>>> in one big index, but if at some index size there is a severe
>>>>>>>>>>>> performance degradation we would like to handle that
>>>>>>>>>>>> correctly by starting a new FSDirectory index to flush into,
>>>>>>>>>>>> or by re-indexing and moving large indexes into their own
>>>>>>>>>>>> Lucene index.
>>>>>>>>>>>>
>>>>>>>>>>>> Are there any guidelines for measuring or estimating this
>>>>>>>>>>>> correctly? What should we be aware of while considering all
>>>>>>>>>>>> that? We can't assume anything about the machine running it,
>>>>>>>>>>>> so testing won't really tell us much...
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks in advance for any input on this,
>>>>>>>>>>>>
>>>>>>>>>>>> Itamar.
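Shai's scheduled-job suggestion earlier in the thread can be sketched in plain Java. This is a hypothetical illustration, not anything from Lucene: the method names, the canary-query latencies, and the 2x degradation threshold are all assumptions for the example.

```java
import java.util.Arrays;

public class ShardingMonitor {

    // Returns true when the median latency of the canary queries has
    // degraded past `factor` times the baseline measured when the index
    // was small, i.e. the point at which splitting the index may pay off.
    static boolean shouldShard(long[] latenciesMillis, long baselineMillis, double factor) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        long median = sorted[sorted.length / 2];
        return median > baselineMillis * factor;
    }

    public static void main(String[] args) {
        long baseline = 20; // millis, recorded when the index was small
        long[] healthy = {18, 22, 19, 21, 20};
        long[] degraded = {55, 60, 48, 70, 52};
        System.out.println(shouldShard(healthy, baseline, 2.0));  // prints false
        System.out.println(shouldShard(degraded, baseline, 2.0)); // prints true
    }
}
```

In a real system this check would run as a scheduled job, and a `true` result would trigger splitting the index (e.g. with the contrib IndexSplitter Shai mentions) or routing new documents to a fresh shard.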
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org