Re: Index size and performance degradation

Michael McCandless Mon, 13 Jun 2011 09:02:54 -0700

Here's a blog post describing some details of Twitter's approach:

    http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html


And here's a talk Michael Busch gave last October (Lucene Revolution):

    http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter

Twitter's case is simpler since they never delete ;)  So we'd have to
handle deletes to do this in Lucene... there are also various open
issues that begin to explore some of the ideas here.

But this ("immediate consistency") would be a deep and complex change,
and I don't see many apps that actually require it.

Mike McCandless

http://blog.mikemccandless.com

On Sun, Jun 12, 2011 at 4:46 PM, Itamar Syn-Hershko <[email protected]> wrote:
> Thanks for your detailed answer. We'll have to tackle this and see what's
> more important to us then. I'd definitely love to hear Zoie has overcome all
> that...
>
>
> Any pointers to Michael Busch's approach? I take it this has something to do
> with the core itself or the index format, probably using the Flex version?
>
>
> Itamar.
>
>
> On 12/06/2011 23:12, Michael McCandless wrote:
>
>> From what I understand of Zoie (and it's been some time since I last
>> looked... so this could be wrong now), the biggest difference vs NRT
>> is that Zoie aims for "immediate consistency", ie index changes are
>> always made visible to the very next query, vs NRT which is
>> "controlled consistency", a blend between immediate and eventual
>> consistency where your app decides when the changes must become
>> visible.
>>
>> But in exchange for that, Zoie pays a price: each search has a higher
>> cost per collected hit, since it must post-filter for deleted docs.
>> And since Zoie necessarily adds complexity, there's more risk; eg
>> there were some nasty Zoie bugs that took quite some time to track
>> down (under https://issues.apache.org/jira/browse/LUCENE-2729).
>>
>> Anyway, I don't think that's a good tradeoff, in general, for our
>> users, because very few apps truly require immediate consistency from
>> Lucene (can anyone give an example where their app depends on
>> immediate consistency...?).  I think it's better to spend time during
>> reopen so that searches aren't slower.
>>
>> That said, Lucene has already incorporated one big part of Zoie
>> (caching small segments in RAM) via the new NRTCachingDirectory (in
>> contrib/misc).  Also, the upcoming NRTManager
>> (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over
>> visibility of specific indexing changes to queries that need to see
>> the changes.
>>
>> Finally, even better would be to not have to make any tradeoff
>> whatsoever ;)  Twitter's approach (created by Michael Busch) seems to
>> bring immediate consistency with no search performance hit, so if we
>> do anything here likely it'll be similar to what Michael has done
>> (though, those changes are not simple either!).
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Jun 12, 2011 at 2:25 PM, Itamar Syn-Hershko<[email protected]> wrote:
>>>
>>> Mike,
>>>
>>>
>>> Speaking of NRT, and completely off-topic, I know: Lucene's NRT
>>> apparently isn't fast enough if Zoie was needed, and now that Zoie is
>>> around, are there any plans to make it Lucene's default? Or: why would
>>> one still use NRT when Zoie seems to work much better?
>>>
>>>
>>> Itamar.
>>>
>>>
>>> On 12/06/2011 13:16, Michael McCandless wrote:
>>>
>>>> Remember that memory-mapping is not a panacea: at the end of the day,
>>>> if there just isn't enough RAM on the machine to keep your full
>>>> "working set" hot, then the OS will have to hit the disk, regardless
>>>> of whether the access is through MMap or a "traditional" IO request.
>>>>
>>>> That said, on Fedora Linux anyway, I generally see better performance
>>>> from MMap than from NIOFSDir; eg see the 2nd chart here:
>>>>
>>>>
>>>>
>>>> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Sun, Jun 12, 2011 at 4:10 AM, Itamar Syn-Hershko<[email protected]> wrote:
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>> The whole point of my question was to find out if and how to do
>>>>> balancing on the SAME machine. Apparently that's not going to help,
>>>>> and at a certain point we will just have to prompt the user to buy
>>>>> more hardware...
>>>>>
>>>>>
>>>>> Out of curiosity, isn't there anything that we can do to avoid that?
>>>>> For instance, using memory-mapped files for the indexes? Anything
>>>>> that would help us overcome OS limitations of that sort...
>>>>>
>>>>>
>>>>> Also, you mention a scheduled job to check for performance
>>>>> degradation; any idea how serious such a drop should be for sharding
>>>>> to be really beneficial? Or is it application-specific too?
>>>>>
>>>>>
>>>>> Itamar.
>>>>>
>>>>>
>>>>> On 12/06/2011 06:43, Shai Erera wrote:
>>>>>
>>>>>> I agree w/ Erick, there is no cutoff point (index size for that
>>>>>> matter) above which you start sharding.
>>>>>>
>>>>>> What you can do is create a scheduled job in your system that runs a
>>>>>> select list of queries and monitors their performance. Once it
>>>>>> degrades, it shards the index by either splitting it (you can use
>>>>>> IndexSplitter under contrib) or creating a new shard and directing
>>>>>> new documents to it.
>>>>>>
>>>>>> I think I read somewhere, not sure if it was in Solr or ElasticSearch
>>>>>> documentation, about a Balancer object, which moves shards around in
>>>>>> order to balance the load on the cluster. You can implement something
>>>>>> similar which tries to balance the index sizes, creates new shards
>>>>>> on-the-fly, even merges shards if suddenly a whole source is being
>>>>>> removed from the system etc.
>>>>>>
>>>>>> Also, note that the 'largest index size' threshold is really a
>>>>>> machine constraint and not Lucene's. So if you decide that 10 GB is
>>>>>> your cutoff, it is pointless to create 10x10GB shards on the same
>>>>>> machine -- searching them is just like searching a 100GB index w/
>>>>>> 10x10GB segments. Perhaps it's even worse because you consume more
>>>>>> RAM when the indexes are split (e.g., terms index, field infos
>>>>>> etc.).
>>>>>>
>>>>>> Shai
>>>>>>
>>>>>> On Sun, Jun 12, 2011 at 3:10 AM, Erick Erickson<[email protected]> wrote:
>>>>>>
>>>>>>> <<<We can't assume anything about the machine running it,
>>>>>>> so testing won't really tell us much>>>
>>>>>>>
>>>>>>> Hmmm, then it's pretty hopeless I think. Problem is that
>>>>>>> anything you say about running on a machine with
>>>>>>> 2G available memory on a single processor is completely
>>>>>>> incomparable to running on a machine with 64G of
>>>>>>> memory available for Lucene and 16 processors.
>>>>>>>
>>>>>>> There's really no such thing as an "optimum" Lucene index
>>>>>>> size, it always relates to the characteristics of the
>>>>>>> underlying hardware.
>>>>>>>
>>>>>>> I think the best you can do is actually test on various
>>>>>>> configurations, then at least you can say "on configuration
>>>>>>> X this is the tipping point".
>>>>>>>
>>>>>>> Sorry there isn't a better answer that I know of, but...
>>>>>>>
>>>>>>> Best
>>>>>>> Erick
>>>>>>>
>>>>>>> On Sat, Jun 11, 2011 at 3:37 PM, Itamar Syn-Hershko<[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I know Lucene indexes to be at their optimum up to a certain size -
>>>>>>>> said to be around several GBs. I haven't found a good discussion of
>>>>>>>> this, but it's my understanding that at some point it's better to
>>>>>>>> split an index into parts (a la sharding) than to continue searching
>>>>>>>> on a huge index. I assume this has to do with OS and IO
>>>>>>>> configurations. Can anyone point me to more info on this?
>>>>>>>>
>>>>>>>> We have a product that is using Lucene for various searches, and at
>>>>>>>> the moment each type of search is using its own Lucene index. We
>>>>>>>> plan on refactoring the way it works and combining all indexes into
>>>>>>>> one - making the whole system more robust and with a smaller memory
>>>>>>>> footprint, among other things.
>>>>>>>>
>>>>>>>> Assuming the above is true, we are interested in knowing how to do
>>>>>>>> this correctly. Initially all our indexes will be run in one big
>>>>>>>> index, but if at some index size there is a severe performance
>>>>>>>> degradation we would like to handle that correctly by starting a
>>>>>>>> new FSDirectory index to flush into, or by re-indexing and moving
>>>>>>>> large indexes into their own Lucene index.
>>>>>>>>
>>>>>>>> Are there any guidelines for measuring or estimating this
>>>>>>>> correctly? What should we be aware of while considering all that?
>>>>>>>> We can't assume anything about the machine running it, so testing
>>>>>>>> won't really tell us much...
>>>>>>>>
>>>>>>>> Thanks in advance for any input on this,
>>>>>>>>
>>>>>>>> Itamar.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>>
>
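
Shai's suggestion earlier in this thread (a scheduled job that runs a
select list of queries and monitors their performance) could be sketched
roughly as follows. This is only an illustration under stated
assumptions, not anything from Lucene itself: the class name
QueryLatencyMonitor, the searchFn callback, and the degradationFactor
threshold are all hypothetical, and the sketch uses only the JDK. What to
do once degradation is detected (splitting with IndexSplitter, starting a
new shard, alerting an operator) is left to the caller's callback.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: periodically times a fixed query list and flags slowdowns. */
class QueryLatencyMonitor {
    private final List<String> queries;       // the "select list of queries"
    private final Consumer<String> searchFn;  // placeholder for the application's real search call
    private final double degradationFactor;   // assumed threshold, e.g. 2.0 = twice the baseline
    private long baselineNanos = -1;          // median latency of the first run

    QueryLatencyMonitor(List<String> queries, Consumer<String> searchFn, double degradationFactor) {
        if (queries.isEmpty()) {
            throw new IllegalArgumentException("need at least one query");
        }
        this.queries = queries;
        this.searchFn = searchFn;
        this.degradationFactor = degradationFactor;
    }

    /** Median of the values (the upper median when the count is even). */
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    /** Pure decision rule: is the current latency more than factor times the baseline? */
    static boolean isDegraded(long baselineNanos, long currentNanos, double factor) {
        return baselineNanos > 0 && currentNanos > baselineNanos * factor;
    }

    /** Runs every query once and returns the median wall-clock time in nanoseconds. */
    long measureMedianNanos() {
        long[] timings = new long[queries.size()];
        for (int i = 0; i < timings.length; i++) {
            long start = System.nanoTime();
            searchFn.accept(queries.get(i));
            timings[i] = System.nanoTime() - start;
        }
        return median(timings);
    }

    /** The first call records the baseline; later calls compare against it. */
    synchronized boolean checkOnce() {
        long current = measureMedianNanos();
        if (baselineNanos < 0) {
            baselineNanos = Math.max(1, current);
            return false;
        }
        return isDegraded(baselineNanos, current, degradationFactor);
    }

    /** Schedules checkOnce() at a fixed rate and invokes onDegraded when it returns true. */
    ScheduledExecutorService start(long periodMinutes, Runnable onDegraded) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                if (checkOnce()) {
                    onDegraded.run();
                }
            } catch (RuntimeException e) {
                // An uncaught exception would cancel all future runs, so log and continue.
                System.err.println("latency check failed: " + e);
            }
        }, 0, periodMinutes, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

As Erick and Shai both point out, any threshold is hardware-specific, so
the factor would have to be calibrated on the machine actually running
the index; a baseline taken from a cold cache would also skew the
comparison, so in practice one would probably warm up before recording it.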