Cool.

Since you must certainly already have a good partitioning scheme, could you
elaborate at a high level on how you set it up?

I'm certain I will shoot myself in the foot once or twice before getting it
right, but that's what I'm good at: never stop trying :)
Still, it is nice to at least start playing on the right side of the
football field, so a little nudge in the right direction would be really helpful.

Kindly

//Marcus



On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]>
wrote:

> Hi, we have an index of ~300GB, which is at least approaching the ballpark
> you're in.
>
> Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
> index, so we can just scale out horizontally across commodity hardware with
> no problems at all. We're also using the multicore features available in the
> development Solr version to reduce the granularity of core size by an order of
> magnitude: this makes for lots of small commits, rather than a few long ones.
>
> There was mention somewhere in the thread of document collections: if
> you're going to be filtering by collection, I'd strongly recommend
> partitioning too. It makes scaling so much less painful!
>
> James
>
>
> On 8 May 2008, at 23:37, marcusherou wrote:
>
>
>> Hi.
>>
>> I will be heading down a path like yours within a few months from now.
>> Currently I have an index of ~10M docs and store only IDs in the index, for
>> performance and distribution reasons. When we enter a new market I'm
>> assuming we will soon hit 100M, and quite soon after that 1G documents. Each
>> document holds on average about 3-5 kB of data.
>>
>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
>> as shared storage (think of it as a SAN, or at least shared storage with one
>> mount point). I hope this will be the right choice; only the future can tell.
>>
>> Since we are developing a search engine, I frankly don't think even having
>> hundreds of Solr instances serving the index will cut it performance-wise if
>> we have one big index. I totally agree with the others claiming that you
>> will most definitely go OOM, or hit some other constraint of Solr, if you
>> must hold the whole result set in memory, sort it, and create an XML
>> response. I hit such constraints when I couldn't afford to give the
>> instances enough memory, and I had only 1M docs back then. And think of it:
>> optimizing a TB index will take a long, long time, and you really want an
>> optimized index if you want to reduce search time.
>>
>> I am thinking of a sharding solution where I fragment the index over the
>> disk(s) and let each Solr instance hold only a little piece of the total
>> index. This will require a master database or namenode of some sort (or,
>> more simply, just a properties file in each index dir) that knows which
>> docs are located on which machine, or at least how many docs each shard
>> has. This is to ensure that whenever you introduce a new Solr instance with
>> a new shard, the master indexer knows which shard to prioritize. That is
>> probably not enough either, since all new docs will go to the new shard
>> until it is filled (has the same size as the others); only then will all
>> shards receive docs in a load-balanced fashion. So whenever you want to add
>> a new indexer, you probably need to initiate a "stealing" process where it
>> steals docs from the others until it reaches some sort of threshold (10
>> servers = each shard should have 1/10 of the docs, or similar).
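>>
>> To make that a bit more concrete, the routing/balancing logic in the master
>> indexer could look roughly like the sketch below. This is only my plan, not
>> anything Solr provides: ShardInfo and the fair-share threshold are my own
>> bookkeeping.
>>
>> import java.util.List;
>> import java.util.Random;
>>
>> // Sketch only: route new docs to the emptiest shard until the shards are
>> // balanced, then load-balance. ShardInfo is hypothetical bookkeeping kept
>> // by the master database/namenode (or the per-shard properties file).
>> public class ShardRouter
>> {
>>     public static class ShardInfo
>>     {
>>         private final String indexDir;
>>         private long docCount;
>>
>>         public ShardInfo(String indexDir, long docCount)
>>         {
>>             this.indexDir = indexDir;
>>             this.docCount = docCount;
>>         }
>>
>>         public String getIndexDir() { return indexDir; }
>>         public long getDocCount() { return docCount; }
>>         public void incrementDocCount() { docCount++; }
>>     }
>>
>>     private final List<ShardInfo> shards;
>>     private final Random random = new Random();
>>
>>     public ShardRouter(List<ShardInfo> shards)
>>     {
>>         this.shards = shards;
>>     }
>>
>>     public ShardInfo pickShardForNewDoc()
>>     {
>>         long total = 0;
>>         ShardInfo smallest = shards.get(0);
>>         for (ShardInfo shard : shards)
>>         {
>>             total += shard.getDocCount();
>>             if (shard.getDocCount() < smallest.getDocCount())
>>             {
>>                 smallest = shard;
>>             }
>>         }
>>         long fairShare = total / shards.size(); // 10 shards -> ~1/10 each
>>
>>         // A freshly added shard is far below its fair share: prioritize it
>>         // (the phase where it effectively "steals" docs from the others).
>>         if (smallest.getDocCount() < fairShare)
>>         {
>>             return smallest;
>>         }
>>         // All shards are roughly balanced: pick one at random (load-balanced).
>>         return shards.get(random.nextInt(shards.size()));
>>     }
>> }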
>>
>> I think this will cut it and enable us to grow with the data. I also think
>> doing a distributed reindexing will be a good thing when it comes to cutting
>> both indexing and optimizing time. Probably each indexer should buffer its
>> shard locally on RAID1 SCSI disks, optimize it, and then just copy it to the
>> main index to minimize the burden on the shared storage.
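>>
>> The publish step itself could be as simple as the sketch below (just an
>> illustration; I left the Lucene optimize call as a comment since the exact
>> API depends on the Lucene version in use, and the copy is plain java.io onto
>> the shared mount):
>>
>> import java.io.*;
>>
>> // Sketch: optimize a locally buffered shard, then bulk-copy the index files
>> // to shared storage so the shared mount only sees sequential writes.
>> public class ShardPublisher
>> {
>>     public void publish(File localShardDir, File sharedShardDir) throws IOException
>>     {
>>         // 1. Optimize the shard on the fast local RAID1 disks first, e.g.:
>>         //    IndexWriter writer = new IndexWriter(localShardDir, analyzer, false);
>>         //    writer.optimize();
>>         //    writer.close();
>>
>>         // 2. Copy the optimized index files onto the shared (GlusterFS) mount.
>>         if (!sharedShardDir.exists() && !sharedShardDir.mkdirs())
>>         {
>>             throw new IOException("Could not create " + sharedShardDir);
>>         }
>>         File[] files = localShardDir.listFiles();
>>         if (files == null)
>>         {
>>             throw new IOException("Not a directory: " + localShardDir);
>>         }
>>         for (File f : files)
>>         {
>>             copy(f, new File(sharedShardDir, f.getName()));
>>         }
>>     }
>>
>>     private void copy(File src, File dst) throws IOException
>>     {
>>         InputStream in = new FileInputStream(src);
>>         OutputStream out = new FileOutputStream(dst);
>>         try
>>         {
>>             byte[] buf = new byte[256 * 1024];
>>             int n;
>>             while ((n = in.read(buf)) > 0)
>>             {
>>                 out.write(buf, 0, n);
>>             }
>>         }
>>         finally
>>         {
>>             in.close();
>>             out.close();
>>         }
>>     }
>> }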
>>
>> Let's say the indexing part is all fancy and working at TB scale; now we
>> come to searching. After talking to other people who have built big search
>> engines, I personally believe you need to introduce a controller-like
>> searcher on the client side which itself searches all of the shards and
>> merges the responses. Perhaps Distributed Solr solves this; I would love to
>> test it whenever my new installation of servers and enclosures is finished.
>>
>> Currently my idea is something like this.
>> public Page<Document> search(SearchDocumentCommand sdc)
>> {
>>     // Fan the query out to one randomly chosen replica per shard,
>>     // then merge the partial results client-side.
>>     Set<Integer> ids = documentIndexers.keySet();
>>     int nrOfSearchers = ids.size();
>>     int totalItems = 0;
>>     Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
>>     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
>>     {
>>         Integer id = iterator.next();
>>         // Pick a random replica of this shard to spread the load.
>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>         DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
>>         // Ask each shard for a smaller slice of the requested page
>>         // (this is the paging logic I suspect is broken, see below).
>>         SearchDocumentCommand sdc2 = copy(sdc);
>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
>>         Page<Document> res = indexer.search(sdc2);
>>         totalItems += res.getTotalItems();
>>         docs.addAll(res);
>>     }
>>
>>     // Re-sort the merged results and fix up the total hit count.
>>     if (sdc.getComparator() != null)
>>     {
>>         Collections.sort(docs, sdc.getComparator());
>>     }
>>
>>     docs.setTotalItems(totalItems);
>>
>>     return docs;
>> }
>>
>> This is from my RaidedDocumentIndexer, which wraps a set of DocumentIndexers.
>> I switch back and forth between Solr and raw Lucene for benchmarking and
>> comparison, so I have two implementations of DocumentIndexer
>> (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
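>>
>> For context, the DocumentIndexer abstraction only needs to be something like
>> the sketch below (simplified; the method names are illustrative, not an
>> existing Solr or Lucene API, and Document/Page/SearchDocumentCommand are the
>> same application classes used in the snippet above):
>>
>> // Simplified sketch of the indexer abstraction that makes it easy to swap
>> // the Solr-backed and the raw-Lucene-backed implementations.
>> public interface DocumentIndexer
>> {
>>     // Add or update a single document in this shard.
>>     void index(Document doc);
>>
>>     // Search this shard only; paging and sorting are driven by the command.
>>     Page<Document> search(SearchDocumentCommand sdc);
>>
>>     // Make pending changes searchable.
>>     void commit();
>>
>>     // Merge segments to keep per-shard search times down.
>>     void optimize();
>> }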
>>
>> I think this approach is quite OK, but I suspect the paging is broken.
>> Moreover, the search time will at best be proportional to the number of
>> searchers, and probably a lot worse. To get even more speed, each document
>> indexer should be put into a separate thread, with something like
>> EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread
>> pool. The future result times out after, let's say, 750 ms, and the client
>> ignores all searchers that are slower. Probably some performance metrics
>> should also be gathered about each searcher so the client knows which
>> indexers to prefer over the others.
>> But of course, if you have 50 searchers, having each client thread spawn yet
>> another 50 threads isn't a good thing either. So perhaps a combination of
>> iterative and parallel search needs to be done, with the ratio configurable.
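>>
>> With java.util.concurrent (the standardized successor of Doug Lea's
>> EDU.oswego package), the parallel fan-out with a timeout could be sketched
>> roughly like this (just an illustration; the 750 ms budget is an example,
>> and DocumentIndexer/Page/SearchDocumentCommand are the types from my
>> snippets above):
>>
>> import java.util.ArrayList;
>> import java.util.Collections;
>> import java.util.List;
>> import java.util.concurrent.Callable;
>> import java.util.concurrent.ExecutorService;
>> import java.util.concurrent.Future;
>> import java.util.concurrent.TimeUnit;
>>
>> // Sketch: query every shard in parallel on a shared thread pool and simply
>> // drop the shards that cannot answer within the time budget.
>> public class ParallelSearcher
>> {
>>     private final ExecutorService pool; // one shared pool, not a pool per request
>>
>>     public ParallelSearcher(ExecutorService pool)
>>     {
>>         this.pool = pool;
>>     }
>>
>>     public Page<Document> search(List<DocumentIndexer> shardIndexers,
>>                                  final SearchDocumentCommand sdc,
>>                                  long timeoutMillis) throws InterruptedException
>>     {
>>         List<Callable<Page<Document>>> tasks = new ArrayList<Callable<Page<Document>>>();
>>         for (final DocumentIndexer indexer : shardIndexers)
>>         {
>>             tasks.add(new Callable<Page<Document>>()
>>             {
>>                 public Page<Document> call() throws Exception
>>                 {
>>                     return indexer.search(sdc);
>>                 }
>>             });
>>         }
>>
>>         // invokeAll cancels any task that has not finished when the timeout expires.
>>         List<Future<Page<Document>>> futures =
>>                 pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);
>>
>>         Page<Document> merged = new Page<Document>(sdc.getPage(), sdc.getPageSize());
>>         int totalItems = 0;
>>         for (Future<Page<Document>> future : futures)
>>         {
>>             if (future.isCancelled())
>>             {
>>                 continue; // this shard was too slow for this query: ignore it
>>             }
>>             try
>>             {
>>                 Page<Document> res = future.get();
>>                 totalItems += res.getTotalItems();
>>                 merged.addAll(res);
>>             }
>>             catch (Exception e)
>>             {
>>                 // a single failing shard should not kill the whole query
>>             }
>>         }
>>
>>         if (sdc.getComparator() != null)
>>         {
>>             Collections.sort(merged, sdc.getComparator());
>>         }
>>         merged.setTotalItems(totalItems);
>>         return merged;
>>     }
>> }
>>
>> Gathering per-shard latencies in that merge loop would also give the metrics
>> for preferring the fast indexers over the slow ones.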
>>
>> The controller pattern is used by Google, I think; Peter Zaitzev
>> (mysqlperformanceblog) once told me so.
>>
>> Hope I gave some insight into how I plan to scale to TB size, and hopefully
>> someone will smack me on the head and say "Hey dude, do it like this instead".
>>
>> Kindly
>>
>> //Marcus
>>
>>
>>
>>
>>
>>
>>
>>
>> Phillip Farber wrote:
>>
>>>
>>> Hello everyone,
>>>
>>> We are considering Solr 1.2 to index and search a terabyte-scale dataset
>>> of OCR.  Initially our requirements are simple: basic tokenizing, score
>>> sorting only, no faceting.  The schema is simple too.  A document consists
>>> of a numeric id (stored and indexed) and a large text field (indexed, not
>>> stored) containing the OCR, typically ~1.4 MB.  Some limited faceting or
>>> additional metadata fields may be added later.
>>>
>>> The data in question currently amounts to about 1.1 TB of OCR (about 1M
>>> docs), which we expect to increase to 10 TB over time.  Pilot tests on a
>>> desktop (2.6 GHz P4 with 2.5 GB memory, 1 GB Java heap) on ~180 MB of data
>>> via HTTP suggest we can index at a rate sufficient to keep up with the
>>> inputs (after getting over the 1.1 TB hump).  We envision nightly
>>> commits/optimizes.
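>>>
>>> (For concreteness, a minimal sketch of the kind of HTTP add involved is
>>> below; the field names, URL and escaping are placeholders/assumptions, not
>>> our actual schema or code.)
>>>
>>> import java.io.OutputStreamWriter;
>>> import java.io.Writer;
>>> import java.net.HttpURLConnection;
>>> import java.net.URL;
>>>
>>> // Sketch: post one document as a standard Solr XML <add> over HTTP.
>>> public class OcrIndexer
>>> {
>>>     public void add(String solrUpdateUrl, String id, String ocrText) throws Exception
>>>     {
>>>         String xml = "<add><doc>"
>>>                 + "<field name=\"id\">" + id + "</field>"
>>>                 + "<field name=\"ocr\">" + escape(ocrText) + "</field>"
>>>                 + "</doc></add>";
>>>
>>>         HttpURLConnection conn =
>>>                 (HttpURLConnection) new URL(solrUpdateUrl).openConnection();
>>>         conn.setDoOutput(true);
>>>         conn.setRequestMethod("POST");
>>>         conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
>>>
>>>         Writer out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
>>>         try
>>>         {
>>>             out.write(xml);
>>>         }
>>>         finally
>>>         {
>>>             out.close();
>>>         }
>>>         if (conn.getResponseCode() != 200)
>>>         {
>>>             throw new RuntimeException("Solr update failed: " + conn.getResponseCode());
>>>         }
>>>         conn.disconnect();
>>>     }
>>>
>>>     private String escape(String s)
>>>     {
>>>         // minimal XML escaping for the OCR text
>>>         return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
>>>     }
>>> }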
>>>
>>> We expect a low QPS rate (<10) and probably will not need millisecond
>>> query response.
>>>
>>> Our environment makes available Apache on blade servers (Dell 1955 dual
>>> dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
>>> high-performance NAS system over a dedicated (out-of-band) GbE switch
>>> (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting
>>> with 2 blades and will add as demands require.
>>>
>>> While we have a lot of storage, the idea of master/slave Solr Collection
>>> Distribution to add more Solr instances clearly means duplicating an
>>> immense index.  Is it possible to use one instance to update the index
>>> on NAS while other instances only read the index and commit to keep
>>> their caches warm instead?
>>>
>>> Should we expect Solr indexing time to slow significantly as we scale
>>> up?  What kind of query performance could we expect?  Is it totally
>>> naive even to consider Solr at this kind of scale?
>>>
>>> Given these parameters is it realistic to think that Solr could handle
>>> the task?
>>>
>>> Any advice/wisdom greatly appreciated,
>>>
>>> Phil
>>>
>>>
>>>
>>>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
