Re: Solr feasibility with terabyte-scale data

James Brady Fri, 09 May 2008 10:57:44 -0700

So our problem is made easier by having complete index partitionability by a user_id field. That means at one end of the spectrum, we could have one monolithic index for everyone, while at the other end of the spectrum we could have individual cores for each user_id.

At the moment, we've gone for a halfway house somewhere in the middle: I've got several large EC2 instances (currently 3), each running a single master/slave pair of Solr servers. The servers have several cores (currently 10 - a guesstimated good number). As new users register, I automatically distribute them across cores. I would like to do something with clustering users based on geo-location so that cores will get 'time off' for maintenance and optimization during that user cluster's nighttime. I'd also like to move in the 1 core per user direction as dynamic core creation becomes available.
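
A minimal Java sketch of the "distribute new users across cores" step, assuming a simple least-loaded policy. The class UserCoreRouter, its methods and the core names are hypothetical and only illustrate the idea; the message does not say which policy is actually used.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: pins each user_id to one core, choosing the least-loaded core for new users. */
final class UserCoreRouter {
    private final List<String> cores;                               // e.g. "host1/core0" (illustrative names)
    private final Map<String, Integer> usersPerCore = new HashMap<>();
    private final Map<Long, String> coreByUser = new HashMap<>();

    UserCoreRouter(List<String> cores) {
        if (cores.isEmpty()) {
            throw new IllegalArgumentException("need at least one core");
        }
        this.cores = new ArrayList<>(cores);
        for (String core : cores) {
            usersPerCore.put(core, 0);
        }
    }

    /** Returns the core for userId; a user seen for the first time goes to the least-loaded core. */
    synchronized String coreFor(long userId) {
        String existing = coreByUser.get(userId);
        if (existing != null) {
            return existing;                                        // assignment is sticky
        }
        String best = cores.get(0);
        for (String core : cores) {
            if (usersPerCore.get(core) < usersPerCore.get(best)) {
                best = core;                                        // ties keep the earlier core
            }
        }
        coreByUser.put(userId, best);
        usersPerCore.merge(best, 1, Integer::sum);
        return best;
    }

    /** Number of users currently assigned to the given core. */
    synchronized int userCount(String core) {
        return usersPerCore.get(core);
    }
}
```

In a real deployment the user-to-core map would have to be persisted somewhere (a database, for instance), since every later index or query request for a user must go to the same core.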

It seems a lot of what you're describing is really similar to MapReduce, so I think Otis' suggestion to look at Hadoop is a good one: it might prevent a lot of headaches and they've already solved a lot of the tricky problems. There are a number of ridiculously sized projects using it to solve their scale problems, not least Yahoo...


James

On 9 May 2008, at 01:17, Marcus Herou wrote:

Cool.
Since you must certainly already have a good partitioning scheme, could you
elaborate at a high level on how you set this up?
I'm certain that I will shoot myself in the foot both once and twice before
getting it right, but this is what I'm good at: to never stop trying :)
However, it is nice to start playing at least on the right side of the
football field, so a little push in the back would be really helpful.

Kindly

//Marcus
On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]>
wrote:
Hi, we have an index of ~300GB, which is at least approaching the ballpark
you're in.
Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable' index, so we can just scale out horizontally across commodity hardware with no problems at all. We're also using the multicore features available in the development Solr version to reduce the granularity of core size by an order of magnitude: this makes for lots of small commits, rather than a few long ones.
There was mention somewhere in the thread of document collections: if
you're going to be filtering by collection, I'd strongly recommend
partitioning too. It makes scaling so much less painful!

James


On 8 May 2008, at 23:37, marcusherou wrote:
Hi.
I will as well head into a path like yours within some months from now. Currently I have an index of ~10M docs and only store IDs in the index for performance and distribution reasons. When we enter a new market I'm assuming we will soon hit 100M and quite soon after that 1G documents. Each document has on average about 3-5k of data.

We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures as shared storage (think of it as a SAN, or at least as shared storage with one mount point). Hope this will be the right choice; only the future can tell.
Since we are developing a search engine I frankly don't think even having 100's of SOLR instances serving the index will cut it performance-wise if we have one big index. I totally agree with the others claiming that you most definitely will run out of memory or hit some other constraints of SOLR if you must have the whole result in memory, sort it and create an XML response. I did hit such constraints when I couldn't afford the instances to have enough memory, and I had only 1M docs back then. And think of it... Optimizing a TB index will take a long, long time, and you really want to have an optimized index if you want to reduce search time.
I am thinking of a sharding solution where I fragment the index over the disk(s) and let each SOLR instance have only a little piece of the total index. This will require a master database or namenode (or, simpler, just a properties file in each index dir) of some sort to know which docs are located on which machine, or at least how many docs each shard has. This is to ensure that whenever you introduce a new SOLR instance with a new shard, the master indexer will know which shard to prioritize. This is probably not enough either, since all new docs will go to the new shard until it is filled (has the same size as the others); only then will all shards receive docs in a load-balanced fashion. So whenever you want to add a new indexer you probably need to initiate a "stealing" process where it steals docs from the others until it reaches some sort of threshold (10 servers = each shard should have 1/10 of the docs or such).
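
The "stealing" threshold can be sketched as a small calculation. The Java below is only an illustration under stated assumptions: it needs nothing but the per-shard document counts (the "how many docs each shard has" bookkeeping mentioned above), ShardRebalancer and stealPlan are made-up names, and how documents are actually moved between shards is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: plans how many docs a new, empty shard takes from each existing shard. */
final class ShardRebalancer {

    /**
     * The target is the even share after adding one shard: total / (N + 1).
     * Donors holding more than the target give up their surplus, in map order,
     * until the new shard has reached the target.
     */
    static Map<String, Long> stealPlan(Map<String, Long> docsPerShard) {
        long total = 0;
        for (long count : docsPerShard.values()) {
            total += count;
        }
        long target = total / (docsPerShard.size() + 1);
        long remaining = target;                          // docs the new shard still needs
        Map<String, Long> plan = new LinkedHashMap<>();
        for (Map.Entry<String, Long> shard : docsPerShard.entrySet()) {
            long surplus = shard.getValue() - target;
            long take = Math.min(Math.max(surplus, 0), remaining);
            if (take > 0) {
                plan.put(shard.getKey(), take);
                remaining -= take;
            }
        }
        return plan;
    }
}
```

With nine shards of 100 docs each, the target for a tenth shard is 90, so each existing shard hands over 10 docs. If the existing shards are uneven, the plan stops as soon as the new shard reaches the target, so the old shards are not necessarily levelled among themselves.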
I think this will cut it and enable us to grow with the data. I think doing a distributed reindexing will as well be a good thing when it comes to cutting both indexing and optimizing time. Probably each indexer should buffer its shard locally on RAID1 SCSI disks, optimize it and then just copy it to the main index to minimize the burden on the shared storage.
Let's say the indexing part is all fancy and working at TB scale; now we come to searching. I personally believe, after talking to other guys who have built big search engines, that you need to introduce a controller-like searcher on the client side which itself searches in all of the shards and merges the response. Perhaps Distributed Solr solves this, and I will love to test it whenever my new installation of servers and enclosures is finished.

Currently my idea is something like this.
// Needs java.util.Collections and java.util.List; documentIndexers maps
// a shard id to the list of indexers (replicas) serving that shard.
public Page<Document> search(SearchDocumentCommand sdc)
{
    int nrOfSearchers = documentIndexers.size();
    int totalItems = 0;
    Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());

    // Query one randomly chosen indexer per shard and collect the hits.
    for (List<DocumentIndexer> indexers : documentIndexers.values())
    {
        DocumentIndexer indexer = indexers.get(random.nextInt(indexers.size()));
        SearchDocumentCommand sdc2 = copy(sdc);
        // Splitting the page number between the searchers is only approximate.
        sdc2.setPage(sdc.getPage() / nrOfSearchers);
        // Search with the per-shard copy (the original code passed sdc here).
        Page<Document> res = indexer.search(sdc2);
        totalItems += res.getTotalItems();
        docs.addAll(res);
    }

    // Re-sort the merged hits if the caller asked for a specific order.
    if (sdc.getComparator() != null)
    {
        Collections.sort(docs, sdc.getComparator());
    }

    docs.setTotalItems(totalItems);

    return docs;
}
This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers. I switch from Solr to raw Lucene back and forth, benchmarking and comparing stuff, so I have two implementations of DocumentIndexer (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
I think this approach is quite OK, but the paging stuff is broken, I think. However, the search time will at best be proportional to the number of searchers (they are queried one after another), probably a lot worse. To get even more speed each document indexer should be put into a separate thread with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in conjunction with a thread pool. The FutureResult times out after, let's say, 750 msec and the client ignores all searchers which are slower. Probably some performance metrics should be gathered about each searcher so the client knows which indexers to prefer over the others.
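
On the paging problem: one common way to page across shards is to ask every shard for its top (page + 1) * pageSize hits, merge and sort them, and then cut the requested page out of the merged list. A minimal Java sketch, assuming zero-based page numbers; ShardPager and its methods are made-up names for illustration and are not part of Solr or Lucene.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of exact paging over several shards. */
final class ShardPager {

    /** How many top hits each shard must return so the merged page is exact (page is zero-based). */
    static int perShardLimit(int page, int pageSize) {
        return (page + 1) * pageSize;
    }

    /** Merges the per-shard hit lists, sorts them by the given order and returns one page. */
    static <T> List<T> mergePage(List<List<T>> shardHits, Comparator<? super T> order,
                                 int page, int pageSize) {
        List<T> all = new ArrayList<>();
        for (List<T> hits : shardHits) {
            all.addAll(hits);
        }
        all.sort(order);
        int from = Math.min(page * pageSize, all.size());
        int to = Math.min(from + pageSize, all.size());
        return new ArrayList<>(all.subList(from, to));
    }
}
```

The cost is that deep pages get expensive: page 100 with 10 hits per page means every shard has to return 1010 hits, which is one reason to cap how far a client may page.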
But of course, if you have 50 searchers, having each client thread spawn yet another 50 threads isn't a good thing either. So perhaps a combo of iterative and parallel search needs to be done, with the ratio configurable.
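
The timeout idea and the thread-count worry can both be sketched with java.util.concurrent, the standard-library successor to the EDU.oswego.cs.dl.util.concurrent package. This is a sketch under assumptions, not a tested design: TimedFanOut and searchAll are made-up names, each shard query is assumed to be wrapped in a Callable, and the pool is assumed to be one shared, fixed-size ExecutorService whose size plays the role of the configurable ratio.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: query all shards in parallel and drop the ones that are too slow. */
final class TimedFanOut {

    /**
     * Runs every shard query on the shared pool and returns the results that
     * completed within timeoutMillis. Slow or failing shards are skipped.
     */
    static <T> List<T> searchAll(ExecutorService pool, List<Callable<T>> shardQueries,
                                 long timeoutMillis) throws InterruptedException {
        // invokeAll cancels every task that has not finished when the timeout expires.
        List<Future<T>> futures =
                pool.invokeAll(shardQueries, timeoutMillis, TimeUnit.MILLISECONDS);
        List<T> results = new ArrayList<>();
        for (Future<T> future : futures) {
            try {
                // Every future is either done or cancelled by now, so get() does not block.
                results.add(future.get());
            } catch (CancellationException | ExecutionException e) {
                // The shard timed out or threw; ignore it for this request.
            }
        }
        return results;
    }
}
```

With a pool created once via Executors.newFixedThreadPool(n) and shared by all client threads, 50 searchers never turn into 50 extra threads per request: at most n shard queries run at a time and the rest wait in the queue, although queued queries still count against the timeout.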
I think the controller pattern is used by Google; Peter Zaitsev (mysqlperformanceblog) once told me so, I think.
Hope I gave some insight into how I plan to scale to TB size, and hopefully someone smacks me on the head and says "Hey dude, do it like this instead".
Kindly

//Marcus








Phillip Farber wrote:
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale dataset of OCR. Initially our requirements are simple: basic tokenizing, score sorting only, no faceting. The schema is simple too. A document consists of a numeric id, stored and indexed, and a large text field, indexed not stored, containing the OCR, typically ~1.4MB. Some limited faceting or additional metadata fields may be added later.
The data in question currently amounts to about 1.1TB of OCR (about 1M docs), which we expect to increase to 10TB over time. Pilot tests on the desktop w/ a 2.6 GHz P4 with 2.5GB memory and a 1GB Java heap on ~180MB of data via HTTP suggest we can index at a rate sufficient to keep up with the inputs (after getting over the 1.1TB hump). We envision nightly commits/optimizes.

We expect to have low QPS (<10) rate and probably will not need
millisecond query response.
Our environment makes available Apache on blade servers (Dell 1955 dual dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*, high-performance NAS system over a dedicated (out-of-band) GbE switch (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are starting with 2 blades and will add as demands require.
While we have a lot of storage, the idea of master/slave Solr Collection Distribution to add more Solr instances clearly means duplicating an immense index. Is it possible to use one instance to update the index on NAS while other instances only read the index and commit to keep their caches warm instead?
Should we expect Solr indexing time to slow significantly as we scale
up?  What kind of query performance could we expect?  Is it totally
naive even to consider Solr at this kind of scale?
Given these parameters is it realistic to think that Solr could handle
the task?

Any advice/wisdom greatly appreciated,

Phil
--
View this message in context:
http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
Sent from the Solr - User mailing list archive at Nabble.com.
--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/
