Cool. Since you must certainly already have a good partitioning scheme,
could you elaborate at a high level on how you set this up?
I'm certain that I will shoot myself in the foot both once and twice before
getting it right, but this is what I'm good at: never stop trying :) However,
it is nice to start playing at least on the right side of the football field,
so a little push in the back would be really helpful.

Kindly
//Marcus

On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]> wrote:

> Hi, we have an index of ~300GB, which is at least approaching the ballpark
> you're in.
>
> Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
> index, so we can just scale out horizontally across commodity hardware
> with no problems at all. We're also using the multicore features available
> in the development Solr version to reduce the granularity of core size by
> an order of magnitude: this makes for lots of small commits, rather than a
> few long ones.
>
> There was mention somewhere in the thread of document collections: if
> you're going to be filtering by collection, I'd strongly recommend
> partitioning too. It makes scaling so much less painful!
>
> James
>
> On 8 May 2008, at 23:37, marcusherou wrote:
>
>> Hi.
>>
>> I will head down a path like yours within some months from now as well.
>> Currently I have an index of ~10M docs and store only ids in the index,
>> for performance and distribution reasons. When we enter a new market I
>> assume we will soon hit 100M and quite soon after that 1G documents.
>> Each document has on average about 3-5 KB of data.
>>
>> We will use a GlusterFS installation with RAID1 (or RAID10) SATA
>> enclosures as shared storage (think of it as a SAN, or at least shared
>> storage with one mount point). Hope this will be the right choice; only
>> the future can tell.
>>
>> Since we are developing a search engine, I frankly don't think even
>> hundreds of Solr instances serving the index will cut it
>> performance-wise if we have one big index. I totally agree with the
>> others claiming that you will most definitely go OOM, or hit some other
>> constraint of Solr, if you must hold the whole result set in memory,
>> sort it and create an XML response. I hit such constraints myself when I
>> couldn't afford to give the instances enough memory, and I had only 1M
>> docs back then. And think of it... optimizing a TB-scale index will take
>> a long, long time, and you really want an optimized index if you want to
>> reduce search time.
>>
>> I am thinking of a sharding solution where I fragment the index over the
>> disk(s) and let each Solr instance serve only a little piece of the
>> total index. This will require a master database or namenode (or, more
>> simply, just a properties file in each index dir) of some sort, to know
>> which docs are located on which machine, or at least how many docs each
>> shard holds. This is to ensure that whenever you introduce a new Solr
>> instance with a new shard, the master indexer will know which shard to
>> prioritize. This is probably not enough either, since all new docs will
>> go to the new shard until it is filled (i.e. has the same size as the
>> others); only then will all shards receive docs in a load-balanced
>> fashion. So whenever you want to add a new indexer, you probably need to
>> initiate a "stealing" process where it steals docs from the others until
>> it reaches some sort of threshold (10 servers = each shard should have
>> 1/10 of the docs, or such). A rough sketch of this balancing logic
>> follows below.
>>
>> I think this will cut it, enabling us to grow with the data.
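>>
>> The balancing could look roughly like this (completely untested, and all
>> class and method names below are made up just for illustration):
>>
>> import java.util.Collections;
>> import java.util.Comparator;
>> import java.util.List;
>>
>> public class ShardRouter
>> {
>>     public static class ShardInfo
>>     {
>>         final String host;
>>         long docCount;
>>
>>         ShardInfo(String host, long docCount)
>>         {
>>             this.host = host;
>>             this.docCount = docCount;
>>         }
>>     }
>>
>>     private static final Comparator<ShardInfo> BY_DOC_COUNT =
>>         new Comparator<ShardInfo>()
>>         {
>>             public int compare(ShardInfo a, ShardInfo b)
>>             {
>>                 return a.docCount < b.docCount ? -1
>>                      : a.docCount > b.docCount ? 1 : 0;
>>             }
>>         };
>>
>>     // New docs always go to the emptiest shard, so a freshly added
>>     // shard fills up on its own until the counts even out.
>>     public ShardInfo pickTargetShard(List<ShardInfo> shards)
>>     {
>>         return Collections.min(shards, BY_DOC_COUNT);
>>     }
>>
>>     // The "stealing" process stops once every shard is within some
>>     // tolerance of its fair share (totalDocs / numberOfShards).
>>     public boolean isBalanced(List<ShardInfo> shards, double tolerance)
>>     {
>>         long total = 0;
>>         for (ShardInfo s : shards)
>>         {
>>             total += s.docCount;
>>         }
>>         double ideal = (double) total / shards.size();
>>         for (ShardInfo s : shards)
>>         {
>>             if (Math.abs(s.docCount - ideal) > ideal * tolerance)
>>             {
>>                 return false;
>>             }
>>         }
>>         return true;
>>     }
>> }
>>
>> With 10 servers and, say, a 10% tolerance, stealing stops once each
>> shard holds roughly 1/10 of the docs.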
>> I think
>> doing a distributed reindexing will be a good thing as well when it
>> comes to cutting both indexing and optimizing time. Probably each
>> indexer should buffer its shard locally on RAID1 SCSI disks, optimize
>> it, and then just copy it to the main index, to minimize the burden on
>> the shared storage.
>>
>> Let's say the indexing part will be all fancy and working at TB scale;
>> now we come to searching. After talking to other guys who have built big
>> search engines, I personally believe that you need to introduce a
>> controller-like searcher on the client side, which itself searches all
>> of the shards and merges the responses. Perhaps Distributed Solr solves
>> this; I would love to test it whenever my new installation of servers
>> and enclosures is finished.
>>
>> Currently my idea is something like this:
>>
>> public Page<Document> search(SearchDocumentCommand sdc)
>> {
>>     Set<Integer> ids = documentIndexers.keySet();
>>     int nrOfSearchers = ids.size();
>>     int totalItems = 0;
>>     Page<Document> docs = new Page<Document>(sdc.getPage(), sdc.getPageSize());
>>     for (Iterator<Integer> iterator = ids.iterator(); iterator.hasNext();)
>>     {
>>         Integer id = iterator.next();
>>         // Each shard is served by a set of replicas; pick one at random.
>>         List<DocumentIndexer> indexers = documentIndexers.get(id);
>>         DocumentIndexer indexer =
>>             indexers.get(random.nextInt(indexers.size()));
>>         // Rescale the requested page for the individual shard.
>>         SearchDocumentCommand sdc2 = copy(sdc);
>>         sdc2.setPage(sdc.getPage() / nrOfSearchers);
>>         Page<Document> res = indexer.search(sdc2);
>>         totalItems += res.getTotalItems();
>>         docs.addAll(res);
>>     }
>>
>>     // Re-sort the merged result if a comparator was supplied.
>>     if (sdc.getComparator() != null)
>>     {
>>         Collections.sort(docs, sdc.getComparator());
>>     }
>>
>>     docs.setTotalItems(totalItems);
>>
>>     return docs;
>> }
>>
>> This is my RaidedDocumentIndexer, which wraps a set of DocumentIndexers.
>> I switch back and forth between Solr and raw Lucene, benchmarking and
>> comparing stuff, so I have two implementations of DocumentIndexer
>> (SolrDocumentIndexer and LuceneDocumentIndexer) to make the switch easy.
>>
>> I think this approach is quite OK, but the paging is broken: dividing
>> the requested page number across the shards only works if every shard
>> contributes equally to each result page; in general, page N of the
>> merged result can only be guaranteed by fetching the first N pages from
>> every shard and merging them. Also, the search time will at best grow
>> linearly with the number of searchers, and probably a lot worse. To get
>> more speed, each document indexer should be put into a separate thread,
>> with something like EDU.oswego.cs.dl.util.concurrent.FutureResult in
>> conjunction with a thread pool. The future result times out after, let's
>> say, 750 msec, and the client ignores all searchers which are slower.
>> Probably some performance metrics should be gathered about each
>> searcher, so the client knows which indexers to prefer over the others.
>> But of course, if you have 50 searchers, having each client thread spawn
>> yet another 50 threads isn't a good thing either; so perhaps a combo of
>> iterative and parallel search needs to be done, with the ratio
>> configurable. A rough sketch of the threaded fan-out follows below.
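>>
>> In java.util.concurrent terms (the standard-library successor of the
>> EDU.oswego package) the fan-out could look roughly like this. Untested,
>> and it reuses the DocumentIndexer, SearchDocumentCommand and Page types
>> from the method above:
>>
>> import java.util.ArrayList;
>> import java.util.List;
>> import java.util.concurrent.*;
>>
>> private final ExecutorService pool = Executors.newFixedThreadPool(16);
>>
>> public List<Page<Document>> searchAll(List<DocumentIndexer> indexers,
>>                                       final SearchDocumentCommand sdc,
>>                                       long timeoutMsec)
>>     throws InterruptedException
>> {
>>     List<Callable<Page<Document>>> tasks =
>>         new ArrayList<Callable<Page<Document>>>();
>>     for (final DocumentIndexer indexer : indexers)
>>     {
>>         tasks.add(new Callable<Page<Document>>()
>>         {
>>             public Page<Document> call() throws Exception
>>             {
>>                 return indexer.search(sdc);
>>             }
>>         });
>>     }
>>
>>     // invokeAll waits at most timeoutMsec and cancels the stragglers.
>>     List<Future<Page<Document>>> futures =
>>         pool.invokeAll(tasks, timeoutMsec, TimeUnit.MILLISECONDS);
>>
>>     List<Page<Document>> results = new ArrayList<Page<Document>>();
>>     for (Future<Page<Document>> f : futures)
>>     {
>>         if (f.isCancelled())
>>         {
>>             continue; // slower than the deadline: ignore this searcher
>>         }
>>         try
>>         {
>>             results.add(f.get());
>>         }
>>         catch (ExecutionException e)
>>         {
>>             // a failing shard is treated like a slow one and skipped
>>         }
>>     }
>>     return results;
>> }
>>
>> Merging and re-sorting then happens exactly as in the sequential loop,
>> and the place where cancelled futures get skipped is the natural spot to
>> record which searchers are consistently slow.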
>> The controller pattern is used by Google, I think; Peter Zaitsev
>> (mysqlperformanceblog) once told me so.
>>
>> Hope I gave some insights into how I plan to scale to TB size, and
>> hopefully someone smacks me on the head and says "Hey dude, do it like
>> this instead".
>>
>> Kindly
>>
>> //Marcus
>>
>>
>> Phillip Farber wrote:
>>
>>> Hello everyone,
>>>
>>> We are considering Solr 1.2 to index and search a terabyte-scale
>>> dataset of OCR. Initially our requirements are simple: basic
>>> tokenizing, score sorting only, no faceting. The schema is simple too.
>>> A document consists of a numeric id (stored and indexed) and a large
>>> text field (indexed, not stored) containing the OCR, typically ~1.4 MB.
>>> Some limited faceting or additional metadata fields may be added later.
>>>
>>> The data in question currently amounts to about 1.1 TB of OCR (about
>>> 1M docs), which we expect to increase to 10 TB over time. Pilot tests
>>> on a desktop (2.6 GHz P4 with 2.5 GB memory, 1 GB Java heap) on
>>> ~180 MB of data via HTTP suggest we can index at a rate sufficient to
>>> keep up with the inputs (after getting over the 1.1 TB hump). We
>>> envision nightly commits/optimizes.
>>>
>>> We expect a low QPS rate (<10) and probably will not need millisecond
>>> query response.
>>>
>>> Our environment makes available Apache on blade servers (Dell 1955
>>> dual dual-core 3.x GHz Xeons w/ 8 GB RAM) connected to a *large*,
>>> high-performance NAS system over a dedicated (out-of-band) GbE switch
>>> (Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are
>>> starting with 2 blades and will add more as demand requires.
>>>
>>> While we have a lot of storage, the idea of master/slave Solr
>>> Collection Distribution to add more Solr instances clearly means
>>> duplicating an immense index. Is it possible to use one instance to
>>> update the index on the NAS while other instances only read the index,
>>> committing just to keep their caches warm?
>>>
>>> Should we expect Solr indexing time to slow significantly as we scale
>>> up? What kind of query performance could we expect? Is it totally
>>> naive even to consider Solr at this kind of scale?
>>>
>>> Given these parameters, is it realistic to think that Solr could
>>> handle the task?
>>>
>>> Any advice/wisdom greatly appreciated,
>>>
>>> Phil

--
Marcus Herou
CTO and co-founder
Tailsweep AB
+46702561312
[EMAIL PROTECTED]
http://www.tailsweep.com/
http://blogg.tailsweep.com/