Hi.
I will as well head into a path like yours within some months from
now.
Currently I have an index of ~10M docs and only store id's in the
index
for
performance and distribution reasons. When we enter a new market I'm
assuming we will soon hit 100M and quite soon after that 1G
documents.
Each
document have in average about 3-5k data.
We will use a GlusterFS installation with RAID1 (or RAID10) SATA
enclosures
as shared storage (think of it as a SAN or shared storage at
least, one
mount point). Hope this will be the right choice, only future can
tell.
Since we are developing a search engine I frankly don't think even
having
100's of SOLR instances serving the index will cut it performance
wise if
we
have one big index. I totally agree with the others claiming that
you most
definitely will go OOE or hit some other constraints of SOLR if
you must
have the whole result in memory sort it and create a xml response.
I did
hit
such constraints when I couldn't afford the instances to have enough
memory
and I had only 1M of docs back then. And think of it... Optimizing
a TB
index will take a long long time and you really want to have an
optimized
index if you want to reduce search time.
I am thinking of a sharding solution where I fragment the index
over the
disk(s) and let each SOLR instance only have little piece of the
total
index. This will require a master database or namenode (or simpler
just a
properties file in each index dir) of some sort to know what docs is
located
on which machine or at least how many docs each shard have. This
is to
ensure that whenever you introduce a new SOLR instance with a new
shard
the
master indexer will know what shard to prioritize. This is
probably not
enough either since all new docs will go to the new shard until it
is
filled
(have the same size as the others) only then will all shards
receive docs
in
a loadbalanced fashion. So whenever you want to add a new indexer
you
probably need to initiate a "stealing" process where it steals
docs from
the
others until it reaches some sort of threshold (10 servers = each
shard
should have 1/10 of the docs or such).
I think this will cut it and enabling us to grow with the data. I
think
doing a distributed reindexing will as well be a good thing when
it comes
to
cutting both indexing and optimizing speed. Probably each indexer
should
buffer it's shard locally on RAID1 SCSI disks, optimize it and
then just
copy it to the main index to minimize the burden of the shared
storage.
Let's say the indexing part will be all fancy and working i TB
scale now
we
come to searching. I personally believe after talking to other
guys which
have built big search engines that you need to introduce a
controller like
searcher on the client side which itself searches in all of the
shards and
merges the response. Perhaps Distributed Solr solves this and will
love to
test it whenever my new installation of servers and enclosures is
finished.
Currently my idea is something like this.
public Page<Document> search(SearchDocumentCommand sdc)
{
Set<Integer> ids = documentIndexers.keySet();
int nrOfSearchers = ids.size();
int totalItems = 0;
Page<Document> docs = new Page(sdc.getPage(),
sdc.getPageSize());
for (Iterator<Integer> iterator = ids.iterator();
iterator.hasNext();)
{
Integer id = iterator.next();
List<DocumentIndexer> indexers = documentIndexers.get(id);
DocumentIndexer indexer =
indexers.get(random.nextInt(indexers.size()));
SearchDocumentCommand sdc2 = copy(sdc);
sdc2.setPage(sdc.getPage()/nrOfSearchers);
Page<Document> res = indexer.search(sdc);
totalItems += res.getTotalItems();
docs.addAll(res);
}
if(sdc.getComparator() != null)
{
Collections.sort(docs, sdc.getComparator());
}
docs.setTotalItems(totalItems);
return docs;
}
This is my RaidedDocumentIndexer which wraps a set of
DocumentIndexers. I
switch from Solr to raw Lucene back and forth benchmarking and
comparing
stuff so I have two implementations of DocumentIndexer
(SolrDocumentIndexer
and LuceneDocumentIndexer) to make the switch easy.
I think this approach is quite OK but the paging stuff is broken I
think.
However the searching speed will at best be constant proportional
to the
number of searchers, probably a lot worse. To get even more speed
each
document indexer should be put into a separate thread with
something like
EDU.oswego.cs.dl.util.concurrent.FutureResult in cojunction with a
thread
pool. The Future result times out after let's say 750 msec and the
client
ignores all searchers which are slower. Probably some performance
metrics
should be gathered about each searcher so the client knows which
indexers
to
prefer over the others.
But of course if you have 50 searchers, having each client thread
spawn
yet
another 50 threads isn't a good thing either. So perhaps a combo of
iterative and parallell search needs to be done with the ratio
configurable.
The controller patterns is used by Google I think I think Peter
Zaitzev
(mysqlperformanceblog) once told me.
Hope I gave some insights in how I plan to scale to TB size and
hopefully
someone smacks me on my head and says "Hey dude do it like this
instead".
Kindly
//Marcus
Phillip Farber wrote:
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale
dataset
of OCR. Initially our requirements are simple: basic tokenizing,
score
sorting only, no faceting. The schema is simple too. A document
consists of a numeric id, stored and indexed and a large text
field,
indexed not stored, containing the OCR typically ~1.4Mb. Some
limited
faceting or additional metadata fields may be added later.
The data in question currently amounts to about 1.1Tb of OCR
(about 1M
docs) which we expect to increase to 10Tb over time. Pilot tests
on the
desktop w/ 2.6 GHz P4 with 2.5 Gb memory, java 1Gb heap on ~180
Mb of
data via HTTP suggest we can index at a rate sufficient to keep
up with
the inputs (after getting over the 1.1 Tb hump). We envision
nightly
commits/optimizes.
We expect to have low QPS (<10) rate and probably will not need
millisecond query response.
Our environment makes available Apache on blade servers (Dell
1955 dual
dual-core 3.x GHz Xeons w/ 8GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE
switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo packets). We are
starting
with 2 blades and will add as demands require.
While we have a lot of storage, the idea of master/slave Solr
Collection
Distribution to add more Solr instances clearly means duplicating
an
immense index. Is it possible to use one instance to update the
index
on NAS while other instances only read the index and commit to keep
their caches warm instead?
Should we expect Solr indexing time to slow significantly as we
scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?
Given these parameters is it realistic to think that Solr could
handle
the task?
Any advice/wisdom greatly appreciated,
Phil
--
View this message in context:
http://www.nabble.com/Solr-feasibility-with-terabyte-scale-data-tp14963703p17142176.html
Sent from the Solr - User mailing list archive at Nabble.com.