Hi Toke,

Thanks for sharing this experience; it is very useful for getting a first
overview of what I will need.
If I may summarize, I will:
- learn about Tika
- ask a lot of questions, like the frequency of adding/updating Solr data
- check the number of users
- check CPU/RAM/HDD
- run a first test with a representative sample

And of course get some good expertise :)

Thanks,
Bruno


-----Original Message-----
From: Toke Eskildsen [mailto:t...@kb.dk]
Sent: Saturday, June 22, 2019 11:36
To: solr_user lucene_apache
Subject: Re: Is Solr can do that ?

Matheo Software Info <i...@matheo-software.com> wrote:
> My question is very simple ☺ I would like to know if Solr can process
> around 30To of data (Pdf, Text, Word, etc…) ?

Simple answer: Yes. Assuming 30To means 30 terabytes.

> What is the best way to index this huge data ? several servers ?
> several shards ? other ?

As other participants have mentioned, it is hard to give numbers. What we can do 
is share experience.

We are doing web archive indexing, and I guess there would be quite an overlap 
with your content, as we also use Tika. One difference is that the images in a 
web archive are quite cheap to index, so you'll probably need (relatively) more 
hardware than we use. Very roughly, we used 40 CPU-years to index 600 (700? I 
forget) TB of data in one of our runs. Scaling to your 30TB, this suggests 
something like 2 CPU-years, or a couple of months for a 16-core machine.
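
For reference, the back-of-envelope arithmetic behind that estimate (taking my 
rough figure of 600 TB at face value):

  40 CPU-years / 600 TB ≈ 0.07 CPU-years per TB
  0.07 CPU-years/TB * 30 TB ≈ 2 CPU-years
  2 CPU-years / 16 cores ≈ 1.5 months of wall-clock time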

This is just to get a ballpark: you will do yourself a huge favor by building a 
test setup and processing 1 TB or so of your data to get _your_ numbers before 
you design your indexing setup. It is our experience that the analysis part 
(Tika) takes much more power than the Solr indexing part: in our last run we 
had 30-40 CPU-cores doing Tika (and related analysis) feeding into a Solr 
running on a 4-core machine with spinning drives.
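
To make that split concrete, here is a minimal sketch of the kind of pipeline 
involved (the URL, collection and field names are placeholders, not our actual 
code): Tika does the expensive text extraction, and the resulting plain text is 
then handed to Solr, which is comparatively cheap.

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class IndexOneFile {
  public static void main(String[] args) throws Exception {
    File file = new File(args[0]);

    // The expensive part: Tika detects the format (PDF, Word, ...) and
    // extracts plain text. Note that parseToString truncates very long
    // documents by default.
    String text = new Tika().parseToString(file);

    // The comparatively cheap part: send the extracted text to Solr.
    try (SolrClient solr =
         new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", file.getAbsolutePath());
      doc.addField("content", text);
      solr.add(doc);
      solr.commit();
    }
  }
}

In a real run you would of course batch the adds and commit far less often, but 
the proportions stay the same: nearly all the CPU time goes into the Tika call.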


As for the Solr setup for search, you need to describe your requirements in 
detail before we can give you suggestions. Is the index updated all the time, 
in batches, or one-off? How many concurrent users? Are the searches 
interactive or batch jobs? What kind of aggregations do you need?

In our setup we build separate collections that are merged down to single 
segments and never updated. Our use varies between very few interactive users 
and a lot of batch jobs. Scaling this specialized setup to your corpus size 
would require about 3TB of SSD, 64GB of RAM and 4 CPU-cores, divided among 4 
shards. You are likely to need quite a lot more than that, so this is just to 
say that at this scale, how the index is used matters _a lot_.
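
As an aside, merging a finished collection down to a single segment can be 
done with a standard SolrJ call; this is just one way to do it, and the 
collection name below is a placeholder:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class MergeToOneSegment {
  public static void main(String[] args) throws Exception {
    try (SolrClient solr =
         new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
      // Force-merge the finished collection down to a single segment:
      // waitFlush=true, waitSearcher=true, maxSegments=1
      solr.optimize("webarchive_2019", true, true, 1);
    }
  }
}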

- Toke Eskildsen

