What Sam said. 

Here’s something to get you started on how and why it’s better to run Tika 
on the client side rather than shipping the docs to Solr and having the 
ExtractingRequestHandler do the extraction on the Solr node: 
https://lucidworks.com/2012/02/14/indexing-with-solrj/
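
In case it’s useful, here’s a rough sketch of that client-side approach
with SolrJ + Tika (untested; the URL, collection name, and field names
are placeholders you’d adapt to your own schema):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaSolrIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and collection -- point this at your own cluster.
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                AutoDetectParser parser = new AutoDetectParser();
                Path file = Paths.get(args[0]);

                // -1 lifts Tika's default 100k-character write limit for big docs.
                BodyContentHandler handler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                try (InputStream in = Files.newInputStream(file)) {
                    parser.parse(in, handler, metadata, new ParseContext());
                }

                // Field names ("id", "title", "text") are assumptions -- match your schema.
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.toString());
                doc.addField("title", metadata.get("title"));
                doc.addField("text", handler.toString());

                solr.add(doc);
                solr.commit(); // in real use, batch the adds and commit far less often
            }
        }
    }

The point of doing it this way is that the CPU-heavy extraction happens on
your client machines, not on the Solr nodes, and you can run as many of
these in parallel as your cluster will absorb.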

Best,
Erick

> On Jun 21, 2019, at 9:56 AM, Samuel Kasimalla <skasima...@gmail.com> wrote:
> 
> Hi Bruno,
> 
> Assuming you meant 30TB, the first step is to use the Tika parser to
> convert the rich documents into plain text.
> 
> We need to know the number of documents; the unofficial word on the street
> is about 50 million documents per shard. Of course, a lot of parameters are
> involved in this, so it's a simple question but the answer is not so simple :).
> 
> Hope this helps.
> 
> Thanks
> Sam
> https://www.linkedin.com/in/skasimalla/
> 
> On Fri, Jun 21, 2019 at 12:49 PM Matheo Software Info <
> i...@matheo-software.com> wrote:
> 
>> Dear Solr User,
>> 
>> My question is very simple :) I would like to know if Solr can process
>> around 30 TB of data (PDF, text, Word, etc.)?
>> 
>> What is the best way to index this huge amount of data? Several servers?
>> Several shards? Something else?
>> 
>> Many thanks for your information,
>> 
>> Best Regards,
>> 
>> Bruno Mannina
>> 
>> www.matheo-software.com
>> 
>> www.patent-pulse.com
>> 
>> Tel. +33 0 970 738 743
>> 
>> Mob. +33 0 634 421 817
>> 
