Here’s a skeletal SolrJ program using Tika as another alternative.

Best,
Erick

> On Jun 7, 2020, at 2:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> You have to write an external application that creates multiple threads,
> parses the PDFs, and indexes them in Solr. Ideally you parse the PDFs once,
> store the resulting text on some file system, and then index it. The reason is
> that if you upgrade across two major versions of Solr you might need to
> reindex. Then you save time because you don’t need to parse the PDFs again.
> This is also useful if you are not yet sure about the final schema and
> need to index several times with different schemas, etc.
> 
> You can also use Apache ManifoldCF.
> 
> 
> 
>> On 07.06.2020 at 19:19, Fiz N <fiznewy...@gmail.com> wrote:
>> 
>> Hello Solr experts,
>> 
>> I am working on a POC to index millions of PDF documents stored in
>> multiple folders on a file share.
>> 
>> Could you please let me know the best practices and steps to implement it?
>> 
>> Thanks
>> Fiz Nadiyal.
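
The "parse once, store the text, index later" approach described in the quoted reply can be sketched roughly as follows, in plain Java with only the standard library. This is a sketch under assumptions, not anyone's actual program: the class name PdfTextCache, the method extractAll, the ".txt" suffix, and the extractor function are all hypothetical names chosen for illustration. In a real setup the extractor would wrap a PDF parser such as Tika, and a separate step would read the stored text files and send them to Solr (for example with SolrJ); neither is shown here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: extracts text from PDFs once and caches it on disk. */
final class PdfTextCache {

    /**
     * Walks sourceRoot, runs the extractor on every *.pdf file using a fixed
     * thread pool, and writes the text to the same relative path under
     * textRoot with ".txt" appended. Files whose cached text already exists
     * are skipped. Returns the number of text files written in this run.
     */
    static int extractAll(Path sourceRoot, Path textRoot, int threads,
                          Function<Path, String> extractor) throws Exception {
        List<Path> pdfs;
        try (Stream<Path> walk = Files.walk(sourceRoot)) {
            pdfs = walk.filter(Files::isRegularFile)
                       .filter(p -> p.getFileName().toString()
                                     .toLowerCase().endsWith(".pdf"))
                       .collect(Collectors.toList());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (Path pdf : pdfs) {
                Callable<Boolean> task =
                        () -> extractOne(sourceRoot, textRoot, pdf, extractor);
                results.add(pool.submit(task));
            }
            int written = 0;
            for (Future<Boolean> result : results) {
                if (result.get()) {   // rethrows any failure from the worker
                    written++;
                }
            }
            return written;
        } finally {
            pool.shutdown();
        }
    }

    /** Extracts one PDF; returns false if its text was already cached. */
    static boolean extractOne(Path sourceRoot, Path textRoot, Path pdf,
                              Function<Path, String> extractor) throws IOException {
        Path relative = sourceRoot.relativize(pdf);
        Path target = textRoot.resolve(relative.toString() + ".txt");
        if (Files.exists(target)) {
            return false;
        }
        Files.createDirectories(target.getParent());
        Files.write(target, extractor.apply(pdf).getBytes(StandardCharsets.UTF_8));
        return true;
    }
}
```

Because the extracted text is kept on disk, a later reindex (after a Solr upgrade or a schema change) only has to read the cached text files and can skip PDF parsing entirely. Running extractAll a second time over the same folders writes nothing new, since existing text files are skipped.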
