You could send the PDFs for processing using a queue solution like Amazon
SQS, and kick off Amazon EC2 instances to process the queue.
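
Roughly, on each worker instance, something like this (an untested sketch
with the AWS SDK for Java; the queue URL and the message body, assumed here
to be just a pointer to where the PDF lives, are my assumptions):

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.sqs.AmazonSQS;
    import com.amazonaws.services.sqs.AmazonSQSClient;
    import com.amazonaws.services.sqs.model.DeleteMessageRequest;
    import com.amazonaws.services.sqs.model.Message;
    import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

    public class PdfQueueWorker {
        public static void main(String[] args) throws Exception {
            AmazonSQS sqs = new AmazonSQSClient(
                    new BasicAWSCredentials(args[0], args[1])); // access key, secret key
            String queueUrl = args[2];                          // URL of your PDF queue

            while (true) {
                ReceiveMessageRequest req =
                        new ReceiveMessageRequest(queueUrl).withMaxNumberOfMessages(10);
                for (Message m : sqs.receiveMessage(req).getMessages()) {
                    String pdfLocation = m.getBody(); // assumed: path or URL of one PDF
                    // parse with Tika and push the text to Solr here (see below)

                    // delete only after the doc has made it into Solr
                    sqs.deleteMessage(new DeleteMessageRequest()
                            .withQueueUrl(queueUrl)
                            .withReceiptHandle(m.getReceiptHandle()));
                }
            }
        }
    }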

Once you've extracted the text with Tika, just send the update to Solr.
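
The Tika-to-SolrJ part of the worker could look roughly like this (a sketch
against Tika and SolrJ 3.x; the Solr URL and the "id"/"text" field names are
assumptions, so adjust them to your schema):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class PdfIndexer {

        public static void indexPdf(File pdf, String docId) throws Exception {
            // Extract plain text on the client so Solr never chews on the raw PDF
            BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no size limit
            Metadata metadata = new Metadata();
            InputStream in = new FileInputStream(pdf);
            try {
                new AutoDetectParser().parse(in, handler, metadata, new ParseContext());
            } finally {
                in.close();
            }

            // Send only the extracted text; SolrJ 3.x uses CommonsHttpSolrServer
            // (in real code, create this once and reuse it across documents)
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", docId);                 // assumed field names
            doc.addField("text", handler.toString());
            solr.add(doc);
            solr.commit();   // in production, batch docs and commit far less often
        }
    }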

Bill Bell
Sent from mobile


On Aug 13, 2011, at 10:13 AM, Erick Erickson <erickerick...@gmail.com> wrote:

> Yeah, parsing PDF files can be pretty resource-intensive, so one solution
> is to offload it somewhere else. You can use the Tika libraries in SolrJ
> to parse the PDFs on as many clients as you want, just transmitting the
> results to Solr for indexing.
> 
> How are all these docs being submitted? Is this some kind of
> on-the-fly indexing/searching or what? I'm mostly curious what
> your projected max ingestion rate is...
> 
> Best
> Erick
> 
> On Sat, Aug 13, 2011 at 4:49 AM, Rode Gonzalez (libnova)
> <r...@libnova.es> wrote:
>> Hi all,
>> 
>> I want to ask about the best way to implement a solution for indexing a
>> large number of PDF documents, 10-60 MB each, with 100 to 1000 users
>> connected simultaneously.
>> 
>> I currently have a single Solr 3.3.0 core and it works fine for a small
>> number of PDF docs, but I'm worried about what will happen when we go
>> into production.
>> 
>> some possibilities:
>> 
>> i. clustering. I have no experience with this, so it would be a bad idea
>> to venture into it.
>> 
>> ii. multicore solution: use some kind of hash to choose one core for each
>> query (exact queries), and thus reduce the size of the individual index to
>> consult, or query all the cores at the same time (complex queries).
>> 
>> iii. do nothing more and wait for a catastrophe in response times :P
>> 
>> 
>> Can someone with experience help me decide?
>> 
>> Thanks a lot in advance.
>> 
