Erick and Shawn,

Thanks for the input. In the process below we are posting the documents to
Solr over an HTTP connection in batches.
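For context, a minimal sketch of the batching side of that process is below. It uses only the JDK's HttpClient rather than SolrJ, and the update URL, collection name, and batch size are placeholders for illustration, not our actual settings:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class BatchPoster {

    // Split the full document list into fixed-size batches.
    static List<List<String>> partition(List<String> docs, int size) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += size) {
            batches.add(docs.subList(i, Math.min(i + size, docs.size())));
        }
        return batches;
    }

    // Join individual JSON documents into one JSON-array update payload.
    static String toJsonArray(List<String> jsonDocs) {
        return "[" + String.join(",", jsonDocs) + "]";
    }

    // POST one batch to Solr's JSON update endpoint; commits are issued
    // separately.  HttpClient is thread-safe, so several indexing threads
    // can share a single instance.
    static void postBatch(HttpClient client, String updateUrl, List<String> batch)
            throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(updateUrl))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJsonArray(batch)))
                .build();
        HttpResponse<String> resp = client.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() != 200) {
            throw new RuntimeException("update failed: " + resp.body());
        }
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("{\"id\":\"" + i + "\"}");
        }
        // 2500 docs in batches of 1000 -> 3 batches (1000, 1000, 500).
        List<List<String>> batches = partition(docs, 1000);
        System.out.println("batches: " + batches.size());
        // Each batch would then go to something like:
        //   postBatch(client, "http://localhost:8983/solr/mycoll/update", batch);
    }
}
```

In practice we would more likely do this through SolrJ clients, which is what the discussion quoted further down is about.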

I am trying to solve the same problem, but in a different way:

I have used Lucene back in the day, when I would index documents locally
on disk and run search queries on them. Big fan of Lucene.

I was wondering whether something like that is possible here.

If I have a repository of millions of documents, would it not make sense
to index them locally and then copy the index files over to Solr and
have it read from them?
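For what it's worth, one concrete shape this could take is Solr's CoreAdmin MERGEINDEXES action, which folds an on-disk Lucene index directory into an existing core (assuming the locally built index uses a schema compatible with the core's). The host, core name, and path below are placeholders:

```shell
#!/bin/sh
# Placeholder values: a real run needs Solr up with the target core loaded.
SOLR_URL="http://localhost:8983/solr"
CORE="core0"
LOCAL_INDEX="/var/data/local-index"   # directory written by an offline Lucene IndexWriter

# CoreAdmin MERGEINDEXES request that merges the local index into the core.
REQUEST="$SOLR_URL/admin/cores?action=MERGEINDEXES&core=$CORE&indexDir=$LOCAL_INDEX"
echo "$REQUEST"
# To actually run it (and then commit on the core):
#   curl "$REQUEST"
#   curl "$SOLR_URL/$CORE/update?commit=true"
```

In SolrCloud this gets trickier, since each shard holds only part of the collection, which may be part of why batched HTTP updates are the usual recommendation.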

Any thoughts or blogs that could help me? Or maybe I am overthinking
this?

Thanks
Shivaji


On 1/13/16, 9:12 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>It's usually not all that difficult to write a multi-threaded
>client that uses CloudSolrClient, or even to fire up multiple
>instances of the SolrJ client (assuming they can work
>on discrete sections of the documents you need to index).
>
>That avoids the problem Shawn alludes to, plus other
>issues. If you do _not_ use CloudSolrClient, all the
>docs go to a single node in the cluster (and you really
>should update in batches, see:
>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/).
>The node that receives the packet then sub-divides it
>into groups based on which shard each doc belongs to
>and forwards them to the leaders for those shards, very
>significantly increasing the number of conversations
>carried on between Solr nodes. Multiply that by the
>number of threads you're specifying with CUSC (I really
>regret the renaming from ConcurrentUpdateSolrServer; I
>liked writing CUSS).
>
>With CloudSolrClient, you can scale nearly linearly with
>the number of shards. Not so with CUSC.
>
>FWIW,
>Erick
>
>On Tue, Jan 12, 2016 at 8:06 PM, Shawn Heisey <apa...@elyograg.org> wrote:
>> On 1/12/2016 7:42 PM, Shivaji Dutta wrote:
>>> With ConcurrentUpdateSolrClient I am able to use a queue and a pool
>>> of threads, which makes it more attractive to use than CloudSolrClient,
>>> which will use an HttpSolrClient once it gets a set of nodes to do the
>>> updates.
>>>
>>> What is the recommended API for updating large numbers of documents
>>> at a higher throughput rate?
>>
>> ConcurrentUpdateSolrClient has one flaw -- it swallows all exceptions
>> that happen during indexing.  Your application will never know about any
>> problems that occur during indexing.  The entire cluster could be down,
>> and your application would never know about it until you tried an
>> explicit commit operation.  Commit is an operation that is not handled
>> in the background by CUSC, so I would expect any exception to be
>> returned for that operation.
>>
>> This flaw is inherent to its design; the behavior would be very
>> difficult to change.
>>
>> If you don't care about your application getting error messages when
>> indexing requests fail, then CUSC is perfect.  This might be the case if
>> you are doing initial bulk loading.  For normal index updates after
>> initial loading, you would not want to use CUSC.
>>
>> If you do care about getting error messages when bulk indexing requests
>> fail, then you'll want to build a program with CloudSolrClient where you
>> create multiple indexing threads that all use the same client object.
>>
>> Thanks,
>> Shawn
>>
>
