Thanks Erick.

On 1/13/16, 10:55 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:

>My first thought is "yes, you're overthinking it" ;)....
>
>Here's something to get you started for indexing
>through a Java program:
>https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
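>
>Here's a minimal sketch of that (untested; assumes SolrJ 5.x, a
>ZooKeeper ensemble at localhost:2181, and a collection named
>"mycollection" -- all placeholders, adjust for your cluster):
>
>    import org.apache.solr.client.solrj.impl.CloudSolrClient;
>    import org.apache.solr.common.SolrInputDocument;
>
>    public class SimpleIndexer {
>      public static void main(String[] args) throws Exception {
>        // zkHost and collection name are placeholders for your cluster
>        CloudSolrClient client = new CloudSolrClient("localhost:2181");
>        client.setDefaultCollection("mycollection");
>
>        SolrInputDocument doc = new SolrInputDocument();
>        doc.addField("id", "doc-1");
>        doc.addField("title", "Hello Solr");
>
>        client.add(doc);
>        client.commit(); // for bulk loads, commit once at the end instead
>        client.close();
>      }
>    }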
>
>Of course you _could_ use Lucene to build your indexes
>and just copy them "to the right place", but there are
>a number of ways that can go wrong; here are a few:
>1> if you have shards, you'd have to mimic the automatic
>routing.
>2> you have to mimic the analysis chain you've defined for
>each field in Solr.
>3> you have to copy the built Lucene indexes to the right shard
>(assuming you got <1> right).
>
>Depending on the docs in question, if they need Tika parsing
>you can do that simply in SolrJ too, see:
>https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>(this is a bit outdated; in particular, a couple of
>class names have changed).
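>
>If you go that route, a rough sketch (untested; assumes Tika's
>AutoDetectParser, the same placeholder cluster/collection as above,
>and that your schema has "id", "title", and "text" fields):
>
>    import java.io.File;
>    import java.io.FileInputStream;
>    import java.io.InputStream;
>    import org.apache.solr.client.solrj.impl.CloudSolrClient;
>    import org.apache.solr.common.SolrInputDocument;
>    import org.apache.tika.metadata.Metadata;
>    import org.apache.tika.parser.AutoDetectParser;
>    import org.apache.tika.sax.BodyContentHandler;
>
>    public class TikaIndexer {
>      public static void main(String[] args) throws Exception {
>        CloudSolrClient client = new CloudSolrClient("localhost:2181");
>        client.setDefaultCollection("mycollection");
>
>        File file = new File(args[0]);
>        BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
>        Metadata metadata = new Metadata();
>        try (InputStream in = new FileInputStream(file)) {
>          // let Tika detect the file type and extract body text + metadata
>          new AutoDetectParser().parse(in, handler, metadata);
>        }
>
>        SolrInputDocument doc = new SolrInputDocument();
>        doc.addField("id", file.getAbsolutePath());
>        doc.addField("title", metadata.get("title"));
>        doc.addField("text", handler.toString());
>
>        client.add(doc);
>        client.commit();
>        client.close();
>      }
>    }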
>
>SolrJ uses an efficient binary format to move the docs. I regularly
>get 20K docs/second on my local setup, see:
>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>I was indexing 11M Wiki articles in about 10 minutes on some tests
>recently. Solr can scale that close to linearly with more shards and
>enough indexing clients. Is it really worth the effort of using Lucene?
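>
>The batching part boils down to something like this (again a sketch,
>untested, same assumed cluster/collection as above):
>
>    import java.util.ArrayList;
>    import java.util.List;
>    import org.apache.solr.client.solrj.impl.CloudSolrClient;
>    import org.apache.solr.common.SolrInputDocument;
>
>    public class BatchIndexer {
>      public static void main(String[] args) throws Exception {
>        CloudSolrClient client = new CloudSolrClient("localhost:2181");
>        client.setDefaultCollection("mycollection");
>
>        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
>        for (int i = 0; i < 1000000; i++) {
>          SolrInputDocument doc = new SolrInputDocument();
>          doc.addField("id", "doc-" + i);
>          doc.addField("text", "body of document " + i);
>          batch.add(doc);
>          if (batch.size() == 1000) { // send batches, not single docs
>            client.add(batch);
>            batch.clear();
>          }
>        }
>        if (!batch.isEmpty()) client.add(batch); // flush the last partial batch
>        client.commit(); // one commit at the end of the run
>        client.close();
>      }
>    }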
>
>FWIW,
>Erick
>
>
>
>On Wed, Jan 13, 2016 at 10:19 AM, Shivaji Dutta <sdu...@hortonworks.com>
>wrote:
>> Erick and Shawn
>>
>> Thanks for the input. In the process below we are posting the
>> documents to Solr over an HTTP connection in batches.
>>
>> I am trying to solve the same problem, but in a different way:
>>
>> I have used Lucene back in the day, where I would index documents
>> locally on disk and run search queries on them. Big fan of Lucene.
>>
>> I was wondering if something like that is possible here.
>>
>> If I have a repository of millions of documents, would it not make
>> sense to just index them locally with Lucene and then copy the index
>> files over to Solr and have it read from them?
>>
>> Any thoughts or blogs that could help me, or maybe I am overthinking
>> this?
>>
>> Thanks
>> Shivaji
>>
>>
>> On 1/13/16, 9:12 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>
>>>It's usually not all that difficult to write a multi-threaded
>>>client that uses CloudSolrClient, or even fire up multiple
>>>instances of the SolrJ client (assuming they can work
>>>on discrete sections of the documents you need to index).
>>>
>>>That avoids the problem Shawn alludes to, plus other
>>>issues. If you do _not_ use CloudSolrClient, then all the
>>>docs go to some node in the system (and you really should
>>>update in batches, see:
>>>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/).
>>>The node that receives the packet then sub-divides it
>>>into groups based on what shard each doc should be part of
>>>and forwards them to the leaders for those shards, very
>>>significantly increasing the number of conversations
>>>being carried on between Solr nodes. Multiply that by the
>>>number of threads you're specifying with CUSC (I really
>>>regret the renaming from ConcurrentUpdateSolrServer; I
>>>liked writing CUSS).
>>>
>>>With CloudSolrClient, you can scale nearly linearly with
>>>the number of shards. Not so with CUSC.
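>>>
>>>Something like this, very schematically (untested; assumes SolrJ
>>>5.x, a ZK ensemble at localhost:2181, and a collection named
>>>"mycollection", both placeholders):
>>>
>>>    import java.util.ArrayList;
>>>    import java.util.List;
>>>    import java.util.concurrent.ExecutorService;
>>>    import java.util.concurrent.Executors;
>>>    import java.util.concurrent.TimeUnit;
>>>    import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>>    import org.apache.solr.common.SolrInputDocument;
>>>
>>>    public class ThreadedIndexer {
>>>      public static void main(String[] args) throws Exception {
>>>        final CloudSolrClient client = new CloudSolrClient("localhost:2181");
>>>        client.setDefaultCollection("mycollection");
>>>
>>>        final int numThreads = 8; // tune to your hardware
>>>        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
>>>        for (int t = 0; t < numThreads; t++) {
>>>          final int slice = t;
>>>          pool.submit(new Runnable() {
>>>            public void run() {
>>>              try {
>>>                List<SolrInputDocument> batch =
>>>                    new ArrayList<SolrInputDocument>(1000);
>>>                // each thread handles its own discrete slice of the ids
>>>                for (int i = slice; i < 1000000; i += numThreads) {
>>>                  SolrInputDocument doc = new SolrInputDocument();
>>>                  doc.addField("id", "doc-" + i);
>>>                  batch.add(doc);
>>>                  if (batch.size() == 1000) {
>>>                    client.add(batch); // sharing one client is fine here
>>>                    batch.clear();
>>>                  }
>>>                }
>>>                if (!batch.isEmpty()) client.add(batch);
>>>              } catch (Exception e) {
>>>                e.printStackTrace(); // unlike CUSC, failures are visible
>>>              }
>>>            }
>>>          });
>>>        }
>>>        pool.shutdown();
>>>        pool.awaitTermination(1, TimeUnit.HOURS);
>>>        client.commit();
>>>        client.close();
>>>      }
>>>    }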
>>>
>>>FWIW,
>>>Erick
>>>
>>>On Tue, Jan 12, 2016 at 8:06 PM, Shawn Heisey <apa...@elyograg.org>
>>>wrote:
>>>> On 1/12/2016 7:42 PM, Shivaji Dutta wrote:
>>>>> ConcurrentUpdateSolrClient lets me use a queue and a pool of
>>>>>threads, which makes it more attractive than CloudSolrClient, which
>>>>>uses an HttpSolrClient once it gets a set of nodes to do the updates.
>>>>>
>>>>> What is the recommended API for updating large numbers of documents
>>>>>at a higher throughput rate?
>>>>
>>>> ConcurrentUpdateSolrClient has one flaw -- it swallows all exceptions
>>>> that happen during indexing.  Your application will never know about
>>>> any problems that occur during indexing.  The entire cluster could be
>>>> down, and your application would never know about it until you tried
>>>> an explicit commit operation.  Commit is an operation that is not
>>>> handled in the background by CUSC, so I would expect any exception to
>>>> be returned for that operation.
>>>>
>>>> This flaw is inherent to its design; the behavior would be very
>>>> difficult to change.
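>>>>
>>>> If you must use CUSC for a bulk load, one partial mitigation (a
>>>> sketch, untested, assuming the SolrJ 5.x constructor and a
>>>> placeholder URL) is to subclass it and override handleError() so
>>>> failures are at least logged, though the error still never gets
>>>> back to the code that called add():
>>>>
>>>>     import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
>>>>
>>>>     ConcurrentUpdateSolrClient client =
>>>>         new ConcurrentUpdateSolrClient(
>>>>             "http://localhost:8983/solr/mycollection", // placeholder URL
>>>>             10,   // request queue size
>>>>             4) {  // background indexing threads
>>>>           @Override
>>>>           public void handleError(Throwable ex) {
>>>>             // default implementation just logs and drops the error
>>>>             System.err.println("Indexing request failed: " + ex);
>>>>           }
>>>>         };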
>>>>
>>>> If you don't care about your application getting error messages when
>>>> indexing requests fail, then CUSC is perfect.  This might be the case
>>>> if you are doing initial bulk loading.  For normal index updates
>>>> after initial loading, you would not want to use CUSC.
>>>>
>>>> If you do care about getting error messages when bulk indexing
>>>> requests fail, then you'll want to build a program with
>>>> CloudSolrClient where you create multiple indexing threads that all
>>>> use the same client object.
>>>>
>>>> Thanks,
>>>> Shawn
>>>>
>>>
>>
>
