Thanks Erick.

On 1/13/16, 10:55 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>My first thought is "yes, you're overthinking it" ;)....
>
>Here's something to get you started for indexing
>through a Java program:
>https://cwiki.apache.org/confluence/display/solr/Using+SolrJ
>
>Of course you _could_ use Lucene to build your indexes
>and just copy them "to the right place", but there are
>a number of ways that can go wrong; here are a couple:
>1> if you have shards, you'd have to mimic the automatic
>routing.
>2> you have to mimic the analysis chain you've defined for
>each field in Solr.
>3> you have to copy the built Lucene indexes to the right shard
>(assuming you got <1> right).
>
>Depending on the docs in question, if they need Tika parsing
>you can do that simply in SolrJ too, see:
>https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
>(this is a bit outdated; a couple of class names have changed
>in particular).
>
>SolrJ uses an efficient binary format to move the docs. I regularly
>get 20K docs/second on my local setup, see:
>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
>I was indexing 11M Wiki articles in about 10 minutes in some tests
>recently. Solr can scale that close to linearly with more shards and
>enough indexing clients. Is it really worth the effort of using Lucene?
>
>FWIW,
>Erick
>
>
>
>On Wed, Jan 13, 2016 at 10:19 AM, Shivaji Dutta <sdu...@hortonworks.com>
>wrote:
>> Erick and Shawn,
>>
>> Thanks for the input. In the process below we are posting the documents
>> to Solr over an HTTP connection in batches.
>>
>> Trying to solve the same problem, but in a different way:
>>
>> I have used Lucene back in the day, where I would index the documents
>> locally on disk and run search queries on them. Big fan of Lucene.
>>
>> I was wondering if there is any possibility like that.
>>
>> If I have a repository of millions of documents, would it not make sense
>> to just index them locally and then copy the index files over to Solr
>> and have it read from them?
>>
>> Any thoughts or blogs that could help me, or maybe I am overthinking
>> this?
>>
>> Thanks
>> Shivaji
>>
>>
>> On 1/13/16, 9:12 AM, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>
>>>It's usually not all that difficult to write a multi-threaded
>>>client that uses CloudSolrClient, or even fire up multiple
>>>instances of the SolrJ client (assuming they can work
>>>on discrete sections of the documents you need to index).
>>>
>>>That avoids the problem Shawn alludes to, plus other
>>>issues. If you do _not_ use CloudSolrClient, then all the
>>>docs go to some node in the system (and you really should
>>>update in batches, see:
>>>https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/).
>>>The node that receives the packet sub-divides it
>>>into groups based on what shard they should be part of
>>>and forwards them to the leaders for those shards, very
>>>significantly increasing the number of conversations
>>>being carried on between Solr nodes. Times the number
>>>of threads you're specifying with CUSC (I really regret
>>>the renaming from ConcurrentUpdateSolrServer, I liked
>>>writing CUSS).
>>>
>>>With CloudSolrClient, you can scale nearly linearly with
>>>the number of shards. Not so with CUSC.
>>>
>>>FWIW,
>>>Erick
>>>
>>>On Tue, Jan 12, 2016 at 8:06 PM, Shawn Heisey <apa...@elyograg.org>
>>>wrote:
>>>> On 1/12/2016 7:42 PM, Shivaji Dutta wrote:
>>>>> Now since with ConcurrentUpdateSolrClient I am able to use a queue
>>>>> and a pool of threads, it is more attractive to use than
>>>>> CloudSolrClient, which will use an HttpSolrClient once it gets a set
>>>>> of nodes to do the updates.
>>>>>
>>>>> What is the recommended API for updating large amounts of documents
>>>>> with a higher throughput rate?
>>>>
>>>> ConcurrentUpdateSolrClient has one flaw -- it swallows all exceptions
>>>> that happen during indexing. Your application will never know about
>>>> any problems that occur during indexing. The entire cluster could be
>>>> down, and your application would never know about it until you tried
>>>> an explicit commit operation. Commit is an operation that is not
>>>> handled in the background by CUSC, so I would expect any exception to
>>>> be returned for that operation.
>>>>
>>>> This flaw is inherent to its design; the behavior would be very
>>>> difficult to change.
>>>>
>>>> If you don't care about your application getting error messages when
>>>> indexing requests fail, then CUSC is perfect. This might be the case
>>>> if you are doing initial bulk loading. For normal index updates after
>>>> initial loading, you would not want to use CUSC.
>>>>
>>>> If you do care about getting error messages when bulk indexing
>>>> requests fail, then you'll want to build a program with
>>>> CloudSolrClient where you create multiple indexing threads that all
>>>> use the same client object.
>>>>
>>>> Thanks,
>>>> Shawn
>>>>
>>>
>>
>
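To make Erick's batching advice concrete, here is a minimal SolrJ sketch of
the CloudSolrClient approach. The ZooKeeper address, collection name, field
names, and batch size are illustrative only, and the single-string
constructor shown is the SolrJ 5.x form (later releases moved to a builder):

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // zkHost and collection are placeholders for your own cluster
            CloudSolrClient client =
                new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
            client.setDefaultCollection("mycollection");

            List<SolrInputDocument> batch = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("title_t", "document " + i);
                batch.add(doc);

                // send 1,000 docs per request rather than one at a time;
                // this is the batching point from the blog post above
                if (batch.size() == 1000) {
                    client.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.add(batch);
            }
            client.commit();
            client.close();
        }
    }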
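For the Tika case Erick mentions, the linked post parses documents in the
client and sends only the extracted fields to Solr. A rough sketch of that
idea, assuming Tika's AutoDetectParser and a SolrClient built as above
("text" is an assumed field name, not anything from the thread):

    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaIndexer {
        // parse one file locally with Tika, then index the extracted text
        static void indexFile(SolrClient client, String file) throws Exception {
            AutoDetectParser parser = new AutoDetectParser();
            BodyContentHandler handler = new BodyContentHandler(-1); // no size limit
            Metadata metadata = new Metadata();
            Path path = Paths.get(file);
            try (InputStream in = Files.newInputStream(path)) {
                parser.parse(in, handler, metadata);
            }
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", path.toAbsolutePath().toString());
            doc.addField("text", handler.toString());
            client.add(doc);
        }
    }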
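And for Shawn's point about error visibility: one way to get both throughput
and per-request exceptions is several indexing threads sharing a single
CloudSolrClient, which is thread-safe. The queue, thread count, and timeout
below are assumptions, not anything prescribed in the thread:

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ThreadedIndexer {
        // drain batches from a shared queue; every thread reuses one client
        public static void index(CloudSolrClient client,
                                 BlockingQueue<List<SolrInputDocument>> batches,
                                 int numThreads) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(numThreads);
            for (int t = 0; t < numThreads; t++) {
                pool.submit(() -> {
                    try {
                        List<SolrInputDocument> batch;
                        // stop once the producer has been quiet for a while
                        while ((batch = batches.poll(10, TimeUnit.SECONDS)) != null) {
                            try {
                                client.add(batch); // routed to the shard leaders
                            } catch (Exception e) {
                                // unlike ConcurrentUpdateSolrClient, the failure
                                // surfaces here per request and can be logged or retried
                                System.err.println("Batch failed: " + e);
                            }
                        }
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }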