Hi Charles,

Yes, the threads I was referring to are on the client/indexer side: the thread count is one of the params for StreamingUpdateSolrServer. post.sh and post.jar are included only because they are handy. Don't use them for production.
It's impossible to tell how long indexing 100M documents may take. They could be very big or very small. You could perform very light or no analysis, or heavy analysis. They could contain 1 or 100 fields. :)

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Charles Wardell <charles.ward...@bcsolution.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, April 26, 2011 8:01:28 PM
> Subject: Re: Question on Batch process
>
> Thank you Otis.
> Without trying to appear too stupid, when you refer to having the params
> matching your # of CPU cores, you are talking about the # of threads I can
> spawn with the StreamingUpdateSolrServer object?
> Up until now, I have been just utilizing post.sh or post.jar. Are these
> capable of that or do I need to write some code to collect a bunch of files
> into the buffer and send it off?
>
> Also, do you have a sense for how long it should take to index 100,000 files
> or in my case 100,000,000 documents?
> StreamingUpdateSolrServer
> public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int
> threadCount) throws MalformedURLException
>
> Thanks again,
> Charlie
>
> --
> Best Regards,
>
> Charles Wardell
> Blue Chips Technology, Inc.
> www.bcsolution.com
>
> On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote:
> > Charlie,
> >
> > How's this:
> > * -Xmx2g
> > * ramBufferSizeMB 512
> > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
> > * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> > * use StreamingUpdateSolrServer (with params matching your number of CPU cores)
> >   or send batches of say 1000 docs with the other SolrServer impl using N threads
> >   (N=# of your CPU cores)
> >
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > ----- Original Message ----
> > > From: Charles Wardell <charles.ward...@bcsolution.com>
> > > To: solr-user@lucene.apache.org
> > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > Subject: Question on Batch process
> > >
> > > I am sure that this question has been asked a few times, but I can't seem to
> > > find the sweet spot for indexing.
> > >
> > > I have about 100,000 files each containing 1,000 xml documents ready to be
> > > posted to Solr. My desire is to have it index as quickly as possible and then
> > > once completed the daily stream of ADDs will be small in comparison.
> > >
> > > The individual documents are small. Essentially web postings from the net.
> > > Title, postPostContent, date.
> > >
> > > What would be the ideal configuration? For RamBufferSize, mergeFactor,
> > > MaxbufferedDocs, etc..
> > >
> > > My machine is a quad core hyper-threaded. So it shows up as 8 cpu's in TOP
> > > I have 16GB of available ram.
> > >
> > > Thanks in advance.
> > > Charlie
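
For anyone finding this thread later: the "batches of say 1000 docs using N threads" suggestion quoted above can be sketched roughly as follows. This is only an illustration using the JDK, not SolrJ code. BatchSender and BatchIndexer are hypothetical names made up for this sketch; in a real indexer the sender would wrap whatever SolrServer call you use to add documents, and the no-op sender in main() is a placeholder.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for whatever actually ships one batch to Solr. */
interface BatchSender {
    void send(List<String> docs) throws Exception;
}

class BatchIndexer {
    /**
     * Splits docs into batches of batchSize and sends them on a fixed pool
     * of threadCount threads. Returns the number of documents sent.
     */
    static long indexAll(List<String> docs, int batchSize, int threadCount,
                         BatchSender sender) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        AtomicLong sent = new AtomicLong();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<String> batch = docs.subList(start, end);
                // Each batch becomes one task; the pool limits concurrency.
                pending.add(pool.submit(() -> {
                    sender.send(batch);
                    sent.addAndGet(batch.size());
                    return null;
                }));
            }
            // Wait for every batch; a failed send surfaces here as an exception.
            for (Future<?> f : pending) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }

    public static void main(String[] args) throws Exception {
        // N = number of CPU cores as seen by the JVM (8 on a hyper-threaded quad core).
        int cores = Runtime.getRuntime().availableProcessors();
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("<doc>" + i + "</doc>");
        }
        // Placeholder sender that does nothing; replace with a real Solr call.
        long n = indexAll(docs, 1000, cores, batch -> { });
        System.out.println("sent " + n + " docs on " + cores + " threads");
    }
}
```

StreamingUpdateSolrServer takes care of the queueing and threading itself via the queueSize and threadCount constructor arguments shown earlier in the thread, so a hand-rolled pool like this only applies if you go the "other SolrServer impl" route.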