Thank you for your response. I have not written the StreamingUpdateSolrServer application yet, but I did change the other settings that you mentioned, and that gave me a huge boost in indexing speed. (I am still using post.sh but hope to change that soon.)
One thing I noticed is that indexing was incredibly fast last night, but today the commits are taking a very long time. Is this to be expected?

--
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Wednesday, April 27, 2011 at 6:15 PM, Otis Gospodnetic wrote:

> Hi Charles,
>
> Yes, the threads I was referring to are in the context of the client/indexer, so
> one of the params for StreamingUpdateSolrServer.
> post.sh/jar are just there because they are handy. Don't use them for
> production.
>
> It's impossible to tell how long indexing of 100M documents may take. They
> could be very big or very small. You could perform very light or no analysis or
> heavy analysis. They could contain 1 or 100 fields. :)
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
> > From: Charles Wardell <charles.ward...@bcsolution.com>
> > To: solr-user@lucene.apache.org
> > Sent: Tue, April 26, 2011 8:01:28 PM
> > Subject: Re: Question on Batch process
> >
> > Thank you Otis.
> > Without trying to appear too stupid, when you refer to having the params
> > matching your # of CPU cores, you are talking about the # of threads I can
> > spawn with the StreamingUpdateSolrServer object?
> > Up until now, I have been just utilizing post.sh or post.jar. Are these
> > capable of that or do I need to write some code to collect a bunch of files
> > into the buffer and send it off?
> >
> > Also, do you have a sense for how long it should take to index 100,000 files,
> > or in my case 100,000,000 documents?
> >
> > StreamingUpdateSolrServer
> > public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int
> > threadCount) throws MalformedURLException
> >
> > Thanks again,
> > Charlie
> >
> > --
> > Best Regards,
> >
> > Charles Wardell
> > Blue Chips Technology, Inc.
> > www.bcsolution.com
> >
> > On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote:
> > > Charlie,
> > >
> > > How's this:
> > > * -Xmx2g
> > > * ramBufferSizeMB 512
> > > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
> > > * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
> > > * use StreamingUpdateSolrServer (with params matching your number of CPU cores)
> > >   or send batches of say 1000 docs with the other SolrServer impl using N threads
> > >   (N=# of your CPU cores)
> > >
> > > Otis
> > > ----
> > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > > Lucene ecosystem search :: http://search-lucene.com/
> > >
> > > ----- Original Message ----
> > > > From: Charles Wardell <charles.ward...@bcsolution.com>
> > > > To: solr-user@lucene.apache.org
> > > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > > Subject: Question on Batch process
> > > >
> > > > I am sure that this question has been asked a few times, but I can't seem to
> > > > find the sweet spot for indexing.
> > > >
> > > > I have about 100,000 files, each containing 1,000 XML documents, ready to be
> > > > posted to Solr. My desire is to have it index as quickly as possible, and then
> > > > once completed the daily stream of ADDs will be small in comparison.
> > > >
> > > > The individual documents are small. Essentially web postings from the net.
> > > > Title, postPostContent, date.
> > > >
> > > > What would be the ideal configuration for ramBufferSizeMB, mergeFactor,
> > > > maxBufferedDocs, etc.?
> > > >
> > > > My machine is a quad core, hyper-threaded, so it shows up as 8 CPUs in top.
> > > > I have 16GB of available RAM.
> > > >
> > > > Thanks in advance.
> > > > Charlie
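
For anyone following the thread, the second option Otis mentions (batches of about 1000 docs sent from N threads, N = number of CPU cores) might look roughly like the sketch below. This is only an illustration using the plain JDK: the class name `BatchIndexer`, the methods `partition` and `indexAll`, and the `sender` callback are all made up for this example and are not part of SolrJ. The `sender` is a placeholder for wherever the real add call on your SolrServer implementation would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper, not a SolrJ class: batches documents and sends them from a thread pool.
public class BatchIndexer {

    /** Splits docs into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /**
     * Hands each batch to sender on a fixed pool of `threads` workers and
     * returns the number of documents whose batch was sent without an exception.
     */
    static <T> int indexAll(List<T> docs, int batchSize, int threads, Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger sent = new AtomicInteger();
        for (List<T> batch : partition(docs, batchSize)) {
            pool.submit(() -> {
                sender.accept(batch);          // assumption: the real Solr add call goes here
                sent.addAndGet(batch.size());
            });
        }
        pool.shutdown();                        // no new work; queued batches still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while indexing", e);
        }
        return sent.get();
    }

    public static void main(String[] args) {
        // N = number of CPU cores as reported by the JVM (8 on a hyper-threaded quad core).
        int threads = Runtime.getRuntime().availableProcessors();
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("doc-" + i);               // stand-in for real documents
        }
        int sent = indexAll(docs, 1000, threads, batch -> { /* placeholder sender */ });
        System.out.println("sent " + sent + " docs using " + threads + " threads");
    }
}
```

The sketch does not issue any commits. A single commit after all batches have been sent, rather than one per batch, is a common choice for bulk loads, but that is left to the caller.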