Re: Question on Batch process

Otis Gospodnetic Wed, 27 Apr 2011 15:15:57 -0700

Hi Charles,

Yes, the threads I was referring to are in the context of the client/indexer,
so it's one of the params for StreamingUpdateSolrServer.
post.sh/jar are just there because they are handy.  Don't use them for
production.
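
For the other option mentioned earlier in the thread (batches of say 1000 docs
sent from N threads, N = # of CPU cores), here is a rough sketch of the client
side. It is only an illustration, not SolrJ code: BatchSender, BatchIndexer and
indexAll are made-up names, and BatchSender stands in for whatever call actually
ships a batch to Solr. It assumes Java 8+ and uses only the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the client call that sends one batch to Solr.
interface BatchSender {
    void send(List<String> batch) throws Exception;
}

final class BatchIndexer {
    // Splits docs into batches of batchSize and sends them from a pool of
    // threadCount worker threads. Returns the number of batches sent.
    static int indexAll(List<String> docs, int batchSize, int threadCount, BatchSender sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                int end = Math.min(start + batchSize, docs.size());
                List<String> batch = docs.subList(start, end);
                pending.add(pool.submit(() -> { sender.send(batch); return null; }));
            }
            // Wait for every batch and surface the first failure, if any.
            for (Future<?> f : pending) {
                try {
                    f.get();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while indexing", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("batch failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
        return pending.size();
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors(); // N = # of CPU cores
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            docs.add("doc-" + i); // placeholder documents
        }
        AtomicInteger sent = new AtomicInteger();
        // The lambda only counts documents; a real sender would post them to Solr.
        int batches = indexAll(docs, 1000, cores, batch -> sent.addAndGet(batch.size()));
        System.out.println(batches + " batches, " + sent.get() + " docs");
    }
}
```

With StreamingUpdateSolrServer you would not need this, since its queueSize and
threadCount params cover the queueing and threading for you.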


It's impossible to tell how long indexing of 100M documents may take.  They
could be very big or very small.  You could perform very light or no analysis
or heavy analysis.  They could contain 1 or 100 fields. :)
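
One way to get a rough number anyway is to time a small sample of your own
documents with your own schema and extrapolate. A sketch of that arithmetic is
below. IndexingEstimate and estimateHours are made-up names, and the figures in
main are placeholders, not measurements. It also assumes throughput stays flat
as the index grows, which may not hold once segment merges kick in.

```java
final class IndexingEstimate {
    // Extrapolates total indexing time (in hours) from a timed sample run,
    // assuming the sample's docs/second rate holds for the whole set.
    static double estimateHours(long totalDocs, long sampleDocs, double sampleSeconds) {
        if (sampleDocs <= 0 || sampleSeconds <= 0) {
            throw new IllegalArgumentException("sample must be non-empty and take measurable time");
        }
        double docsPerSecond = sampleDocs / sampleSeconds;
        return totalDocs / docsPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // Placeholder inputs: replace with what you actually measure on your setup.
        long totalDocs = 100_000_000L;
        long sampleDocs = 100_000L;
        double sampleSeconds = 50.0;
        double hours = estimateHours(totalDocs, sampleDocs, sampleSeconds);
        System.out.println("Estimated hours: " + hours);
    }
}
```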

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Charles Wardell <charles.ward...@bcsolution.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, April 26, 2011 8:01:28 PM
> Subject: Re: Question on Batch process
> 
> Thank you Otis.
> Without trying to appear too stupid, when you refer to having the params
> matching your # of CPU cores, you are talking about the # of threads I can
> spawn with the StreamingUpdateSolrServer object?
> Up until now, I have been just utilizing post.sh or post.jar. Are these
> capable of that or do I need to write some code to collect a bunch of files
> into the buffer and send it off?
> 
> Also, do you have a sense for how long it should take to index 100,000 files,
> or in my case 100,000,000 documents?
> StreamingUpdateSolrServer
> public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int threadCount)
>     throws MalformedURLException
> 
> Thanks again,
> Charlie
> 
> -- 
> Best Regards,
> 
> Charles Wardell
> Blue Chips Technology, Inc.
> www.bcsolution.com
> 
> On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote:
> > Charlie,
> > 
> > How's this:
> > * -Xmx2g
> > * ramBufferSizeMB 512
> > * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
> > * ignore/delete maxBufferedDocs - not used if you use ramBufferSizeMB
> > * use StreamingUpdateSolrServer (with params matching your number of CPU cores)
> >   or send batches of say 1000 docs with the other SolrServer impl using N threads
> >   (N=# of your CPU cores)
> > 
> > Otis
> > ----
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> > 
> > 
> > 
> > ----- Original Message ----
> > > From: Charles Wardell <charles.ward...@bcsolution.com>
> > > To: solr-user@lucene.apache.org
> > > Sent: Tue, April 26, 2011 2:32:29 PM
> > > Subject: Question on Batch process
> > > 
> > > I am sure that this question has been asked a few times, but I can't seem to
> > > find the sweet spot for indexing.
> > > 
> > > I have about 100,000 files each containing 1,000 xml documents ready to be
> > > posted to Solr. My desire is to have it index as quickly as possible and then
> > > once completed the daily stream of ADDs will be small in comparison.
> > > 
> > > The individual documents are small. Essentially web postings from the net.
> > > Title, postPostContent, date.
> > > 
> > > 
> > > What would be the ideal configuration? For RamBufferSize, mergeFactor,
> > > MaxbufferedDocs, etc..
> > > 
> > > My machine is a quad core hyper-threaded. So it shows up as 8 cpu's in TOP
> > > I have 16GB of available ram.
> > > 
> > > 
> > > Thanks in advance.
> > > Charlie
> > 
>
