I have a similar issue using DIH: org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand) consumes most of the time when indexing 10K rows (each row is about 70K).

- DIH nextRow takes about 10 seconds in total.
- If the index uses a whitespace tokenizer and lowercase filter, addDoc() takes about 80 seconds.
- If the index uses a whitespace tokenizer, lowercase filter and WDF, addDoc() takes about 112 seconds.
- If the index uses a whitespace tokenizer, lowercase filter, WDF and Porter stemmer, addDoc() takes about 145 seconds.

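To put the measurements above in per-document terms, here is a rough calculation. It is only a sketch: it assumes each addDoc() figure is the total over the same 10K rows, and it reads "70K" as roughly 70 KB per row, which the measurements do not state explicitly. The function names are made up for illustration.

```python
# Rough per-document cost of addDoc() for each analysis chain.
# ROWS and the timings are the measured totals; ROW_KB is an assumption
# (interpreting "about 70K" as about 70 KB per row).
ROWS = 10_000
ROW_KB = 70

total_seconds = {
    "whitespace + lowercase": 80,
    "whitespace + lowercase + WDF": 112,
    "whitespace + lowercase + WDF + porter": 145,
}


def per_doc_ms(seconds, rows=ROWS):
    """Average milliseconds spent in addDoc() per row."""
    return seconds * 1000 / rows


def throughput_mb_per_s(seconds, rows=ROWS, row_kb=ROW_KB):
    """Approximate megabytes of input analyzed per second."""
    return rows * row_kb / 1024 / seconds


previous = None
for chain, seconds in total_seconds.items():
    line = f"{chain}: {per_doc_ms(seconds):.1f} ms/doc, {throughput_mb_per_s(seconds):.1f} MB/s"
    if previous is not None:
        # Extra total time attributable to the filter added in this step.
        line += f" (+{seconds - previous} s vs. previous chain)"
    print(line)
    previous = seconds
```

On these assumptions the base chain costs about 8 ms per row, and WDF and the stemmer each add a little over 3 ms per row.
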
We have more than a million rows in total, and I am wondering whether I am doing something wrong, or whether there is any way to improve the performance of addDoc().

Thanks very much in advance!

Here is the configuration:

1) JVM: -Xms256M -Xmx1048M -XX:MaxPermSize=512m
2) Solr version: 3.5
3) solrconfig.xml (copied almost verbatim from Solr's example/solr directory):

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>10</mergeFactor>
  <!-- Sets the amount of RAM that may be used by Lucene indexing for
       buffering added documents and deletions before they are flushed
       to the Directory. -->
  <ramBufferSizeMB>64</ramBufferSizeMB>
  <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then Lucene
       will flush based on whichever limit is hit first. -->
  <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
  <maxFieldLength>2147483647</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>10000</commitLockTimeout>
  <lockType>native</lockType>
</indexDefaults>

2012/3/11 Peyman Faratin <pey...@robustlinks.com>

> Hi
>
> I am trying to index 12MM docs faster than is currently happening in Solr
> (using solrj). We have identified solr's add method as the bottleneck (and
> not commit - which is tuned ok through mergeFactor and maxRamBufferSize and
> jvm ram).
>
> Adding 1000 docs is taking approximately 25 seconds. We are making sure we
> add and commit in batches. And we've tried both CommonsHttpSolrServer and
> EmbeddedSolrServer (assuming removing http overhead would speed things up
> with embedding) but the differences is marginal.
>
> The docs being indexed are on average 20 fields long, mostly indexed but
> none stored. The major size contributors are two fields:
>
> - content, and
> - shingledContent (populated using copyField of content).
>
> The length of the content field is (likely) gaussian distributed (few
> large docs 50-80K tokens, but majority around 2k tokens).
> We use shingledContent to support phrase queries and content for unigram
> queries (following the advice of Solr Enterprise search server advice -
> p. 305, section "The Solution: Shingling").
>
> Clearly the size of the docs is a contributor to the slow adds (confirmed
> by removing these 2 fields resulting in halving the indexing time). We've
> tried compressed=true also but that is not working.
>
> Any guidance on how to support our application logic (without having to
> change the schema too much) and speed the indexing speed (from current 212
> days for 12MM docs) would be much appreciated.
>
> thank you
>
> Peyman
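
As a rough illustration of why a shingled copy of a large field adds so much analysis work, the sketch below builds word shingles by hand. It is plain Python, not Solr's shingle filter. The function name and the two-word shingle size are assumptions, since the thread does not say how shingledContent is configured.

```python
def shingles(tokens, size=2):
    """Return word n-grams ("shingles") of `size` adjacent tokens, space-joined."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


tokens = "the quick brown fox".split()
print(shingles(tokens))

# A document of about 2k tokens, like the typical one described above.
typical_doc = ["tok"] * 2000
print(len(shingles(typical_doc)))
```

With two-word shingles, a field of n tokens yields n - 1 shingles, so under this assumption shingledContent carries about as many terms again as content, and each term is longer.
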