Hi Shawn,

Thanks a bunch for your feedback. I've played with the heap size, but I don't
see any improvement. Even if I index, say, a million docs at a throughput of
about 300 docs per second and then shut down Solr completely, after I start
indexing again the throughput drops below 300.
I should probably experiment with sharding those documents across multiple
Solr cores - that should help, I guess. I am talking about something like
this:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
I've put a rough sketch of what I have in mind at the bottom of this mail.

Thanks,
Angel

On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/21/2015 2:07 AM, Angel Todorov wrote:
> > I'm crawling a file system folder and indexing 10 million docs, and I am
> > adding them in batches of 5000, committing every 50,000 docs. The problem
> > I am facing is that after each commit, the number of documents indexed
> > per second gets lower and lower.
> >
> > If I do not commit at all, I can index those docs very quickly, and then
> > I commit once at the end, but once I start indexing docs _after_ that
> > (for example when new files get added to the folder), indexing also slows
> > down a lot.
> >
> > Is it normal that the Solr indexing speed depends on the number of
> > documents that are _already_ indexed? I think it shouldn't matter whether
> > I start from scratch or I index a document into a core that already has a
> > couple of million docs. It looks like Solr is either doing something in a
> > linear fashion, or there is some magic config parameter that I am not
> > aware of.
> >
> > I've read all the perf docs, and I've tried changing mergeFactor,
> > autowarmCounts, and the buffer sizes - to no avail.
> >
> > I am using Solr 5.1.
>
> Have you changed the heap size? If you use the bin/solr script to start
> it and don't change the heap size with the -m option or another method,
> Solr 5.1 runs with a default size of 512MB, which is *very* small.
>
> I bet you are running into problems with frequent and then ultimately
> constant garbage collection, as Java attempts to free up enough memory
> to allow the program to continue running. If that is what is happening,
> then eventually you will see an OutOfMemoryError exception. The
> solution is to increase the heap size. I would probably start with at
> least 4G for 10 million docs.
>
> Thanks,
> Shawn
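P.S. Here is a rough, untested sketch of the sharded setup and batch indexing
I have in mind, using SolrJ. The collection name, shard count, ZooKeeper
address and field names below are just placeholders, and I keep my current
pattern of 5000-doc batches with a single commit at the end:

// Sketch only - assumes a SolrCloud collection created with something like:
//   bin/solr create_collection -c filedocs -shards 4
// and ZooKeeper reachable at localhost:9983 (both are placeholder values).

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // CloudSolrClient should route each update batch to the right shard leader.
        CloudSolrClient client = new CloudSolrClient("localhost:9983");
        client.setDefaultCollection("filedocs");

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);                      // placeholder id
            doc.addField("path_s", "/data/files/" + i + ".txt"); // placeholder field

            batch.add(doc);
            // Send in batches of 5000, as I do now, without committing per batch.
            if (batch.size() == 5000) {
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }

        // One hard commit at the very end, which was fast for the initial bulk load.
        client.commit();
        client.close();
    }
}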