Hi Shawn,

Thanks a bunch for your feedback. I've played with the heap size, but I don't
see any improvement. Even if I index, say, a million docs at a throughput of
about 300 docs per second and then shut down Solr completely, after I start
indexing again the throughput drops below 300.
I should probably experiment with sharding those documents across multiple
Solr cores - that should help, I guess. I am talking about something like
this:
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
I've put a rough sketch of what I have in mind at the bottom of this mail.

Thanks,
Angel

On Thu, May 21, 2015 at 11:36 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 5/21/2015 2:07 AM, Angel Todorov wrote:
> > I'm crawling a file system folder and indexing 10 million docs, and I am
> > adding them in batches of 5000, committing every 50,000 docs. The problem
> > I am facing is that after each commit, the number of documents indexed
> > per second gets lower and lower.
> >
> > If I do not commit at all, I can index those docs very quickly, and then
> > I commit once at the end, but once I start indexing docs _after_ that
> > (for example when new files get added to the folder), indexing also slows
> > down a lot.
> >
> > Is it normal that the Solr indexing speed depends on the number of
> > documents that are _already_ indexed? I think it shouldn't matter whether
> > I start from scratch or I index a document into a core that already has a
> > couple of million docs. It looks like Solr is either doing something in a
> > linear fashion, or there is some magic config parameter that I am not
> > aware of.
> >
> > I've read all the perf docs, and I've tried changing mergeFactor,
> > autowarmCounts, and the buffer sizes - to no avail.
> >
> > I am using Solr 5.1.
>
> Have you changed the heap size? If you use the bin/solr script to start
> it and don't change the heap size with the -m option or another method,
> Solr 5.1 runs with a default size of 512MB, which is *very* small.
>
> I bet you are running into problems with frequent and then ultimately
> constant garbage collection, as Java attempts to free up enough memory
> to allow the program to continue running. If that is what is happening,
> then eventually you will see an OutOfMemoryError exception. The
> solution is to increase the heap size. I would probably start with at
> least 4G for 10 million docs.
>
> Thanks,
> Shawn
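P.S. Here is a rough, untested sketch of the sharded setup and batch indexing
I have in mind, using SolrJ. The collection name, shard count, ZooKeeper
address and field names below are just placeholders, and I keep my current
pattern of 5000-doc batches with a single commit at the end:

// Sketch only - assumes a SolrCloud collection created with something like:
//   bin/solr create_collection -c filedocs -shards 4
// and ZooKeeper reachable at localhost:9983 (both are placeholder values).

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
    public static void main(String[] args) throws Exception {
        // CloudSolrClient should route each update batch to the right shard leader.
        CloudSolrClient client = new CloudSolrClient("localhost:9983");
        client.setDefaultCollection("filedocs");

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);                      // placeholder id
            doc.addField("path_s", "/data/files/" + i + ".txt"); // placeholder field

            batch.add(doc);
            // Send in batches of 5000, as I do now, without committing per batch.
            if (batch.size() == 5000) {
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }

        // One hard commit at the very end, which was fast for the initial bulk load.
        client.commit();
        client.close();
    }
}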