I've run into something a little odd that's been happening for a while.

The apparent symptoms: Two index segments are created every time an autoCommit (hard, not soft) happens during a DIH full-import.

Here's the directory listing from the first few minutes of importing, and a related INFOSTREAM:

http://apaste.info/22ue
https://dl.dropboxusercontent.com/u/97770508/INFOSTREAM-s1build.txt

The INFOSTREAM file has cruft from before, so if you search for "3g8" in the file, you'll be at the beginning of the relevant section.

I brought this up without resolution on the dev list last December. After some discussion in #solr-dev yesterday and some poking around with branch_4x, I think I might have figured out (at a high level) what's going on.

My 'ramBufferSizeMB' value is 48, and my autoCommit maxDocs is 25000. My documents probably tend to be 1-2 KB, with some a little larger than that.

Looking at the numDocs for each segment, here's what I think is happening:

The autoCommit kicks in after the first 25000 docs (25002 to be precise), but the RAM buffer isn't emptied. The next 3339 documents get indexed, at which point the RAM buffer fills up, so it flushes another segment. Then it indexes another 21674 docs to reach approximately 25000 for autoCommit, which forces another segment flush, but again without emptying the buffer. Lather, rinse, repeat.

Each pair of numDocs values after the initial 25002 does add up to approximately 25000.
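
To make the guess above concrete, here is a toy simulation of it. This is only a sketch of my hypothesis, not of what Lucene or Solr actually does internally. The function name and its parameters are made up for illustration, and the buffer capacity of 28341 docs is simply 25002 + 3339 taken from the numbers above. The model assumes that a commit writes a segment without resetting the RAM accounting, while a RAM-triggered flush writes a segment and does reset it.

```python
def simulate_segments(total_docs, commit_every, ram_capacity_docs):
    """Toy model of the hypothesis (hypothetical helper, not a Solr/Lucene API).

    Returns a list of (trigger, num_docs) tuples, one per segment written.
    """
    segments = []
    docs_since_segment = 0   # docs not yet written to any segment
    docs_since_commit = 0    # counter that drives autoCommit maxDocs
    ram_accounting = 0       # ASSUMPTION: NOT reset when a commit happens

    for _ in range(total_docs):
        docs_since_segment += 1
        docs_since_commit += 1
        ram_accounting += 1

        # RAM buffer "fills up": flush a segment and reset the accounting.
        if ram_accounting >= ram_capacity_docs:
            segments.append(("ram", docs_since_segment))
            docs_since_segment = 0
            ram_accounting = 0

        # autoCommit: writes a segment but leaves ram_accounting untouched.
        if docs_since_commit >= commit_every:
            if docs_since_segment > 0:
                segments.append(("commit", docs_since_segment))
            docs_since_segment = 0
            docs_since_commit = 0

    return segments


if __name__ == "__main__":
    for trigger, num_docs in simulate_segments(75000, 25000, 28341):
        print(trigger, num_docs)
```

Under these assumptions the model writes two segments per commit interval after the first, and each pair adds up to the commit size, which is the same shape as the numDocs pattern in the directory listing. It does not explain why the real counts are slightly off from 25000.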

If I'm right about what's happening here, then here's the big question: Should the RAM buffer be emptied when autoCommit triggers? I think that it should, but can it be done without drastically affecting performance? I haven't looked at the code yet, and I expect that it'll take me forever to understand it well enough to figure out if I'm right or wrong.
