I'm using Solr compiled from a branch_4x checkout.
solr-impl 4.1-SNAPSHOT 1416639M - ncindex - 2012-12-03 12:54:38
I've noticed something really odd happening during DIH full-import of
millions of documents, and I'm wondering if it's a bug. The config bits
that I think may be relevant are below. If you'd like more information,
please let me know what you need and whether I should turn on
settings like infoStream and do another import:
Autocommit is set to maxDocs 65536 and maxTime 300000 (5 minutes).
ramBufferSizeMB is 100.
updateLog is enabled, no options.
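In solrconfig.xml terms, the three settings above correspond roughly to
the sketch below. It is reconstructed from the values listed, not pasted
from the actual file; the element placement assumes the stock 4.x example
config layout, and anything not mentioned above (openSearcher, for
example) is deliberately left out.

```xml
<!-- Sketch only: values come from the description above, layout is assumed. -->
<indexConfig>
  <!-- Flush the in-memory indexing buffer once it reaches 100 MB -->
  <ramBufferSizeMB>100</ramBufferSizeMB>
</indexConfig>

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Hard commit after 65536 added docs or 300000 ms, whichever comes first -->
    <maxDocs>65536</maxDocs>
    <maxTime>300000</maxTime>
  </autoCommit>

  <!-- Transaction log enabled with no options -->
  <updateLog />
</updateHandler>
```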
What's happening is that whenever it hits maxDocs, I get 2 segments,
one of them significantly smaller than the other. Rarely, it
creates 3 segments! I know it's not a ramBuffer problem, because
initially the exact same thing was happening with maxDocs at 100000 and
a 32MB ramBuffer. I raised the ramBuffer and lowered the maxDocs. It
takes significantly less than 5 minutes for maxDocs documents to get
indexed, so the maxTime value should not be a factor.
Sometimes the last segment is incomplete until the next autocommit,
consisting only of files like the following. On the next autocommit,
the incomplete segment is completed.
-rw-r--r-- 1 ncindex ncindex 411 Dec 3 14:22 _fu.si
-rw-r--r-- 1 ncindex ncindex 55966 Dec 3 14:22 _fu_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex 1983125 Dec 3 14:22 _fu_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex 1720492 Dec 3 14:22 _fu_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1384931 Dec 3 14:22 _fu_Lucene41_0.doc
Sometimes the last segment does get written completely before the next
autocommit. I have no idea what makes things happen differently sometimes:
-rw-r--r-- 1 ncindex ncindex 144497 Dec 3 14:16 _fq.tvx
-rw-r--r-- 1 ncindex ncindex 6106209 Dec 3 14:16 _fq.tvf
-rw-r--r-- 1 ncindex ncindex 18090 Dec 3 14:16 _fq.tvd
-rw-r--r-- 1 ncindex ncindex 411 Dec 3 14:16 _fq.si
-rw-r--r-- 1 ncindex ncindex 67683 Dec 3 14:16 _fq_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex 2431846 Dec 3 14:16 _fq_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex 2412246 Dec 3 14:16 _fq_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 1834286 Dec 3 14:16 _fq_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 1152 Dec 3 14:16 _fq.fdx
-rw-r--r-- 1 ncindex ncindex 2518453 Dec 3 14:16 _fq.fdt
The segments alternate in size: every other one is at least ten times as
large as the small ones in between. The large segment is written first.
Here's an example of a large segment. Both of the segment listings above
are from small segments:
-rw-r--r-- 1 ncindex ncindex 11289877 Dec 3 14:21 _ft.fdt
-rw-r--r-- 1 ncindex ncindex 7757 Dec 3 14:21 _ft.fdx
-rw-r--r-- 1 ncindex ncindex 3114 Dec 3 14:21 _ft.fnm
-rw-r--r-- 1 ncindex ncindex 8304619 Dec 3 14:21 _ft_Lucene41_0.doc
-rw-r--r-- 1 ncindex ncindex 9054058 Dec 3 14:21 _ft_Lucene41_0.pos
-rw-r--r-- 1 ncindex ncindex 9666900 Dec 3 14:21 _ft_Lucene41_0.tim
-rw-r--r-- 1 ncindex ncindex 244322 Dec 3 14:21 _ft_Lucene41_0.tip
-rw-r--r-- 1 ncindex ncindex 115 Dec 3 14:21 _ft_nrm.cfe
-rw-r--r-- 1 ncindex ncindex 170365 Dec 3 14:21 _ft_nrm.cfs
-rw-r--r-- 1 ncindex ncindex 411 Dec 3 14:21 _ft.si
-rw-r--r-- 1 ncindex ncindex 113554 Dec 3 14:21 _ft.tvd
-rw-r--r-- 1 ncindex ncindex 23374630 Dec 3 14:21 _ft.tvf
-rw-r--r-- 1 ncindex ncindex 908209 Dec 3 14:21 _ft.tvx
Thanks,
Shawn