Marvin Humphrey wrote on 1/27/10 6:41 PM:
On Tue, Jan 26, 2010 at 07:15:16PM -0800, Marvin Humphrey wrote:
Yup, I've now duplicated the problem on my system using 60,000 docs.
Fixed by r5764.
cool. thanks for digging in.
I have tested it under RHEL (works great with ~90k docs, 2g of data) and OSX
10.6 (where it fails, see below), both 64-bit arch.
The OSX behaviour was weird. First time it segfaulted. Ran it again under gdb
and it completed ok. Ran it again without gdb and I got this:
[kar...@pekmac:~/tmp]$ perl ks-test.pl swishdocs2/
Crawled 1000000 documents
Read past EOF of
'/Volumes/users/karpet/tmp/test-ks-utf8/seg_2/ptemp-4284913-to-4383411' (offset:
4284913 len: 98498), S_refill at ../core/KinoSearch/Store/InStream.c line 145
at ks-test.pl line 65
Using same test script as I posted before, with 1m docs instead of 33k.
I bet I can get that way down by fiddling with the flush threshold.
Ultimately, I was able to isolate the trigger to a single document with two fields, by
bringing the threshold at which PostingListWriter flushes all of its
PostingPools way, way down:
-#define DEFAULT_MEM_THRESH 0x1000000
+/* #define DEFAULT_MEM_THRESH 0x1000000 */
+#define DEFAULT_MEM_THRESH 0x10
When that variable lived in Perl, the KinoSearch::Test module used to set it
to a much smaller number at load time. This had the effect of simulating
large indexes as far as PostingListWriter was concerned, by forcing runs to be
flushed many many times. However, it turns out that we have been doing
without that important simulation for a long time -- the entire KS test suite
was not triggering a PostingPool flush even once. I'm a little surprised that
after all the refactoring I did on this code recently, there was only a single
glitch that needed to be fixed.
Now even if I set the threshold to 0x100, the whole test suite passes.
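
To illustrate why a tiny threshold simulates a large index, here is a toy sketch in C. It is not the KinoSearch code: ToyWriter, ToyWriter_add and count_flushes are made-up names, and the "flush when buffered bytes reach the threshold" rule is an assumption about the general shape of the logic, not a copy of PostingListWriter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a writer that buffers postings in RAM.
 * Nothing here is taken from the real PostingListWriter. */
typedef struct {
    size_t mem_thresh;   /* flush once buffered bytes reach this value */
    size_t buffered;     /* bytes currently held in memory */
    size_t flush_count;  /* number of runs flushed so far */
} ToyWriter;

static void
ToyWriter_init(ToyWriter *self, size_t mem_thresh) {
    assert(mem_thresh > 0);
    self->mem_thresh  = mem_thresh;
    self->buffered    = 0;
    self->flush_count = 0;
}

/* Buffer one posting of `size` bytes; flush if the threshold is reached. */
static void
ToyWriter_add(ToyWriter *self, size_t size) {
    self->buffered += size;
    if (self->buffered >= self->mem_thresh) {
        /* A real writer would sort the buffered postings and write a run. */
        self->flush_count++;
        self->buffered = 0;
    }
}

/* Feed `n` postings of `size` bytes each and report the number of flushes. */
static size_t
count_flushes(size_t mem_thresh, size_t n, size_t size) {
    ToyWriter writer;
    size_t i;
    ToyWriter_init(&writer, mem_thresh);
    for (i = 0; i < n; i++) {
        ToyWriter_add(&writer, size);
    }
    return writer.flush_count;
}
```

With 1000 postings of 32 bytes each, count_flushes(0x1000000, 1000, 32) never flushes, which is the situation a small test suite ends up in with the default. The same input flushes 125 times at 0x100 and on every posting at 0x10.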
this is good and interesting to know. Is there any way, or any plan, to make
DEFAULT_MEM_THRESH alterable at runtime? I'm assuming that in situations where
available ram is low, it would be helpful to trade off speed for memory by
setting the threshold lower and flushing to disk more often. Is that a realistic
assumption?
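
To make the question concrete, something along these lines is what I have in mind. This is only a sketch: the TOY_MEM_THRESH environment variable and both function names are invented for illustration and do not exist in KinoSearch. Only DEFAULT_MEM_THRESH and its 0x1000000 value come from the diff above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEFAULT_MEM_THRESH 0x1000000

/* Parse a threshold given as decimal, octal or hex text (e.g. "0x100").
 * Falls back to the compiled-in default on empty, negative, zero,
 * out-of-range or malformed input. Hypothetical helper. */
static size_t
parse_mem_thresh(const char *text) {
    char *end = NULL;
    unsigned long long value;

    if (text == NULL || *text == '\0' || strchr(text, '-') != NULL) {
        return DEFAULT_MEM_THRESH;
    }
    errno = 0;
    value = strtoull(text, &end, 0);
    if (errno != 0 || *end != '\0' || value == 0 || value > SIZE_MAX) {
        return DEFAULT_MEM_THRESH;
    }
    return (size_t)value;
}

/* Read the override from the (made-up) TOY_MEM_THRESH variable. */
static size_t
mem_thresh_from_env(void) {
    size_t thresh = parse_mem_thresh(getenv("TOY_MEM_THRESH"));
    assert(thresh > 0);
    return thresh;
}
```

An environment variable is just the simplest thing to sketch; a constructor argument or a setter on the indexer would serve the same purpose.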
--
Peter Karman . http://peknet.com/ . [email protected]