On Tue, Jan 26, 2010 at 12:09:20AM -0600, Peter Karman wrote:
> Here's the test case.
Thanks for the hard work building this case.
> perl docmaker.pl \
> --utf_factor=0 \
> --write_files \
> --tmp_dir path/to/my/testdocs/ \
> --max_files 33000 \
> --max_words 3 \
> --tmp_dir_segments 2
I wonder whether this produces the same corpus on my OS X 10.5.8 MBPro as on
your system.
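One way to check would be to compare a digest of the generated files on both machines. Here is a minimal sketch, assuming docmaker.pl writes ordinary files under the --tmp_dir path and that file names are meant to match across systems; corpus_digest is a hypothetical helper, not part of docmaker.pl or KinoSearch:

```perl
use strict;
use warnings;
use File::Find qw(find);
use Digest::MD5;

# Hypothetical helper: compute one MD5 digest over every file under $dir.
# Files are visited in sorted path order so the result does not depend on
# the order in which the filesystem lists directory entries.
sub corpus_digest {
    my ($dir) = @_;
    my @files;
    find(
        {   wanted   => sub { push @files, $File::Find::name if -f },
            no_chdir => 1,
        },
        $dir
    );
    my $md5 = Digest::MD5->new;
    for my $path ( sort @files ) {
        # Mix in the path relative to $dir, so differing file names
        # change the digest as well as differing contents.
        ( my $rel = $path ) =~ s/^\Q$dir\E\/?//;
        $md5->add("$rel\0");
        open my $fh, '<:raw', $path or die "Can't open $path: $!";
        $md5->addfile($fh);
        close $fh;
    }
    return ( scalar @files, $md5->hexdigest );
}

# Usage: perl corpus_digest.pl path/to/my/testdocs/
if ( defined( my $dir = shift @ARGV ) ) {
    my ( $count, $digest ) = corpus_digest($dir);
    print "$count files, md5 $digest\n";
}
```

If the file count and digest match on both systems, the corpus is the same and the difference lies elsewhere.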
> there appears to be something magical in the *total number* of terms parsed.
Might have something to do with when runs are flushed.
> Here are some things I notice.
>
> 1) if I comment out the swishwordnum and swishdescription in parse_file()
> it works.
>
> 2) if I comment out the swishdescription alone, it fails.
>
> 3) if I comment out the swishwordnum alone, it fails.
I tried out all four possible combinations of commenting out swishwordnum and
swishdescription:

    swishdescription => "", # yes, empty
    swishwordnum => 0, # yes, zero

    #swishdescription => "", # yes, empty
    swishwordnum => 0, # yes, zero

    swishdescription => "", # yes, empty
    #swishwordnum => 0, # yes, zero

    #swishdescription => "", # yes, empty
    #swishwordnum => 0, # yes, zero

No matter what, I see the following output:
mar...@smokey:~/projects/ks/perl $ rm -rf test-ks-utf8/ ; perl -Mblib karpet_utf8_test.pl testdocs/
Crawled 33000 documents
mar...@smokey:~/projects/ks/perl $
Before we go further, what kind of system are you having trouble on? Is it a
64-bit box?
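
If it helps, the core Config module reports the relevant details. This is a small sketch rather than anything from the test case; platform_summary is a hypothetical helper name, and the %Config keys it reads are standard ones:

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: gather the build settings that distinguish a
# 32-bit perl from a 64-bit one.
sub platform_summary {
    return {
        osname      => $Config{osname},      # e.g. darwin, linux
        archname    => $Config{archname},    # architecture string perl was built for
        ivsize      => $Config{ivsize},      # bytes in a perl integer (IV)
        ptrsize     => $Config{ptrsize},     # bytes in a C pointer
        use64bitint => defined $Config{use64bitint} ? 1 : 0,
    };
}

my $info = platform_summary();
printf "perl %s on %s (%s)\n", $], $info->{osname}, $info->{archname};
printf "ivsize=%d ptrsize=%d use64bitint=%d\n",
    $info->{ivsize}, $info->{ptrsize}, $info->{use64bitint};
```

A ptrsize of 8 means perl itself was built as a 64-bit binary; an ivsize of 8 with a ptrsize of 4 means 64-bit integers on a 32-bit build.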
Marvin Humphrey