Re: Seeing what's occupying all the space in the index

Grant Ingersoll Fri, 26 May 2006 10:37:41 -0700

It seems odd to me that if you are using the CFS format, why you wouldhave the .fdt, .frq and .prx files in addition to the .cfs files. Myunderstanding is all files (except deletable and segment) get put insideof the CFS file. Looking at my indices, I only have the CFS file. Areyou optimizing your indices after you are done indexing? Are youturning off compound file format?

Can you try a smaller sample in a clean directory and see what size itis (so that it doesn't take as long to index)?


Rob Staveley (Tom) wrote:

Is there anything I can learn from the index directory's file listing?


Running this nasty little BASH one-liner...

$ for i in `ls * | perl -nle 'if (/^.+(\..+)/) {print $1;}' | sort |
uniq`;do ls -l *$i | awk '{SUM = SUM + $5} END {if (SUM > 1e10) {print

"'$i': ", SUM}}'; done

... I see....

        .cfs:  1.23155e+10
        .fdt:  5.06108e+10
        .frq:  1.27472e+10
        .prx:  1.3444e+10

That means I have 98 GB of files, with:51 GB devoted to field data (.fdt),13 BG devoted to term positions (.prx)

        13 BG devoted to term frequencies (.frq)
        12 BG devoted to compound files for the field index (.cfs)

Does that seem reasonable, bearing in mind I have only indexed 4.3 million
Lucene documents? That's 22.8 kB per Lucene document, and apart from a 300
character synopsis the fields are all much less than 100 characters long,
and yet this suggests that the index is providing 600 bytes per field.

--

Grant IngersollSr. Software EngineerCenter for Natural Language ProcessingSyracuse UniversitySchool of Information Studies335 Hinds HallSyracuse, NY 13244http://www.cnlp.orgVoice: 315-443-5484Fax: 315-443-6886


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Seeing what's occupying all the space in the index

Reply via email to