On Nov 12, 2007 1:15 PM, J.J. Larrea <[EMAIL PROTECTED]> wrote:
>
> 2. Since the full document and its longer bibliographic subfields are
> being indexed but not stored, my guess is that the large size of the index
> segments is due to the inverted index rather than the stored data fields.
> But you can roughly verify by checking the size of the files in the index,
> with Luke's Files tab or simply an ls -l. For example .fdt files are stored
> data while .tis are the inverted index; see
> http://lucene.apache.org/java/docs/fileformats.html And if you have .cfs
> files...
OK here are some of the details of one index directory. The number of
indexed documents is approximately 1.5million.
$ ls -1 |wc -l
29542
$ du -hs .
90G .
$ // Show space used (in GB) by file extension.
$ for filetype in `ls -1 | sed -r "s/.*\.(.*)/\1/" | sort -u` ;\
do echo -n "filetype=$filetype: " ; (echo -n "scale=2; \
(" ; (for size in `ls -l *.$filetype| cut -c 24-34`; do \
echo -n "$size+"; done) ; echo "0)/10^9") | bc; done
filetype=fdt: .44
filetype=fdx: .01
filetype=fnm: .01
filetype=frq: 15.95
filetype=lock: 0
filetype=nrm: .03
filetype=prx: 70.94
filetype=tii: .12
filetype=tis: 8.79
So most of the space is occupied with .prx files.
Thanks
Barry