On Nov 12, 2007 1:15 PM, J.J. Larrea <[EMAIL PROTECTED]> wrote:
>
> 2. Since the full document and its longer bibliographic subfields are
> being indexed but not stored, my guess is that the large size of the index
> segments is due to the inverted index rather than the stored data fields.
> But you can roughly verify by checking the size of the files in the index,
> with Luke's Files tab or simply an ls -l. For example .fdt files are stored
> data while .tis are the inverted index; see
> http://lucene.apache.org/java/docs/fileformats.html And if you have .cfs
> files...

OK, here are some of the details of one index directory. The number of
indexed documents is approximately 1.5 million.

$ ls -1 |wc -l
29542
$ du -hs .
90G .
$ # Show space used (in GB) by file extension.
$ for filetype in `ls -1 | sed -r "s/.*\.(.*)/\1/" | sort -u` ;\
do echo -n "filetype=$filetype: " ; (echo -n "scale=2; \
(" ; (for size in `ls -l *.$filetype| cut -c 24-34`; do \
echo -n "$size+"; done) ; echo "0)/10^9") | bc; done
filetype=fdt: .44
filetype=fdx: .01
filetype=fnm: .01
filetype=frq: 15.95
filetype=lock: 0
filetype=nrm: .03
filetype=prx: 70.94
filetype=tii: .12
filetype=tis: 8.79

So most of the space is occupied by the .prx files.

Thanks
Barry
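
A note on the size loop: the cut -c 24-34 column range depends on how wide ls -l prints the owner, group and size columns on a given machine, so it may pick up the wrong characters elsewhere. Below is a rough sketch that reads the size from the fifth whitespace-separated field instead. It assumes a POSIX-style ls -l and file names without spaces; sizes_by_ext and its optional second argument (bytes per unit) are made-up names for illustration, not part of Lucene or Luke.

```sh
#!/bin/sh
# Hypothetical helper: sum file sizes per extension in a directory.
# $1 = directory (default: current directory)
# $2 = bytes per output unit (default: 10^9, i.e. GB as in the loop above)
sizes_by_ext() {
    # Keep regular files only (mode string starts with "-") whose name
    # contains a dot; field 5 of "ls -l" is the size in bytes.
    LC_ALL=C ls -l "${1:-.}" |
    awk -v unit="${2:-1000000000}" '
        $1 ~ /^-/ && $NF ~ /\./ {
            n = split($NF, part, ".")      # extension = text after the last dot
            bytes[part[n]] += $5
        }
        END {
            for (ext in bytes)
                printf "filetype=%s: %.2f\n", ext, bytes[ext] / unit
        }' |
    sort
}
```

Run from the index directory as `sizes_by_ext .` for sizes in GB, or `sizes_by_ext . 1` for raw byte totals. Files with no dot in the name are skipped.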