are you by any chance using different field names for each document -- or do you have a wide range of field names that aren't the same for each document? ... you mentioned indexing emails; email has a very loose header structure that allows MTAs to add arbitrary "X" headers. are you converting every header to an indexed field?
the reason i ask is that anytime you have an indexed field and you don't OMIT_NORMS when you add the field, lucene allocates one byte per document for that field to store the norm value -- even if most of those documents don't have that field. if you've got ~25,000 documents in your index, and you add a new document with an indexed field no other document has, then you'll see at least a 25K increase in index size because of those norms. (OMIT_NORMS is your friend).

: Date: Fri, 26 May 2006 15:50:43 +0100
: From: "Rob Staveley (Tom)" <[EMAIL PROTECTED]>
: Reply-To: java-user@lucene.apache.org
: To: java-user@lucene.apache.org
: Subject: RE: Seeing what's occupying all the space in the index
:
: > I can't see how Luke is going to show me what's occupying most of my
: index.
:
: I do however notice that none of my stored fields are stored compressed.
: Presumably Field.Store COMPRESS is something that is new in Lucene 1.9 and
: wasn't available in 1.4.3?? However, it is still hard to see what's causing
: the index to grow by 25kB with each document I index.
:
: Is there anything I can learn from the index directory's file listing?
: Here's the top of file listing in descending order of size from the index
: directory:
:
: --------8<--------
: [EMAIL PROTECTED]:~/dat/indexd/index-1$ ls -lS | more
: total 95374800
: -rw-r--r-- 1 rob dev 4449248724 May 26 00:32 _2fyfe.cfs
: -rw-r--r-- 1 rob dev 2522952122 May 26 14:17 _2k5vi.fdt
: -rw-r--r-- 1 rob dev 2413775516 May 26 01:16 _2g6l3.fdt
: -rw-r--r-- 1 rob dev 2368881846 May 25 18:14 _2ehe7.fdt
: -rw-r--r-- 1 rob dev 2344670598 May 25 16:31 _2dn69.fdt
: -rw-r--r-- 1 rob dev 2324315860 May 25 14:10 _2cxea.fdt
: -rw-r--r-- 1 rob dev 2259070168 May 25 10:28 _2aeb0.fdt
: -rw-r--r-- 1 rob dev 2113143078 May 24 13:39 _24xxv.fdt
: -rw-r--r-- 1 rob dev 2005876265 May 23 16:47 _21143.fdt
: -rw-r--r-- 1 rob dev 1994658169 May 23 16:09 _20mgq.fdt
: -rw-r--r-- 1 rob dev 1991402285 May 23 14:21 _20id9.fdt
: -rw-r--r-- 1 rob dev 1973739578 May 23 12:17 _1zvpr.fdt
: -rw-r--r-- 1 rob dev 1964392156 May 23 11:14 _1zjra.fdt
: -rw-r--r-- 1 rob dev 1957195484 May 23 10:27 _1zanx.fdt
: -rw-r--r-- 1 rob dev 1940435968 May 12 13:58 _1374v.cfs
: -rw-r--r-- 1 rob dev 1932876050 May 23 08:34 _1yep1.fdt
: -rw-r--r-- 1 rob dev 1908759860 May 22 22:26 _1xhs9.fdt
: -rw-r--r-- 1 rob dev 1862224271 May 22 14:24 _1vxc8.fdt
: -rw-r--r-- 1 rob dev 1775652952 May 21 19:15 _1slqr.fdt
: -rw-r--r-- 1 rob dev 1674336305 May 20 10:10 _1oaje.fdt
: -rw-r--r-- 1 rob dev 1641906176 May 17 12:24 _1f3lp.cfs
: -rw-r--r-- 1 rob dev 1626645390 May 19 16:39 _1mjaa.fdt
: -rw-r--r-- 1 rob dev 1412089155 May 17 12:21 _1f3lp.fdt
: -rw-r--r-- 1 rob dev 1400399872 May 19 16:51 _1mjaa.cfs
: -rw-r--r-- 1 rob dev 1090611130 May 12 13:51 _1374v.fdt
: -rw-r--r-- 1 rob dev 1059463168 May 16 09:24 _1azb3.fdt
: -rw-r--r-- 1 rob dev 1052419072 May 11 16:36 _1168u.cfs
: -rw-r--r-- 1 rob dev 1034522679 May 11 16:34 _1168u.fdt
: -rw-r--r-- 1 rob dev 1023033344 May 23 14:51 _20m4q.fdt
: -rw-r--r-- 1 rob dev 857224192 May 4 18:16 _lluq.cfs
: -rw-r--r-- 1 rob dev 821198047 May 26 14:21 _2k5vi.prx
: -rw-r--r-- 1 rob dev 808694510 May 8 14:09 _sunn.fdt
: -rw-r--r-- 1 rob dev 786164503 May 26 01:26 _2g6l3.prx
: -rw-r--r-- 1 rob dev 780030917 May 26 14:21 _2k5vi.frq
: -rw-r--r-- 1 rob dev 772484883 May 25 18:19 _2ehe7.prx
: -rw-r--r-- 1 rob dev 763351845 May 25 16:41 _2dn69.prx
: -rw-r--r-- 1 rob dev 755794097 May 25 14:14 _2cxea.prx
: -rw-r--r-- 1 rob dev 745972375 May 26 01:26 _2g6l3.frq
: -rw-r--r-- 1 rob dev 732753582 May 25 10:45 _2aeb0.prx
: -rw-r--r-- 1 rob dev 732496935 May 25 18:19 _2ehe7.frq
: -rw-r--r-- 1 rob dev 724428884 May 25 16:41 _2dn69.frq
: -rw-r--r-- 1 rob dev 719733760 May 25 09:49 _29vyk.fdt
: -rw-r--r-- 1 rob dev 717613127 May 25 14:14 _2cxea.frq
: -rw-r--r-- 1 rob dev 696849854 May 25 10:45 _2aeb0.frq
: -rw-r--r-- 1 rob dev 686227498 May 24 13:59 _24xxv.prx
: --------8<--------
:

-Hoss
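
For what it's worth, the norms arithmetic described at the top of this message can be sketched in a few lines of Java. This is only a back-of-the-envelope estimate, not Lucene API code: the class name NormsEstimate, the method normBytes, and the figure of 40 distinct header fields are made up for illustration. Only the ~25,000 document count and the one-byte-per-document-per-field rule come from the message itself.

```java
// Rough estimate only; this does not call into Lucene.
class NormsEstimate {
    // One byte per document for every indexed field that keeps norms,
    // whether or not a given document actually contains that field.
    static long normBytes(long numDocs, long fieldsWithNorms) {
        return numDocs * fieldsWithNorms;
    }

    public static void main(String[] args) {
        long docs = 25_000L;         // document count from the example in the message
        long distinctHeaders = 40L;  // hypothetical number of distinct header fields

        // Cost of a single field that no other document has.
        System.out.println(normBytes(docs, 1L));
        // Cost if every distinct header became its own indexed field.
        System.out.println(normBytes(docs, distinctHeaders));
    }
}
```

With 25,000 documents, one previously unseen field works out to about 25 KB of norms, in line with the "at least a 25K increase" mentioned above. The cost scales with the number of distinct field names, so a loose header structure can add up quickly.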