I did exactly this in my custom build of Lucene, since an array holding one byte per document is extremely wasteful in a lot of applications. I just changed the code to return null from getNorms() and modified the callers to treat a null array as a norm of 1 for every document.
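A rough sketch of the caller-side rule, for illustration only. The NormSource interface, the decode() helper, and the normBytes() estimate are hypothetical stand-ins rather than Lucene's actual classes, and the byte-to-float decoding here is deliberately simplified, so treat this as a sketch under those assumptions and not as the real patch.

```java
public class NullNormsSketch {

    /** Hypothetical stand-in for the reader-side accessor; not Lucene's actual API. */
    interface NormSource {
        /** Returns one encoded byte per document, or null if the field keeps no norms. */
        byte[] getNorms(String field);
    }

    /** Simplified decoder for illustration only; Lucene's real byte encoding differs. */
    public static float decode(byte b) {
        return (b & 0xFF) / 128.0f;
    }

    /** Caller-side rule: a null array means a norm of 1 for every document. */
    public static float normFor(byte[] norms, int doc) {
        return norms == null ? 1.0f : decode(norms[doc]);
    }

    /** Memory held by fully loaded norms: one byte per document per indexed field. */
    public static long normBytes(int indexedFields, int maxDoc) {
        return (long) indexedFields * maxDoc;
    }

    public static void main(String[] args) {
        // Only "body" keeps a norms array; every other field returns null.
        NormSource source = field ->
                field.equals("body") ? new byte[] {(byte) 128, (byte) 64} : null;

        System.out.println(normFor(source.getNorms("body"), 1));
        System.out.println(normFor(source.getNorms("price"), 1));

        // The document count is an assumed figure, chosen only to show the scale.
        System.out.println(normBytes(8000, 1_000_000) + " bytes");
    }
}
```

The point of returning null instead of an array filled with a constant is that nothing is allocated at all for fields that do not need length normalization; the cost is that every caller has to apply the null check.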
-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Friday, October 07, 2005 4:18 PM
To: java-dev@lucene.apache.org
Subject: Eliminating norms ... completely

Yonik and I have been looking at the memory requirements of an application we've got. We use a lot of indexed fields, primarily so I can do a lot of numeric tests (using RangeFilter). When I say "a lot" I mean around 8,000 -- many of which are not used by all documents in the index.

Now there are some basic usage changes I can make to cut this number in half, and some more complex biz rule changes I can make to get the number down some more (at the expense of flexibility), but even then we'd have around 1,000 -- which is still a lot more than the recommended "handful".

After discussing some options, I asked the question "Remind me again why having lots of indexed fields makes the memory requirements jump up -- even if only a few documents use some field?" and Yonik reminded me about the norm[] -- an array of bytes representing the field boost + length boost for each document. One of these arrays exists for every indexed field.

So then I asked the $50,000,000 question: "Is there any way to get rid of this array for certain fields? ... or any way to get rid of it completely for every field in a specific index?"

This may sound like a silly question for most IR applications, where you want length normalization to contribute to your scores, but in this particular case most of these fields are only used to store a single numeric value. To be certain, there are some fields we have (or may add in the future) that could benefit from having a norms[] ... but if it had to be an all-or-nothing thing we could certainly live without them.

It seems to me that in an ideal world, deciding whether or not you wanted to store norms for a field would be like deciding whether you wanted to store TermVectors for a field. I can imagine a Field.isNormStored() method ... but that seems like a pretty significant change to the existing code base.

Alternately, I started wondering if it would be possible to write our own IndexReader/IndexWriter subclasses that would ignore the norm info completely (with maybe an optional list of field names the logic should be limited to), and return nothing but fixed values for any parts of the code base that wanted them. Looking at SegmentReader and MultiReader, this looked very promising (especially considering the way SegmentReader uses a system property to decide which actual class to use). But I was less enthusiastic when I started looking at IndexWriter and the DocumentWriter classes .... there doesn't seem to be any clean way to subclass the existing code base to eliminate the writing of the norms to the Directory (curse those final classes, and private final methods).

So I'm curious what you guys think...

1) Regarding the root problem: are there any other things you can think of besides norms[] that would contribute to the memory footprint needed by a large number of indexed fields?

2) Can you think of a clean way for individual applications to eliminate norms (via subclassing the Lucene code base -- i.e., no patching)?

3) Yonik is currently looking into what kind of patch it would take to optionally turn off norms (I'm not sure if he's looking at doing it "per field" or "per index"). Is that the kind of thing that would even be considered for getting committed?

--
-------------------------------------------------------------------
"Oh, you're a tricky one."        Chris M Hostetter
-- Trisha Weir                    [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]