On Monday 23 May 2005 02:04, Arvind Srinivasan wrote:
> One Byte is Seven bits too many? - A Design suggestion
>
> Hi,
>
> The norm takes up 1 byte of storage per document per field. While this may seem
> very small, a simple calculation shows that the IndexSearcher can consume lots of
> memory when it caches the norms. Further, the current implementation loads up the
> norms in memory as soon as the segments gets loaded. Here are the calculations:
>
> For Medium sized archives
>   docs=40Million, Fields=2  => 80MB memory
>   docs=40Million, Fields=10 => 400MB memory
>   docs=40Million, Fields=20 => 800MB memory
>
> For larger sized archives
>
>   docs=400Million, Fields=2  => 800MB memory
>   docs=400Million, Fields=10 => ~4GB memory
>   docs=400Million, Fields=20 => ~8GB memory
>
>
> To further compound the issues, we have found JVM performance drops when the memory
> that it manages increases.
>
> While the storage itself may not be concern, the runtime memory requirement can use
> some optimization, especially for large number of fields.
> The fields itself may fall in one of 3 categories
>
> (a) Tokenized fields have huge variance in number of Tokens,
>     example - HTML page, Mail Body etc.
> (b) Tokenized fields with very little variance in number of token,
>     example - HTML Page Title, Mail Subject etc.
> (c) Fixed Tokenized Fields
>     example - Department, City, State etc.
>
>
> The one byte usage is very applicable for (a) and not for (b) or (c). In typical
> usage, field increases can be attributed to (b) and (c).
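
For reference, the quoted figures follow directly from one byte per document per field. A minimal sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes) and that every field carries a norm for every document; the class and method names are hypothetical and not part of Lucene:

```java
// Hypothetical helper, not part of Lucene: estimates the heap needed to
// cache norms at one byte per document per field.
final class NormMemoryEstimate {

    // Assumption: every field stores a norm for every document.
    static long normBytes(long docs, int fields) {
        return docs * fields; // 1 byte per (document, field) pair
    }

    public static void main(String[] args) {
        long[] docCounts = {40_000_000L, 400_000_000L};
        int[] fieldCounts = {2, 10, 20};
        for (long docs : docCounts) {
            for (int fields : fieldCounts) {
                long mb = normBytes(docs, fields) / 1_000_000L; // decimal MB
                System.out.println("docs=" + docs + ", fields=" + fields
                        + " => " + mb + " MB");
            }
        }
    }
}
```

The arithmetic is done in long because the larger cases exceed the range of a 32-bit int.
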

(c) would also be a nice fit for the recently discussed constant scoring queries.
For (b) the relative variance and its influence on the score are still high.
Perhaps a mixed form with a minimum field length in a single bit could be
considered there, but addressing that might be costly.

Regards,
Paul Elschot.