I did exactly this in my customized Lucene, since the array of one byte per
document is extremely wasteful in a lot of applications. I just changed the
code to return null from getNorms() and modified the callers to treat a null
array as a norm of 1 for every document.
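
For anyone who wants to see the shape of that change, here is a minimal
sketch of the caller-side convention. It is not the actual patch and does
not use Lucene's classes: NullNormsSketch, normFor() and decodeNorm() are
names made up for illustration, and the decoding shown is a placeholder,
not Lucene's real norm encoding.

```java
// Hypothetical sketch, not Lucene code: shows only the "null means 1.0" convention.
class NullNormsSketch {

    // Placeholder decoding (assumption): NOT Lucene's real byte-to-float norm encoding.
    static float decodeNorm(byte b) {
        return (b & 0xFF) / 128.0f;
    }

    // Callers go through this instead of indexing the array directly.
    // A null array means norms were omitted, so every document scores as 1.0.
    static float normFor(byte[] norms, int doc) {
        if (norms == null) {
            return 1.0f;
        }
        return decodeNorm(norms[doc]);
    }
}
```

The point of the null convention is that no per-document array is ever
allocated for a field that does not need one.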

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Friday, October 07, 2005 4:18 PM
To: java-dev@lucene.apache.org
Subject: Eliminating norms ... completley



Yonik and I have been looking at the memory requirements of an application
we've got.  We use a lot of indexed fields, primarily so I can do a lot
of numeric tests (using RangeFilter).   When I say "a lot" I mean
around 8,000 -- many of which are not used by all documents in the index.

Now there are some basic usage changes I can make to cut this number in
half, and some more complex biz rule changes I can make to get the number
down some more (at the expense of flexibility), but even then we'd have
around 1,000 -- which is still a lot more than the recommended "handful".

After discussing some options, I asked the question "Remind me again why
having lots of indexed fields makes the memory requirements jump up --
even if only a few documents use some field?" and Yonik reminded me about
the norm[] -- an array of bytes representing the field boost + length
boost for each document.  One of these arrays exists for every indexed
field.
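
To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes only what is described above (one byte per document per indexed
field, held in memory once a field's norms are loaded). The class and
method names are hypothetical, and the document count used in the note
below is an illustrative assumption, not a figure from our index.

```java
// Hypothetical helper for estimating norm memory; not part of Lucene.
class NormMemoryEstimate {

    // One byte per document for each indexed field whose norms are loaded.
    static long normBytes(long indexedFields, long maxDoc) {
        return indexedFields * maxDoc;
    }
}
```

With an assumed 1,000,000 documents, normBytes(8000, 1000000) comes to
8,000,000,000 bytes, and even at 1,000 fields it is still 1,000,000,000
bytes, regardless of how few documents actually use each field.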

So then I asked the $50,000,000 question:  "Is there any way to get rid of
this array for certain fields? ... or any way to get rid of it completely
for every field in a specific index?"

This may sound like a silly question for most IR applications where you
want length normalization to contribute to your scores, but in this
particular case most of these fields are only used to store a single
numeric value.  To be certain, there are some fields we have (or may add
in the future) that could benefit from having a norms[] ... but if it had
to be an all or nothing thing we could certainly live without them.

It seems to me that, in an ideal world, deciding whether or not you wanted
to store norms for a field would be like deciding whether you wanted to
store TermVectors for a field.  I can imagine a Field.isNormStored()
method ... but that seems like a pretty significant change to the existing
code base.
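
As a rough illustration of the per-field idea, the sketch below models the
flag on a stand-alone class. Everything in it is hypothetical: FieldSpec
and NormPlanner are invented names, and isNormStored() is the imagined
method mentioned above, not something Lucene's Field class provides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical field description; stands in for a Field with a norms flag.
class FieldSpec {
    final String name;
    final boolean indexed;
    final boolean normStored;

    FieldSpec(String name, boolean indexed, boolean normStored) {
        this.name = name;
        this.indexed = indexed;
        this.normStored = normStored;
    }

    // Norms only make sense for indexed fields that asked for them.
    boolean isNormStored() {
        return indexed && normStored;
    }
}

// Hypothetical writer-side decision: which fields get a norms array at all.
class NormPlanner {
    static List<String> fieldsNeedingNorms(List<FieldSpec> fields) {
        List<String> result = new ArrayList<String>();
        for (FieldSpec f : fields) {
            if (f.isNormStored()) {
                result.add(f.name);
            }
        }
        return result;
    }
}
```

Under this scheme the numeric-only fields would simply be created with
the flag off, and only the handful of text fields would carry norms.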


Alternately, I started wondering if it would be possible to write our own
IndexReader/IndexWriter subclasses that would ignore the norm info
completely (with maybe an optional list of field names the logic should be
limited to), and return nothing but fixed values for any parts of the code
base that wanted them.  Looking at SegmentReader and MultiReader this
looked very promising (especially considering the way SegmentReader uses a
system property to decide which actual class to use).  But I was less
enthusiastic when I started looking at the IndexWriter and DocumentWriter
classes .... there doesn't seem to be any clean way to subclass the
existing code base to eliminate the writing of the norms to the Directory
(curse those final classes, and private final methods).
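
The read side of that idea might look something like the sketch below. It
is only a sketch under stated assumptions: NormSource is a hypothetical
interface standing in for whatever the reader exposes, FixedNormSource is
an invented wrapper, and the fixed byte value is whatever the caller
passes in. It does nothing about the writing problem described above.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical stand-in for the norm-reading part of a reader; not a Lucene type.
interface NormSource {
    byte[] norms(String field);
    int maxDoc();
}

// Wraps another source and answers with one shared constant array
// for the listed fields (or for every field if the list is empty).
class FixedNormSource implements NormSource {
    private final NormSource delegate;
    private final Set<String> ignoredFields;
    private final byte[] shared;

    FixedNormSource(NormSource delegate, Set<String> ignoredFields, byte fixedNorm) {
        this.delegate = delegate;
        this.ignoredFields = ignoredFields;
        // A single array serves all ignored fields, so memory no longer
        // grows with the number of fields.
        this.shared = new byte[delegate.maxDoc()];
        Arrays.fill(this.shared, fixedNorm);
    }

    public byte[] norms(String field) {
        if (ignoredFields.isEmpty() || ignoredFields.contains(field)) {
            return shared;
        }
        return delegate.norms(field);
    }

    public int maxDoc() {
        return delegate.maxDoc();
    }
}
```

Sharing one array keeps existing callers working unchanged, at the cost
of maxDoc bytes in total rather than maxDoc bytes per field.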


So I'm curious what you guys think...

  1) Regarding the root problem: are there any other things you can think
     of besides norms[] that would contribute to the memory footprint
     needed by a large number of indexed fields?
  2) Can you think of a clean way for individual applications to eliminate
     norms (via subclassing the lucene code base - ie: no patching)
  3) Yonik is currently looking into what kind of patch it would take to
     optionally turn off norms (I'm not sure if he's looking at doing it
     "per field" or "per index").  Is that the kind of thing that would
     even be considered for getting committed?

--

-------------------------------------------------------------------
"Oh, you're a tricky one."                        Chris M Hostetter
     -- Trisha Weir                    [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

