Re: Numeric Range Restrictions: Queries vs Filters

2004-11-22 Thread Hoss

Of course, not only did I manage to forget to include the attachment, but
when I sent a reply with the code, mail.apache.org rejected it because it
was a ZIP file.

So let's see how mail.apache.or feels about 6 seperate text files.


: Date: Mon, 22 Nov 2004 18:25:24 -0800 (PST)
: Subject: Numeric Range Restrictions: Queries vs Filters
:
: (NOTE: numbers in [] indicate Footnotes)
:
: I'm rather new to Lucene (and this list), so if I'm grossly
: misunderstanding things, forgive me.
:
: One of my main needs as I investigate Search technologies is to restrict
: results based on Ranges of numeric values.  Looking over the archives of
: this list, it seems that lots of people have run into problems dealing
: with this.  In particular, whenever someone asks a question about "Numeric
: Ranges" the question seem to always involve one (or more) of the
: following:
:
:(a) Lexical sorting puts 11 in the range "1 TO 5"
:(b) Dates (or Dates and Times)
:(c) BooleanQuery$TooManyClauses Exceptions
:(d) Should I use a filter?
:
: (a) is a solved problem as long as you use a formatter like
: LongField.java[1]
:
: (b) is really nothing more then a special case of dealing with generic
: numeric values.  While there are certainly special purposes solutions that
: sometimes apply to dealing with Date ranges, any good solution for dealing
: with raw numeric ranges can be applied to Dates (and Times)
:
: (c) is a situation that seems to come up a lot because of the way
: RangeQuery works.  The rewrite method walks all of the Terms in the index
: starting with "lowerTerm" and builds up BooleanQuery containing a separate
: TermQuery for every Term found, until it reaches the upperTerm.  This
: causes a range search of "0001 TO 1000" to generate a BooleanQuery with N
: clauses, where N is the quantity of unique values in the field which are
: lexically greater then 0001 and lexically less then 1000.  depending on
: the nature of your data, this might be 0 BooleanClauses, or it might be
: 1000 BooleanClauses; but the list is built before the search is ever even
: executed.
:
: At first, this may seem really strange -- I know I was certainly confused
: -- but there is a very good reason for it: Ultimately RangeQuery still
: provides you with a meaningful score for each document, based on the
: frequency (and quantity) of terms that document has in the range [2].  In
: order to do that, it has to expand itself, but what if you don't care if
: your Range restriction impacts the Score? [3]
:
: Which brings us to...
:
: (c) Filtering.  Filters in general make a lot of sense to me.  They are a
: way to specify (at query time) that only a certain subset of the index
: should be considered for results.  The Filter class has a very straight
: forward API that seems very easy to subclass to get the behavior I want.
: The Query API on the other hand ... I freely admit, that I can't make
: heads or tails out of it.  I don't even know where I would begin to try
: and write a new subclass of Query if I wanted to.
:
: I would think that most people who want to do a "numeric range
: restriction" on their data, probably don't care about the Scoring benefits
: of RangeQuery.  Looking at the code base, the way DateFilter works seems
: like it provides an ideal solution to any sort of Range restriction (not
: just Dates) that *should* be more efficient then using RangeQuery when
: dealing with an unbounded value set. (Both approaches need to iterate over
: all of the terms in the specified field using TermEnum, but RangeQuery has
: to build up an set of BooleanQuery objects for each matching term, and
: then each of those queries have to help score the documents -- DateFilter
: on the other hand only has to maintain a single BitSet of documents that
: it finds as it iterates)
:
: But I was surprised then to see the following quote from "Erik Hatcher" in
: the archives:
:
:   "In fact, DateFilter by itself is practically of no use, I think." [4]
:
: ...Erik goes on to suggest that given "a set of canned date ranges", it
: doesn't really matter if you use a RangeQuery or a DateFilter -- as long
: as you cache them to reuse them (with something like CachingWrappingFilter
: or QueryFilter).  I'm hoping that he might elaborate on that comment?
:
: As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter
: to both convince myself it could work, and to do a comparison between it
: and RangeQuery. [5] Based on my limited tests, using a Filter to restrict
: to a Range is a lot faster then using RangeQuery -- independent of
: caching.
:
: The attachment contains my RangeFilter, a unit test that demonstrates it,
: and a Benchmarking unit test that does a side-by-side comparison with
: RangeQuery [6].  If developers feel that this class is useful, then by all
: means roll it into the code base.  (90% of it is cut/pasted from
: DateFilter/RangeQuery anyway)
:
:
: Comments? ... Questions? ... Answers?
:
:
:
: Footnotes:
:
: [1] It

Numeric Range Restrictions: Queries vs Filters

2004-11-22 Thread Hoss
(NOTE: numbers in [] indicate Footnotes)

I'm rather new to Lucene (and this list), so if I'm grossly
misunderstanding things, forgive me.

One of my main needs as I investigate Search technologies is to restrict
results based on Ranges of numeric values.  Looking over the archives of
this list, it seems that lots of people have run into problems dealing
with this.  In particular, whenever someone asks a question about "Numeric
Ranges" the question seem to always involve one (or more) of the
following:

   (a) Lexical sorting puts 11 in the range "1 TO 5"
   (b) Dates (or Dates and Times)
   (c) BooleanQuery$TooManyClauses Exceptions
   (d) Should I use a filter?

(a) is a solved problem as long as you use a formatter like
LongField.java[1]

(b) is really nothing more then a special case of dealing with generic
numeric values.  While there are certainly special purposes solutions that
sometimes apply to dealing with Date ranges, any good solution for dealing
with raw numeric ranges can be applied to Dates (and Times)

(c) is a situation that seems to come up a lot because of the way
RangeQuery works.  The rewrite method walks all of the Terms in the index
starting with "lowerTerm" and builds up BooleanQuery containing a separate
TermQuery for every Term found, until it reaches the upperTerm.  This
causes a range search of "0001 TO 1000" to generate a BooleanQuery with N
clauses, where N is the quantity of unique values in the field which are
lexically greater then 0001 and lexically less then 1000.  depending on
the nature of your data, this might be 0 BooleanClauses, or it might be
1000 BooleanClauses; but the list is built before the search is ever even
executed.

At first, this may seem really strange -- I know I was certainly confused
-- but there is a very good reason for it: Ultimately RangeQuery still
provides you with a meaningful score for each document, based on the
frequency (and quantity) of terms that document has in the range [2].  In
order to do that, it has to expand itself, but what if you don't care if
your Range restriction impacts the Score? [3]

Which brings us to...

(c) Filtering.  Filters in general make a lot of sense to me.  They are a
way to specify (at query time) that only a certain subset of the index
should be considered for results.  The Filter class has a very straight
forward API that seems very easy to subclass to get the behavior I want.
The Query API on the other hand ... I freely admit, that I can't make
heads or tails out of it.  I don't even know where I would begin to try
and write a new subclass of Query if I wanted to.

I would think that most people who want to do a "numeric range
restriction" on their data, probably don't care about the Scoring benefits
of RangeQuery.  Looking at the code base, the way DateFilter works seems
like it provides an ideal solution to any sort of Range restriction (not
just Dates) that *should* be more efficient then using RangeQuery when
dealing with an unbounded value set. (Both approaches need to iterate over
all of the terms in the specified field using TermEnum, but RangeQuery has
to build up an set of BooleanQuery objects for each matching term, and
then each of those queries have to help score the documents -- DateFilter
on the other hand only has to maintain a single BitSet of documents that
it finds as it iterates)

But I was surprised then to see the following quote from "Erik Hatcher" in
the archives:

  "In fact, DateFilter by itself is practically of no use, I think." [4]

...Erik goes on to suggest that given "a set of canned date ranges", it
doesn't really matter if you use a RangeQuery or a DateFilter -- as long
as you cache them to reuse them (with something like CachingWrappingFilter
or QueryFilter).  I'm hoping that he might elaborate on that comment?

As a test, I wrote a "RangeFilter" which borrows heavily from DateFilter
to both convince myself it could work, and to do a comparison between it
and RangeQuery. [5] Based on my limited tests, using a Filter to restrict
to a Range is a lot faster then using RangeQuery -- independent of
caching.

The attachment contains my RangeFilter, a unit test that demonstrates it,
and a Benchmarking unit test that does a side-by-side comparison with
RangeQuery [6].  If developers feel that this class is useful, then by all
means roll it into the code base.  (90% of it is cut/pasted from
DateFilter/RangeQuery anyway)


Comments? ... Questions? ... Answers?



Footnotes:

[1] It seems to me this class is extremely useful, does anyone know
if there's a particular reason it hasn't been added to the main Lucene
codebase?
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg04790.html

[2] Take a look at RangeQueryScoreDemo.java in the attachment, which
produces output something like this...
   Range Search for: 'apple' TO 'dog'
   0.40924072 ... bed dog emu
   0.38014847 ... DOG
   0.2825246 ... cat
   0.17657787 ... apple emu
   0.12

Re: downloading Lucene 1.4.2

2004-11-22 Thread Hoss

In the same Lucene News section where the announcement about 1.4.2 is
listed, there is a link that says "Binary and source distributions are
available here." ...

http://cvs.apache.org/dist/jakarta/lucene/v1.4.2/

I got really confused yesterday after I already had the binary version
and i was looking for the source and found the link you listed.  Does
anyone know how to go about getting http://www.apache.org/dist/ updated?

: According to the Lucene homepage, Lucene 1.4.2 was released
: on October 1, 2004
:
: However, the "dist" on www.apache.org does not have a copy of
: Lucene 1.4.2
:
:http://www.apache.org/dist/jakarta/lucene/binaries/
:
: Where can I download Lucene 1.4.2?



--

---
"Oh, you're a tricky one."Chris M Hostetter
 -- Trisha Weir[EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]