Hits.doc(x) and range queries

2004-09-14 Thread roy-lucene-user
Hi guys!

I've posted previously that Hits.doc(x) was taking a long time.  Turns out it
has to do with a date range in our query.  We usually do date ranges like this:
Date:[(lucene date field) - (lucene date field)]

Sometimes the begin date is 0, which is what we get from
DateField.dateToString(new Date(0)).

This is when getting our search results from the Hits object takes an absurd
amount of time.  It's usually each time the Hits object attempts to get more
results from an IndexSearcher (i.e., every 100?).

It also takes up more memory...

I was wondering why it affects the search so much even though we're only
returning 350 or so results.  Does the QueryParser do something similar to the
DateFilter on range queries?  Would it be better to use a DateFilter?

We're using Lucene 1.2 (with plans to upgrade).  Do newer versions of Lucene
have this problem?

Roy.
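
A possible explanation, sketched under assumptions: if a range query is expanded into one clause per indexed term that falls inside the range, the cost depends on how many distinct date terms the range spans, not on how many documents finally match. The toy model below uses a sorted set as a stand-in for the term dictionary. The class name, the YYYYMMDD term strings and the "0" lower bound are illustrative only; this is not Lucene code, and the real DateField encoding differs.

```java
import java.util.Arrays;
import java.util.TreeSet;

/** Toy model of a sorted term dictionary; not Lucene code. */
final class RangeExpansionModel {

    /** Counts the distinct terms an inclusive range [lower, upper] would expand to. */
    static int expandedTerms(TreeSet<String> terms, String lower, String upper) {
        // subSet with both flags true is inclusive on both ends, like field:[a TO b]
        return terms.subSet(lower, true, upper, true).size();
    }

    public static void main(String[] args) {
        // hypothetical date terms present in an index
        TreeSet<String> dates = new TreeSet<>(Arrays.asList(
            "19790101", "19850615", "19991231", "20040101", "20040914"));
        // a narrow range touches only the terms inside it
        System.out.println(expandedTerms(dates, "20040101", "20040914"));
        // a lower bound that sorts before everything touches every term up to the end date
        System.out.println(expandedTerms(dates, "0", "20040914"));
    }
}
```

In this model the second range enumerates every term in the set, which would be consistent with the slowdown and extra memory described above when the begin date is 0.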

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Open-ended range queries

2004-06-11 Thread Erik Hatcher
On Jun 10, 2004, at 10:37 PM, Terry Steichen wrote:
> Speaking for myself, only a small number of my code modules currently treat
> "null" as the open-ended range query term parameter.  If the syntax change
> from 'null' --> '*' was deemed otherwise desirable and the syntax transition
> made very clearly, I could personally adjust to it without too much
> difficulty.
>
> I agree that the proposed '*' syntax does seem more logical.  If a change to
> that syntax were made such that the old "null" syntax for the upper bound
> was retained for backward compatibility, such a transition would be
> completely painless.

Just to clarify, since Terry's response implies this is not understood:
there is *nothing* special about "null" currently.  It is simply being
treated as term text.  So adding special "*" handling would NOT change how
"null" currently works.

In June of 2002 (!) "null" and "NULL" (and "nULL", "Null", etc) were 
removed as being special from what I see in the diff.

Furthermore, to achieve the proposed "*" handling, you can do this 
yourself now by subclassing QueryParser and overriding getRangeQuery:

  protected Query getRangeQuery(String field, Analyzer analyzer,
                                String part1, String part2,
                                boolean inclusive)
      throws ParseException {
    // "*" on either side means "no bound": pass null for that term
    return new RangeQuery(
        "*".equals(part1) ? null : new Term(field, part1),
        "*".equals(part2) ? null : new Term(field, part2),
        inclusive);
  }
(a little more is needed if you want to keep the date range handling).
Note, you cannot do field:[* TO *] to make it wide-open - RangeQuery 
does not allow this.
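
The mapping described above can be tried out without Lucene on the classpath. The following is only a sketch: OpenEndedBounds, bound and inRange are hypothetical names, plain strings stand in for Term objects, and the both-ends-open check simply mirrors the restriction mentioned above rather than reproducing RangeQuery itself.

```java
/** Toy stand-in for the "*" handling; it does not use Lucene classes. */
final class OpenEndedBounds {

    /** Returns null for the "*" marker (meaning "no bound"), else the text itself. */
    static String bound(String part) {
        return "*".equals(part) ? null : part;
    }

    /** True when value lies in [lower, upper]; a null bound is treated as open. */
    static boolean inRange(String value, String lower, String upper) {
        if (lower == null && upper == null) {
            // mirrors the restriction noted above: at least one end must be given
            throw new IllegalArgumentException("at least one bound is required");
        }
        boolean aboveLower = lower == null || value.compareTo(lower) >= 0;
        boolean belowUpper = upper == null || value.compareTo(upper) <= 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // models pub_date:[20040501 TO *]
        System.out.println(inRange("20040610", bound("20040501"), bound("*")));
        // models pub_date:[* TO 20040501]
        System.out.println(inRange("20040610", bound("*"), bound("20040501")));
    }
}
```

Unlike the literal text "null", a null bound here is open for any value, whatever characters it starts with.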

My proposal is this (_after_ 1.4 goes final):
  - Add the above logic to QueryParser.
  - Modify RangeQuery.toString to output the "*" when the term is null, 
and also if the start term is "" (RangeQuery's constructor modifies the 
beginning term to "" if it is null).

If there are no objections to this plan, I'll add this as a Bugzilla 
issue as a reminder.  I don't want to touch 1.4's codebase - no point 
in adding a feature at this stage that can already be achieved with the 
simple code above.

Erik


Re: Another way to handle large numeric range queries

2004-06-10 Thread Erik Hatcher
On Jun 9, 2004, at 1:05 PM, Don Gilbert wrote:
>> I'm particularly interested in the XPath stuff I saw in LGQueryParser.
>
>    * xpathFieldParse
>    'xpath' parser: param allfields[], with query or field[] possibly
> having wild-card notation:   *.start  annotation.*.text
> allowing '/' and '.' field separator
>
> This is an *unfinished* attempt to support xpath style queries with
> wild-cards or parts when you have indexed XML data, such as
>
>   query: /annotation/*/text:term
>
> I had to put this aside when I saw the problem of pulling the xpath
> fields from a query string would take a fair amount of thought and code.

Keep us posted if/when you return to this.  I would love to see XPath 
query capability on hierarchical content indexed in Lucene.  Jakarta 
Slide, for example, is starting to build in Lucene indexing capability 
- although I don't think they are doing anything with XPath query 
expressions yet.

Anyone doing anything similar?  Implementation ideas?

Erik


Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
At one point it definitely supported null for either term.  I think 
that has been removed/forgotten in the later revisions of the 
QueryParser...

Scott
On Jun 10, 2004, at 1:24 PM, Erik Hatcher wrote:
> On Jun 10, 2004, at 2:13 PM, Terry Steichen wrote:
>> Actually, QueryParser does support open-ended ranges like :  [term TO null].
>> Doesn't work for the lower end of the range (though that's usually less of a
>> problem).
>
> It supports "null"?  Are you sure?  If so, I'm very confused about it
> because I don't see where in the grammar it has any special handling
> like that.  Could you show an example that demonstrates this?
>
> Erik



Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Well, I'm using 1.4 RC3 and the null range upper limit works just fine for
searches in two of my fields; one is in the form of a canonical date (e.g.,
20040610) and the other is in the form of a padded word count (e.g., 01500
for 1500).  The syntax would be pub_date:[20040501 TO null] (dates later
than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more
words).

Regards,

Terry

PS: This use of null has worked this way since at least 1.2.  As I recall,
way back when, null also worked as the first term limit (but no longer
does).




Re: Open-ended range queries

2004-06-10 Thread Erik Hatcher
On Jun 10, 2004, at 4:07 PM, Terry Steichen wrote:
> Well, I'm using 1.4 RC3 and the null range upper limit works just fine for
> searches in two of my fields; one is in the form of a canonical date (e.g.,
> 20040610) and the other is in the form of a padded word count (e.g., 01500
> for 1500).  The syntax would be pub_date:[20040501 TO null] (dates later
> than April 30, 2004) and s_words:[01000 TO null] (articles with 1000 or more
> words).

Ah.  It works for you because you have numeric values and lexically "null"
is greater than any of them.  It is still using it as a lexical term
value, and not truly making the end open-ended.

This is why null doesn't work at the beginning for you either.  It's 
just being treated as text, just like your numbers are.
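
The lexical effect can be checked with plain string comparison. This is a small sketch, assuming terms are ordered the way String.compareTo orders them; LexicalNull and inRange are illustrative names, not part of Lucene.

```java
/** Plain string comparison, used here as a stand-in for term ordering. */
final class LexicalNull {

    /** True when value lies in [lower, upper] under String.compareTo ordering. */
    static boolean inRange(String value, String lower, String upper) {
        return value.compareTo(lower) >= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        // "null" as upper bound: digits sort before lowercase letters, so the date passes
        System.out.println(inRange("20040610", "20040501", "null"));  // true
        // "null" as lower bound: no digit-only term is >= "null"
        System.out.println(inRange("20040610", "null", "20040501"));  // false
        // a term starting with a later letter falls outside [x TO null]
        System.out.println(inRange("zebra", "20040501", "null"));     // false
    }
}
```

The last case shows why this is not a true open end: any term that sorts after "null" is excluded.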

> PS: This use of null has worked this way since at least 1.2.  As I recall,
> way back when, null also worked as the first term limit (but no longer
> does).

If so, then something serious broke.  I've not the time to check the 
cvs logs on this, but I cannot imagine that we removed something like 
this.  If anyone cares to dig up the diff where we removed/broke this, 
I'd be grateful.

Erik


Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
It looks to me like Revision 1.18 broke it.


Re: Open-ended range queries

2004-06-10 Thread Scott ganyo
Well, I do like the *, but apparently there are some people that are  
using this with the null...

Scott
On Jun 10, 2004, at 7:15 PM, Erik Hatcher wrote:
On Jun 10, 2004, at 4:54 PM, Scott ganyo wrote:
It looks to me like Revision 1.18 broke it.
It seems this could be it:
revision 1.18
date: 2002/06/25 00:05:31;  author: briangoetz;  state: Exp;  lines: +62 -33
Support for new range query syntax.  The delimiter is " TO ", but is optional
for backward compatibility with previous syntax.  If the range arguments
match the format supported by DateFormat.getDateInstance(DateFormat.SHORT),
then they will be converted into the appropriate date strings a la DateField.

Added Field.Keyword constructor for Date-valued arguments.
Optimized DateField.timeToString function.

But geez, June 2002 and no one has complained since?

Given that this is so outdated, I'm not sure what the right course of
action is.  There are lots more Lucene users now than there were then.
Would adding NULL back be what folks want?  What about simply an
asterisk to denote open-endedness?  [* TO term] or [term TO *]

For completeness, here is the diff:
% cvs diff -u -r 1.17 -r 1.18 QueryParser.jj
Index: QueryParser.jj
===================================================================
RCS file: /home/cvs/jakarta-lucene/src/java/org/apache/lucene/queryParser/QueryParser.jj,v
retrieving revision 1.17
retrieving revision 1.18
diff -u -r1.17 -r1.18
--- QueryParser.jj  20 May 2002 15:45:43 -0000  1.17
+++ QueryParser.jj  25 Jun 2002 00:05:31 -0000  1.18
@@ -65,8 +65,11 @@

 import java.util.Vector;
 import java.io.*;
+import java.text.*;
+import java.util.*;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.analysis.*;
+import org.apache.lucene.document.*;
 import org.apache.lucene.search.*;
 /**
@@ -218,35 +221,30 @@
   private Query getRangeQuery(String field,
   Analyzer analyzer,
-  String queryText,
+  String part1,
+  String part2,
   boolean inclusive)
   {
-// Use the analyzer to get all the tokens.  There should be 1 or 2.
-TokenStream source = analyzer.tokenStream(field,
-  new StringReader(queryText));
-Term[] terms = new Term[2];
-org.apache.lucene.analysis.Token t;
+boolean isDate = false, isNumber = false;

-for (int i = 0; i < 2; i++)
-{
-  try
-  {
-t = source.next();
-  }
-  catch (IOException e)
-  {
-t = null;
-  }
-  if (t != null)
-  {
-String text = t.termText();
-if (!text.equalsIgnoreCase("NULL"))
-{
-  terms[i] = new Term(field, text);
-}
-  }
+try {
+  DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
+  df.setLenient(true);
+  Date d1 = df.parse(part1);
+  Date d2 = df.parse(part2);
+  part1 = DateField.dateToString(d1);
+  part2 = DateField.dateToString(d2);
+  isDate = true;
 }
-return new RangeQuery(terms[0], terms[1], inclusive);
+catch (Exception e) { }
+
+if (!isDate) {
+  // @@@ Add number support
+}
+
+return new RangeQuery(new Term(field, part1),
+  new Term(field, part2),
+  inclusive);
   }
   public static void main(String[] args) throws Exception {
@@ -282,7 +280,7 @@
 | <#_WHITESPACE: ( " " | "\t" ) >
 }
-<DEFAULT> SKIP : {
+<DEFAULT, RangeIn, RangeEx> SKIP : {
   <<_WHITESPACE>>
 }
@@ -303,14 +301,28 @@
 | <PREFIXTERM:  <_TERM_START_CHAR> (<_TERM_CHAR>)* "*" >
 | <WILDTERM:  <_TERM_START_CHAR>
   (<_TERM_CHAR> | ( [ "*", "?" ] ))* >
-| <RANGEIN:   "[" ( ~[ "]" ] )+ "]">
-| <RANGEEX:   "{" ( ~[ "}" ] )+ "}">
+| <RANGEIN_START: "[" > : RangeIn
+| <RANGEEX_START: "{" > : RangeEx
 }
 <Boost> TOKEN : {
 <NUMBER:    (<_NUM_CHAR>)+ ( "." (<_NUM_CHAR>)+ )? > : DEFAULT
 }
+<RangeIn> TOKEN : {
+<RANGEIN_TO: "TO">
+| <RANGEIN_END: "]"> : DEFAULT
+| <RANGEIN_QUOTED: "\"" (~["\""])+ "\"">
+| <RANGEIN_GOOP: (~[ " ", "]" ])+ >
+}
+
+<RangeEx> TOKEN : {
+<RANGEEX_TO: "TO">
+| <RANGEEX_END: "}"> : DEFAULT
+| <RANGEEX_QUOTED: "\"" (~["\""])+ "\"">
+| <RANGEEX_GOOP: (~[ " ", "}" ])+ >
+}
+
 // *   Query  ::= ( Clause )*
 // *   Clause ::= ["+", "-"] [<TERM> ":"] ( <TERM> | "(" Query ")" )
@@ -387,7 +399,7 @@
 Query Term(String field) : {
-  Token term, boost=null, slop=null;
+  Token term, boost=null, slop=null, goop1, goop2;
   boolean prefix = false;
   boolean wildcard = false;
   boolean fuzzy = false;
@@ -415,12 +427,29 @@
else
  q = getFieldQuery(field, analyzer, term.image);
  }
- | ( term=<RANGEIN> { rangein=true; } | term=<RANGEEX> )
+ | ( <RANGEIN_START> ( goop1=<RANGEIN_GOOP>|goop1=<RANGEIN_QUOTED> )
+ [ <RANGEIN_TO> ] ( goop2=<RANGEIN_GOOP>|goop2=<RANGEIN_QUOTED> )
+ <RANGEIN_END> )
+   [ <CARAT> boost=<NUMBER> ]
+{
+  if (goop1.kind == RANGEIN_QUOTED)
+goop1.image = goop1.image.substring(1,  

Re: Open-ended range queries

2004-06-10 Thread Terry Steichen
Speaking for myself, only a small number of my code modules currently treat
"null" as the open-ended range query term parameter.  If the syntax change
from 'null' --> '*' was deemed otherwise desirable and the syntax transition
made very clearly, I could personally adjust to it without too much
difficulty.

I agree that the proposed '*' syntax does seem more logical.  If a change to
that syntax were made such that the old "null" syntax for the upper bound
was retained for backward compatibility, such a transition would be
completely painless.

Regards,

Terry


Re: Another way to handle large numeric range queries

2004-06-09 Thread Erik Hatcher
On Jun 8, 2004, at 10:55 PM, Don Gilbert wrote:
> Find this as part of the 'LuceGene' package for searching
> genome and bioinformatics databases at http://www.gmod.org/lucegene/
> with lucene related source code in cvs here:

Nice stuff!

> http://cvs.sourceforge.net/viewcvs.py/gmod/lucegene/src/org/eugenes/index/
> LGQueryParser.java -- extension of QueryParser for NumRangeQuery (  other)

I'm particularly interested in the XPath stuff I saw in LGQueryParser.
Could you tell us more about what you do with that and how it works?

> BioDataAnalyzer.java -- NumberField formats field for indexing

*whew* - that is one complex piece of code.  I like the DebugFilter idea.

Erik



Re: Another way to handle large numeric range queries

2004-06-09 Thread Don Gilbert

Erik,

Thanks for the comments.  

 I'm particularly interested in the XPath stuff I saw in LGQueryParser.   

   * xpathFieldParse
   'xpath' parser: param allfields[], with query or field[] possibly
having wild-card notation:   *.start  annotation.*.text 
allowing '/' and '.' field separator

This is an *unfinished* attempt to support xpath style queries with
wild-cards or parts when you have indexed XML data, such as 

  query: /annotation/*/text:term
   
I had to put this aside when I saw the problem of pulling the xpath
fields from a query string would take a fair amount of thought and code.

  BioDataAnalyzer.java -- NumberField formats field for indexing
 
 *whew* - that is one complex piece of code.  I like the DebugFilter  

Mostly it is just a collection of small 10 line classes, packaged
as inner classes (I hate java's insistence on 1 file/class :)
Some of the complexity there is because the standard lucene analyzer
won't work for biology data (which uses a lot of symbols, upper/lowercase,
etc.) and this code allows one to build an analyzer/indexer 
which is tuned to different types in each field of data.
The configuration for a given biology database parsing includes 
statements like:

## field tokenizers - base CharTokenizer, work before Filters 
tokenizer.SYM=org.eugenes.index.BiodataAnalyzer$DataTokenizer

## field filters - base TokenFilter, only are used if fieldtype=Text or UnStored
tokenfilter.BLOC.start=org.eugenes.index.BiodataAnalyzer$NumberFilter

## fieldrecoder classes manipulate data before indexing, maybe making new fields
fieldrecoder.BLOC=LucegeneIndexers$Location_FieldRecoder

This method then generates a TokenStream using such field-specific parsers:

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = null;
    // use the tokenizer configured for this field, else the standard tokenizer
    try {
      result = getTokenizer(fieldName, reader);
    } catch (Exception e) {
      result = new org.apache.lucene.analysis.standard.StandardTokenizer(reader);
    }
    // wrap it in the filter configured for this field, else a LowerDataFilter
    try {
      result = getFilter(fieldName, result);
    } catch (Exception e) {
      LowerDataFilter ldf = new LowerDataFilter();
      ldf.setInput(result);
      result = ldf;
    }
    return result;
  }


-- Don Gilbert
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [EMAIL PROTECTED]://marmot.bio.indiana.edu/




Re: Another way to handle large numeric range queries

2004-06-09 Thread Claude Devarenne
 first, Term last, boolean inc);
-- full numeric (integer) range query, can handle large ranges.
-- makes a BitSet of documents within range once, and feeds back to
Searcher thru score(HitCollector c, int end) as often as called.
-- query semantics are same as for RangeQuery
-- implicit assumptions are
 -- first, last Term have integer values, as does indexed field
 -- indexed field is recoded for alphanumeric sorting;
   e.g.  2 -> 02, 10 -> 10, -3 -> -03
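
The recoding assumption in the last item can be illustrated with a short sketch. It covers non-negative integers only; PaddedNumbers and pad are hypothetical names and this is not the LuceGene NumberField code, whose handling of negative values is not shown in this thread.

```java
import java.util.Arrays;

/** Sketch of fixed-width zero padding for non-negative integers. */
final class PaddedNumbers {

    /** Left-pads value with zeros to the given width, e.g. pad(1500, 5) gives "01500". */
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values need a separate encoding");
        }
        String digits = Integer.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException(value + " does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        String[] raw = {"10", "2", "1500"};
        String[] padded = {pad(10, 5), pad(2, 5), pad(1500, 5)};
        Arrays.sort(raw);     // plain string order: 10, 1500, 2
        Arrays.sort(padded);  // padded string order: 00002, 00010, 01500
        System.out.println(Arrays.toString(raw));
        System.out.println(Arrays.toString(padded));
    }
}
```

With equal-width terms, string order and numeric order agree, which is what a term-based range over numbers relies on.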
Find this as part of the 'LuceGene' package for searching
genome and bioinformatics databases at http://www.gmod.org/lucegene/
with lucene related source code in cvs here:
http://cvs.sourceforge.net/viewcvs.py/gmod/lucegene/src/org/eugenes/index/
NumRangeQuery.java -- range searches of integer fields.
LGQueryParser.java -- extension of QueryParser for NumRangeQuery (  
other)
BioDataAnalyzer.java -- NumberField formats field for indexing

-- Don Gilbert
Date: Tue, 18 May 2004 13:35:55 -0700
From: Andy Goodell [EMAIL PROTECTED]
Subject: How to handle range queries over large ranges and avoid Too  
Many Boolean cla

In our application we had a similar problem with non-date ranges until
we realized that it wasnt so much that we were searching for the
values in the range as restricting the search to that range, and then
we used an extension to the org.apache.lucene.search.Filter class, and
our implementation got much simpler and faster.
-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- [EMAIL PROTECTED]://marmot.bio.indiana.edu/


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-19 Thread Morus Walter
Claude Devarenne writes:
 Hi,
 
 I have over 60,000 documents in my index which is slightly over a 1 GB 
 in size.  The documents range from the late seventies up to now.  I 
 have indexed dates as a keyword field using a string because the dates 
 are in YYYYMMDD format.  When I do range queries things are OK as long 
 as I don't exceed the built-in number of boolean clauses, so that's a 
 range of 3 years, e.g. 1979 to 1981.  The users are not only doing 
 complex queries but also want to query over long ranges, e.g. [19790101 
 TO 19991231].
 
 Given these requirements, I am thinking of doing a query without the 
 date range, bring the unique ids back from the hits and then do a date 
 query in the SQL database I have that contains the same data.  Another 
 alternative is to do the query without the date range in Lucene and 
 then sort the results within the range.  I still have to learn how to 
 use the new sorting code and confessed I did not have time to look at 
 it yet.
 
 Is there a simpler, easier way to do this?
 
I think it would be worth taking a look at the sorting code.

The idea of the sorting code is to have an array of the dates for each doc
in memory and access this array for sorting.
Now sorting isn't the only thing one might use this array for.
Doing a range check is another.
So you might extend the sorting code by a range selection.

There is no code for this in Lucene and you have to create your own searcher,
but it gives you a fast way to search and sort by date.

I did this independently from the new sorting code (I just started a little
too early) and it works quite well.
The only drawback of this (and the new sorting code) is that it requires
an array of field values that must be rebuilt each time the index changes.
Shouldn't be a problem for 60000 documents.

Morus
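
The array idea can be sketched roughly as follows. This is a toy model under assumptions, not the searcher described above: ArrayRangeCheck and restrict are hypothetical names, dates are stored as YYYYMMDD integers, and doc ids are plain array indexes.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of range selection via a per-document value array; not Lucene code. */
final class ArrayRangeCheck {

    /** Keeps only the hit doc ids whose date (looked up by doc id) lies in [from, to]. */
    static List<Integer> restrict(int[] hits, int[] dateByDoc, int from, int to) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : hits) {
            int date = dateByDoc[doc];          // one array read per hit, no term expansion
            if (date >= from && date <= to) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // index = doc id, value = date as YYYYMMDD; rebuilt whenever the index changes
        int[] dateByDoc = {19790101, 19850615, 19991231, 20040101};
        int[] hits = {0, 2, 3};                 // docs matched by the non-date part of the query
        System.out.println(restrict(hits, dateByDoc, 19790101, 19991231));
    }
}
```

The range check costs one comparison per hit regardless of how wide the range is; the price is the memory for the array and the rebuild mentioned above.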




Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-19 Thread Claude Devarenne
Thanks, I will look at the sorting code.  Sorting results by date is 
next on my list.  For now, I only have a small number of documents but the 
set is to grow to over 8 million documents for the collection I am 
working on.  Another collection we have is 40 million documents or so.  
From what you say it seems to me that sorting will not scale when 
I get to a larger number of documents.  I am considering using an SQL 
back end to implement sorting: bring back the unique IDs from Lucene 
and then sort in SQL.

Claude


How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-18 Thread Claude Devarenne
Hi,
I have over 60,000 documents in my index, which is slightly over 1 GB 
in size.  The documents range from the late seventies up to now.  I 
have indexed dates as a keyword field using a string because the dates 
are in YYYYMMDD format.  When I do range queries things are OK as long 
as I don't exceed the built-in number of boolean clauses, so that's a 
range of 3 years, e.g. 1979 to 1981.  The users are not only doing 
complex queries but also want to query over long ranges, e.g. [19790101 
TO 19991231].

Given these requirements, I am thinking of doing a query without the 
date range, bringing the unique ids back from the hits and then doing a date 
query in the SQL database I have that contains the same data.  Another 
alternative is to do the query without the date range in Lucene and 
then sort the results within the range.  I still have to learn how to 
use the new sorting code and confess I have not had time to look at 
it yet.

Is there a simpler, easier way to do this?
Claude


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-18 Thread Ype Kingma
On Tuesday 18 May 2004 19:38, Claude Devarenne wrote:
 Hi,

 I have over 60,000 documents in my index which is slightly over a 1 GB
 in size.  The documents range from the late seventies up to now.  I
 have indexed dates as a keyword field using a string because the dates
 are in YYYYMMDD format.  When I do range queries things are OK as long
 as I don't exceed the built-in number of boolean clauses, so that's a
 range of 3 years, e.g. 1979 to 1981.  The users are not only doing
 complex queries but also want to query over long ranges, e.g. [19790101
 TO 19991231].

 Given these requirements, I am thinking of doing a query without the
 date range, bring the unique ids back from the hits and then do a date
 query in the SQL database I have that contains the same data.  Another
 alternative is to do the query without the date range in Lucene and
 then sort the results within the range.  I still have to learn how to
 use the new sorting code and confessed I did not have time to look at
 it yet.

 Is there a simpler, easier way to do this?

I wouldn't know of a simpler and easier way, but there is another way
to reduce the number of clauses involved in long date ranges.
This can be done by indexing not only YYYYMMDD but also YYYYMM and
YYYY, and adapting the query range mechanism to use the shorter term
whenever possible. (YYY and YYYYMMD might also be useful.)
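
One way the range mechanism might be adapted is sketched below, assuming each document is indexed under its day (YYYYMMDD), month (YYYYMM) and year (YYYY) terms. DateRangeTerms and cover are hypothetical names; the sketch only produces the list of terms to OR together and leaves the decade and ten-day prefixes out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

/** Covers a date range with year, month and day terms, preferring the coarsest that fits. */
final class DateRangeTerms {
    private static final DateTimeFormatter DAY = DateTimeFormatter.ofPattern("yyyyMMdd");
    private static final DateTimeFormatter MONTH = DateTimeFormatter.ofPattern("yyyyMM");

    /** Returns the terms whose union is exactly the inclusive range [from, to]. */
    static List<String> cover(LocalDate from, LocalDate to) {
        List<String> terms = new ArrayList<>();
        LocalDate d = from;
        while (!d.isAfter(to)) {
            LocalDate yearEnd = LocalDate.of(d.getYear(), 12, 31);
            LocalDate monthEnd = d.withDayOfMonth(d.lengthOfMonth());
            if (d.getDayOfYear() == 1 && !yearEnd.isAfter(to)) {
                terms.add(String.valueOf(d.getYear()));   // whole year fits: one YYYY term
                d = yearEnd.plusDays(1);
            } else if (d.getDayOfMonth() == 1 && !monthEnd.isAfter(to)) {
                terms.add(d.format(MONTH));               // whole month fits: one YYYYMM term
                d = monthEnd.plusDays(1);
            } else {
                terms.add(d.format(DAY));                 // otherwise a single YYYYMMDD term
                d = d.plusDays(1);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // the long range from the question: one term per year instead of one per day
        System.out.println(cover(LocalDate.of(1979, 1, 1), LocalDate.of(1999, 12, 31)).size());
        // a range with ragged ends mixes day and month terms
        System.out.println(cover(LocalDate.of(2004, 5, 30), LocalDate.of(2004, 7, 2)));
    }
}
```

For [19790101 TO 19991231] this yields 21 year terms, where a day-level expansion could touch up to 7,670 distinct terms.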


Kind regards,
Ype





Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-18 Thread Claude Devarenne
Thanks, I'll try that.  It would be nice too if I could extend Field (it 
is a final class) and create a numerical field.  Is that not desirable?

Claude


Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-18 Thread Andy Goodell
In our application we had a similar problem with non-date ranges until
we realized that we were not so much searching for the values in the
range as restricting the search to that range.  We then used an
extension of the org.apache.lucene.search.Filter class, and our
implementation got much simpler and faster.
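
To illustrate the difference between searching for a range and restricting
to it, here is a small sketch in plain Java. It does not use Lucene itself:
the sorted map is a toy stand-in for the terms of one field, and all names
are invented. The idea is the one a Filter relies on, namely to build one
bit per document and AND it with the hits, so the number of terms in the
range never becomes a number of boolean clauses:

```java
import java.util.BitSet;
import java.util.NavigableMap;
import java.util.TreeMap;

class RangeBitsSketch {
    /**
     * Marks every document whose term lies in [lower, upper] (inclusive).
     * postings is a toy stand-in for one field of an index: term -> doc ids.
     */
    static BitSet rangeBits(NavigableMap<String, int[]> postings,
                            String lower, String upper, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (int[] docs : postings.subMap(lower, true, upper, true).values()) {
            for (int doc : docs) {
                bits.set(doc);
            }
        }
        return bits;
    }

    /** Hypothetical sample data: four documents with one date term each. */
    static NavigableMap<String, int[]> sample() {
        NavigableMap<String, int[]> p = new TreeMap<>();
        p.put("19781105", new int[] {0});
        p.put("19790101", new int[] {1});
        p.put("19850614", new int[] {2});
        p.put("20040518", new int[] {3});
        return p;
    }

    public static void main(String[] args) {
        BitSet allowed = rangeBits(sample(), "19790101", "19991231", 4);
        BitSet hits = new BitSet(4);   // pretend the text query matched docs 0, 2 and 3
        hits.set(0);
        hits.set(2);
        hits.set(3);
        hits.and(allowed);             // keep only hits inside the date range
        System.out.println(hits);      // prints {2}
    }
}
```

The bit set can be kept around and reused for every query with the same
range, which is where most of the speedup would come from.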

- andy g

On Tue, 18 May 2004 10:38:01 -0700, Claude Devarenne
[EMAIL PROTECTED] wrote:
 
 Hi,
 
 I have over 60,000 documents in my index which is slightly over 1 GB
 in size.  The documents range from the late seventies up to now.  I
 have indexed dates as a keyword field using a string because the dates
 are in YYYYMMDD format.  When I do range queries things are OK as long
 as I don't exceed the built-in number of boolean clauses, so that's a
 range of 3 years, e.g. 1979 to 1981.  The users are not only doing
 complex queries but also want to query over long ranges, e.g. [19790101
 TO 19991231].
 
 Given these requirements, I am thinking of doing a query without the
 date range, bring the unique ids back from the hits and then do a date
 query in the SQL database I have that contains the same data.  Another
 alternative is to do the query without the date range in Lucene and
 then sort the results within the range.  I still have to learn how to
 use the new sorting code and confessed I did not have time to look at
 it yet.
 
 Is there a simpler, easier way to do this?
 
 Claude
 



Re: How to handle range queries over large ranges and avoid Too Many Boolean clauses

2004-05-18 Thread Matt Quail
 Is there a simpler, easier way to do this?
Yes. I have started implementing a QuickRangeQuery class, that doesn't 
have the BooleanQuery limitation, but scores every matching document as 1.0.

I will see if I can get it finished in the next 24 hours, and post back 
to this thread.

=Matt
PS: I'm not sure about the QuickRangeQuery class name... maybe 
NormalizedRangeQuery, RangeQuery2... *shrug*



Re: efficient refinement, order by and range queries

2003-12-23 Thread Erik Hatcher
Geoffrey,

You've done quite a thorough analysis of Lucene.  I'll reply below with
a few tidbits of Lucene trivia in hopes that they will help.

On Dec 22, 2003, at 3:15 PM, Geoffrey Peddle wrote:
One of our applications is a catalog search
application.  In this application our documents are
catalog items.   Each item has a number of
fields/attributes associated with it.   For example
Supplier, Part number, Price, Description.   We use a
search metaphor where end-users iterate issuing
queries and getting feedback about what's available.
So initially we may tell them that 600,000 items are
available from  95 suppliers, and who those suppliers
are.   They may choose to do a free text search for
the phrase "blue pen".  The result of that query may
be to tell them that there are 240 items available from
2 suppliers which match that phrase, and who those
suppliers are.  They may pick one of the suppliers to
see the list of blue pens available from that
supplier.
To accomplish search within search, or search refinement, using a 
QueryFilter will do very nicely.

In addition to wanting the set of attribute values
found in the result documents we would also want to
return counts of the number of documents each
attribute value occurs in in the result document set.
Again, I think a QueryFilter can work well.  There are surely several
ways to go about getting the number of documents in each bucket -
perhaps additional queries should be made to give you those numbers, or
perhaps walking the returned documents to get the unique values.
Walking the documents could be expensive performance-wise, whereas
doing some sub-queries would be quite fast.

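Efficient range queries.

There are certain attributes in our catalog item which
we want to support range queries on.  For example
price.  While it would be unusual for this to be the
primary search criteria (people rarely say "show me
anything that costs greater than $10" in our
application) it's important to have some support for
this.  The trick here is that the criteria may be
very open ended.   For example all items with price
greater than $10 might involve tens of thousands of
prices.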
One suggestion I've seen posted is during indexing to use an additional
field as a group.  In this case, it would be a price range group.
Say "A" means $0 - $10, "B" for $10 - $100, "C" for $100+, for example.
Then you would only have a few terms in that field and a query would
be quite fast.  The drawback is that you need to know at index-time
what the groups are.
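
As a concrete sketch of the grouping idea: the helper below maps a price to
its group letter at index time. The class name, the use of cents, and the
choice that each boundary belongs to the higher group are my assumptions,
not something fixed by the suggestion above:

```java
class PriceGroup {
    /**
     * Returns the group term to index alongside the price.
     * Boundaries are half-open: A = [$0, $10), B = [$10, $100), C = $100 and up.
     */
    static String groupFor(long priceInCents) {
        if (priceInCents < 0) {
            throw new IllegalArgumentException("negative price: " + priceInCents);
        }
        if (priceInCents < 10_00) {
            return "A";
        }
        if (priceInCents < 100_00) {
            return "B";
        }
        return "C";
    }

    public static void main(String[] args) {
        System.out.println(groupFor(9_99));     // prints A
        System.out.println(groupFor(10_00));    // prints B
        System.out.println(groupFor(250_00));   // prints C
    }
}
```

A query for "$10 or more" then only needs the two terms B and C.  A bound
that does not fall on a group boundary would still need a finer check on
the documents in the partially covered group.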

A custom range Filter is another option - and could be created at 
runtime and kept around and only recreated when the index is modified.  
Look at the built-in DateFilter for an example to work with.  This is a 
more pleasant option than doing a RangeQuery when the number of terms 
in the range is large.


Order by attributes.

We need the ability to order the document results set
by a pre-defined set of numeric attributes and would
like the ability to order on alphabetic attributes as
well.
This is an area where Lucene falls short.  My best suggestion is to do 
the sorting yourself, which would require getting at all the documents 
in Hits, which for a large collection would be unreasonable.  There are 
tricks that can be played with boosting during indexing where you can 
tier the boosts of a field in order - but this is really only a hint to 
the scorer to factor the order into the equation but there are many 
other factors.

I'm afraid there is no easy solution here, that I'm aware of.

I have resources for code development and consider it
to be in Ariba's best interest to contribute any code
that we write in this area with the entire community.
Our time frame is to develop a proto-type in the next
couple of months for proof of concept and
benchmarking.
Excellent!  We hope that we can get Lucene under the covers of your 
products - please continue to post to us with more questions and 
hopefully eventually code improvements!

	Erik



efficient refinement, order by and range queries

2003-12-22 Thread Geoffrey Peddle
I'm attempting to make Lucene the document search
solution for the Ariba application suite.  I have
identified functionality gaps in the areas of
refinement, order by and range queries.  Below I
attempt to describe our requirements and begin a
discussion on how to solve them.


Refinement/attribute value collection.
 
One of our applications is a catalog search
application.  In this application our documents are
catalog items.   Each item has a number of
fields/attributes associated with it.   For example
Supplier, Part number, Price, Description.   We use a
search metaphor where end-users iterate issuing
queries and getting feedback about what's available.
So initially we may tell them that 600,000 items are
available from  95 suppliers, and who those suppliers
are.   They may choose to do a free text search for
the phrase "blue pen".  The result of that query may
be to tell them that there are 240 items available from
2 suppliers which match that phrase, and who those
suppliers are.  They may pick one of the suppliers to
see the list of blue pens available from that
supplier.
 
In addition to wanting the set of attribute values
found in the result documents we would also want to
return counts of the number of documents each
attribute value occurs in in the result document set. 
 To limit memory consumption for attributes with many
values in the result document set we could specify a
limit beyond which additional new values would not be
collected, say return at most 200 suppliers which
match the query.
 
Efficient range queries.
 
There are certain attributes in our catalog item which
we want to support range queries on.  For example
price.  While it would be unusual for this to be the
primary search criteria (people rarely say "show me
anything that costs greater than $10" in our
application) it's important to have some support for
this.  The trick here is that the criteria may be
very open ended.   For example all items with price
greater than $10 might involve tens of thousands of
prices.
 
 
Order by attributes.
 
We need the ability to order the document results set
by a pre-defined set of numeric attributes and would
like the ability to order on alphabetic attributes as
well.
 
 
What we are currently thinking of as the solution to
these first 3 gaps is an optional document based
payload or term vector.  We think we could use this in
a hit collector to post-filter any results outside our
range queries, collect refinement values and order the
results.   It seems we could build an external
mechanism similar to what's used in the search beans
for this or extend the core product.  Does this seem
like a reasonable approach?   Any educated guesses on
what the percent increase in search response time will
be if we do a good job?
 
The trickiest part seems to be efficiently collecting
the refinement data for a set of 5-15 attributes.  Any
suggestions on data structures to use to
encode/collect this?   As we develop a design for this
I'll post it for additional feedback.

I have resources for code development and consider it
to be in Ariba's best interest to contribute any code
that we write in this area with the entire community. 
Our time frame is to develop a proto-type in the next
couple of months for proof of concept and
benchmarking.

-Geoffrey









Date fields (was: Re: Range queries)

2003-01-23 Thread Erik Hatcher
Again my apologies to Terry who was patient and confident despite my 
misunderstanding.

I saw a change earlier today (on lucene-dev) with the QueryParser
syntax updated with more details on the range query format.  I think
the ambiguity of saying "a date field" is still there though, since the
Field.Keyword construct supports Date natively now as well as homegrown
YYYYMMDD String fields.

My question now is: what benefit is the Field.Keyword(String,Date) if
it's easier to deal with dates in YYYYMMDD format?

It's nice that QueryParser can deal with that type of field using the
SHORT DateFormat - so I can see a slight advantage when QueryParser is
involved to representing dates as Dates rather than Strings, but the
range query constraints of the open-ended begin or end dates negate
that benefit it seems.

I'm just trying to understand in more detail the finer points of Lucene 
so that I can make more effective use of its features as well as help 
educate others on it.

	Erik


On Thursday, January 23, 2003, at 06:03  AM, Terry Steichen wrote:
Erik,

That's good.  Now I don't have to keep proving what is, is.  Glad it finally
made sense.

Regards,

Terry






Re: Range queries

2003-01-22 Thread Terry Steichen
Tatu,

I believe the range query syntax for the latest Lucene version is
field:[lower TO upper], or field:[null TO upper], or field:[lower TO
null].  In earlier versions replace TO with a dash (-).

I also believe that multiple wildcards (? and/or *) work just fine (as
long as they aren't the first character of the term).

HTH,

Terry

- Original Message -
From: Tatu Saloranta [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, January 22, 2003 11:48 PM
Subject: Range queries


 My apologies if this is a FAQ (which is possible as I am new to Lucene,
 however, I tried checking the web page for the answer).

 I read through the Query syntax web page first, and then checked the
 matching query classes. It seems like query syntax page is missing some
 details; the one I was wondering about was the range query. Since query
 parser seems to construct these queries, I guess they have been
 implemented, even though syntax page didn't explain them. Is that correct?

 Looking at QueryParser, it seems that inclusive range query uses [ and ],
 and exclusive query { and }? Is this right? And does it expect exactly two
 arguments?
 Also, am I right in assuming that range uses lexicographic ordering, so
 that it basically includes all possible words (terms) between specified
 terms (which will work ok with numbers/dates as long as they have been
 padded with zeroes or such)?

 Another question I have is regarding wildcard search. Page mentions that
 there is a restriction that search term can not start with a wildcard (as
 that would render index useless I guess... would need to full scan?).
 However, it doesn't mention if multiple wildcards are allowed? All the
 example cases just have a single wildcard?

 Sorry for the newbie questions,

 -+ Tatu +-

 ps. Thanks to the developers for the neat indexing engine. I am currently
 evaluating it for use in a large-scale enterprise content management
 system.






Re: Range queries

2003-01-22 Thread Erik Hatcher
Unfortunately I don't believe date field range queries work with 
QueryParser, or at least not human-readable dates.

Is that correct?

I think it supports date ranges if they are turned into a numeric 
format, but no human would type that kind of query in.  I'm sure 
supporting true date range queries gets tricky with locale issues and 
such too.

	Erik






Re: Range queries

2003-01-22 Thread Erik Hatcher
Terry and Michael,

Thanks for the clarification.  For some reason I thought that date
fields were represented more oddly than this format.  I stand corrected!

	Erik


On Wednesday, January 22, 2003, at 10:27  AM, Michael Barry wrote:
I utilize the earlier version and queries such as this work fine with
QueryParser:

field:[ 20030120 - 20030125 ]

Of course the back-end indexer canonicalizes all date fields to YYYYMMDD.
The front-end search code is responsible for canonicalizing the
user-entered dates to YYYYMMDD. I think the key here would be either to
not allow users to enter free-form dates (provide some type of UI element
to enter year, month, day separately) or give some copy stating dates
should be in YYYYMMDD format.
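
For what it's worth, a minimal sketch of the front-end canonicalization
step. The month/day/year input format, the class name and the method names
are all assumptions for illustration; the query string uses the newer TO
syntax, and older versions would want a dash there instead:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class DateCanonicalizer {
    // Assumed user-facing format, e.g. 1/20/2003
    private static final DateTimeFormatter USER_INPUT = DateTimeFormatter.ofPattern("M/d/uuuu");
    // Form stored in the index, e.g. 20030120
    private static final DateTimeFormatter INDEX_FORM = DateTimeFormatter.ofPattern("uuuuMMdd");

    /** Converts a user-entered date to YYYYMMDD; throws DateTimeParseException on bad input. */
    static String canonicalize(String userDate) {
        return LocalDate.parse(userDate.trim(), USER_INPUT).format(INDEX_FORM);
    }

    /** Builds an inclusive range query string over canonical dates. */
    static String rangeQuery(String field, String from, String to) {
        return field + ":[" + canonicalize(from) + " TO " + canonicalize(to) + "]";
    }

    public static void main(String[] args) {
        // prints date:[20030120 TO 20030125]
        System.out.println(rangeQuery("date", "1/20/2003", "1/25/2003"));
    }
}
```

Because the canonical form is zero-padded and ordered year, month, day,
plain string comparison of the terms agrees with date order.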

-Mike.





RE: Range queries

2003-01-22 Thread Hui Ouyang
Does this have better performance than using DateFilter?
Regards,
Hui

-Original Message- 
From: Michael Barry [mailto:[EMAIL PROTECTED]] 
Sent: Wed 1/22/2003 10:27 AM 
To: Lucene Users List 
Cc: 
Subject: Re: Range queries



I utilize the earlier version and queries such as this work fine with
QueryParser:

field:[ 20030120 - 20030125 ]

Of course the back-end indexer canonicalizes all date fields to YYYYMMDD.
The front-end search code is responsible for canonicalizing the
user-entered dates to YYYYMMDD. I think the key here would be either to
not allow users to enter free-form dates (provide some type of UI element
to enter year, month, day separately) or give some copy stating dates
should be in YYYYMMDD format.

-Mike.


Re: Range queries

2003-01-22 Thread Tatu Saloranta
On Wednesday 22 January 2003 07:49, Erik Hatcher wrote:
 Unfortunately I don't believe date field range queries work with
 QueryParser, or at least not human-readable dates.

 Is that correct?

 I think it supports date ranges if they are turned into a numeric
 format, but no human would type that kind of query in.  I'm sure
 supporting true date range queries gets tricky with locale issues and
 such too.

Right. In my case that's ok -- the documents I'll be indexing are hybrid 
documents, with some structured/plain text content and additional metadata 
(in DB normalized form). Thus the dates (from normalized metadata fields) can 
easily be converted to numeric form and indexed (for things like last 
modified etc that'd be normally searched via DB).

The other part (UI) needs more work... either need to add a new quoting
mechanism for dates (or just do that for if certain field prefix is used), or 
(more likely) the UI will use simple web forms for constructing query.

Thanks to everyone for quick replies,

-+ Tatu +-






Re: Range queries

2003-01-22 Thread Tatu Saloranta
On Wednesday 22 January 2003 08:27, Michael Barry wrote:
 I utilize the earlier version and queries such as this work fine with
 QueryParser:

 field:[ 20030120 - 20030125 ]

 Of course the back-end indexer canonicalizes all date fields to YYYYMMDD.
 The front-end search code is responsible for canonicalizing the
 user-entered dates to YYYYMMDD. I think the key here would be either to
 not allow users to enter free-form dates (provide some type of UI element
 to enter year, month, day separately) or give some copy stating dates
 should be in YYYYMMDD format.

Thanks, this is along the lines I was thinking too.

-+ Tatu +-







Re: Range queries

2003-01-22 Thread Erik Hatcher
I wanted to see this first-hand, so I wrote some test code to 
understand how dates are represented and how QueryParser deals with 
them.  I've indexed 500 documents with random dates between 1/1/2002 
and 12/31/2002.

Here's what works:

QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
// java.util.Date(year - 1900, month - 1, day)
String begin = DateField.dateToString(new Date(102, 0, 01)); // 20020101
String end = DateField.dateToString(new Date(102, 11, 31)); // 20021231
String q = "date:[" + begin + " TO " + end + "]";
System.out.println("q = " + q);
Query query = parser.parse(q);
System.out.println("query = " + query.toString("date"));
Hits hits = searcher.search(query);
System.out.println("# found = " + hits.length());

Here's the output:

q = date:[0cvx9a8w0 TO 0daddkbk0]
query = [0cvx9a8w0-0daddkbk0]
# found = 500

If I change begin and end to "20020101" and "20021231" respectively I
get zero hits.
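
For anyone puzzling over terms like 0cvx9a8w0: as far as I can tell they
are the date's milliseconds since the epoch written in base 36 and
left-padded with zeros to nine characters.  The sketch below is my own
re-creation of that encoding in plain Java, inferred from the output above
rather than copied from DateField, so treat the details as assumptions:

```java
class DateTermSketch {
    // Assumed fixed width, matching the nine-character terms in the output above
    static final int DATE_LEN = 9;

    /** Encodes epoch milliseconds as a zero-padded base-36 string. */
    static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("time too early: " + millis);
        }
        String s = Long.toString(millis, Character.MAX_RADIX);
        if (s.length() > DATE_LEN) {
            throw new IllegalArgumentException("time too late: " + millis);
        }
        StringBuilder sb = new StringBuilder();
        for (int i = s.length(); i < DATE_LEN; i++) {
            sb.append('0');   // pad so that string order matches numeric order
        }
        return sb.append(s).toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(0L));               // prints 000000000
        System.out.println(encode(1009843200000L));   // midnight UTC, 1 Jan 2002
    }
}
```

If that is right, it explains the zero hits: the terms in the index look
nothing like the text "20020101", so a range between those two strings
contains none of them.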

I'm running the latest Lucene version from CVS, in case that makes a 
difference.

So, while I would love it if QueryParser behaved with the YYYYMMDD
syntax, it does not.  Or am I missing something here?

Any JavaCC wizzes out there that could modify it to take readable date
formats and construct the query using the dateToString?  That would be
sweet!  Has anyone created any JavaScript that mimics the
dateToString functionality that you'd share?

	Erik



On Wednesday, January 22, 2003, at 10:20  AM, Terry Steichen wrote:
Erik,

I believe the question was on range queries in general, which of course
work with the QueryParser.

You can use range queries for dates, provided, as I believe you imply, the
dates are in lexicographic order (i.e., 20030122).  (As to whether dates
expressed as such are too challenging for the average human being, I don't
know.)

Regards,

Terry

PS: Just to clarify, I believe that dates represented this way are
internally treated as strings by Lucene.


Re: Range queries

2003-01-22 Thread Erik Hatcher
Followup: after looking at QueryParser.jj (my apologies for not looking 
deeper before), it *does* support readable date range queries.  But the 
dates must both be valid and adhere to DateFormat.SHORT style.  I 
changed the code I previously sent to this:

String begin = "1/1/02";
String end = "12/31/02";

And the results were the same, all expected documents were returned.  
It does not work to use null for begin or end to leave either side of 
the range open-ended.
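
A minimal sketch of the DateFormat.SHORT behaviour described above, using only java.text. The class and method names are hypothetical, and the US locale is an assumption (the thread does not say which locale was in effect). It shows which of the strings discussed here parse as short-style dates and which are rejected.

```java
import java.text.DateFormat;
import java.text.ParseException;
import java.util.Locale;

public class ShortDateCheck {
    // Returns true if the text parses as a DateFormat.SHORT date.
    // Locale.US is assumed here; other locales expect other layouts.
    public static boolean isShortDate(String text) {
        DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT, Locale.US);
        df.setLenient(true); // lenient parsing, as in the QueryParser range handling
        try {
            df.parse(text);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isShortDate("1/1/02"));    // short-style date
        System.out.println(isShortDate("12/31/02"));  // short-style date
        System.out.println(isShortDate("20020101"));  // digits only, no separators
        System.out.println(isShortDate("null"));      // not a date at all
    }
}
```

Only the first two strings parse, which is consistent with null not working as an open-ended bound for date ranges.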

	Erik


On Wednesday, January 22, 2003, at 08:56  PM, Erik Hatcher wrote:
I wanted to see this first-hand, so I wrote some test code to 
understand how dates are represented and how QueryParser deals with 
them.  I've indexed 500 documents with random dates between 1/1/2002 
and 12/31/2002.

Here's what works:

QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
String begin = DateField.dateToString(new Date(102, 0, 1));   // 20020101
String end = DateField.dateToString(new Date(102, 11, 31));   // 20021231
String q = "date:[" + begin + " TO " + end + "]";
System.out.println("q = " + q);
Query query = parser.parse(q);
System.out.println("query = " + query.toString("date"));
Hits hits = searcher.search(query);
System.out.println("# found = " + hits.length());

Here's the output:

q = date:[0cvx9a8w0 TO 0daddkbk0]
query = [0cvx9a8w0-0daddkbk0]
# found = 500

If I change begin and end to 20020101 and 20021231 respectively I 
get zero hits.
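
The encoded terms in the output above are consistent with a zero-padded base-36 rendering of the date's millisecond timestamp. The following is a sketch of that idea only, not the DateField source; the class name, method name and the nine-character width are assumptions read off the printed terms. The two timestamps are midnight on 1 Jan 2002 and 31 Dec 2002 at UTC-5, which is also an assumption about the time zone the test ran in.

```java
public class DateEncodingSketch {
    // Width of the encoded terms shown in the output (assumed fixed).
    static final int DATE_LEN = 9;

    // Encode milliseconds since the epoch as zero-padded base-36 text,
    // so that textual order matches chronological order.
    public static String encode(long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("time before the epoch");
        }
        StringBuilder sb = new StringBuilder(Long.toString(millis, Character.MAX_RADIX));
        if (sb.length() > DATE_LEN) {
            throw new IllegalArgumentException("time too late");
        }
        while (sb.length() < DATE_LEN) {
            sb.insert(0, '0'); // pad on the left with zeroes
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(1009861200000L)); // 0cvx9a8w0
        System.out.println(encode(1041310800000L)); // 0daddkbk0
        System.out.println(encode(0L));             // 000000000
    }
}
```

A plain string such as 20020101 does not sort between these encoded terms, which would explain the zero hits.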

I'm running the latest Lucene version from CVS, in case that makes a 
difference.

So, while I would love it if QueryParser behaved with the YYYYMMDD 
syntax, it does not.  Or am I missing something here?

Any JavaCC wizzes out there that could modify it to take readable date 
formats and construct the query using the dateToString?  That would be 
sweet!  Has anyone created any JavaScript that mimics the 
dateToString functionality that you'd share?

	Erik






Re: Range queries

2003-01-22 Thread Erik Hatcher
So you are using 1.3dev1?

I'm not having the same effect you are - and looking at QueryParser.jj 
it does this:

  private Query getRangeQuery(String field,
                              Analyzer analyzer,
                              String part1,
                              String part2,
                              boolean inclusive)
  {
    boolean isDate = false, isNumber = false;

    try {
      DateFormat df = DateFormat.getDateInstance(DateFormat.SHORT);
      df.setLenient(true);
      Date d1 = df.parse(part1);
      Date d2 = df.parse(part2);
      part1 = DateField.dateToString(d1);
      part2 = DateField.dateToString(d2);
      isDate = true;
    }
    catch (Exception e) { }
    .
    .
    .
  }

I get all documents if I use date:[0 TO 20021231] but none if I do 
date:[20020101 TO null], and from the code I'm surmising I'm just 
lucky in the first case because it's using a string range query and they 
all just happen to fall in that range (all dateToString values begin with 
0).  In the second case I'm getting nothing because no dateToString values 
are in that range textually.
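
To make the "textual range" point concrete, here is a small sketch that uses String.compareTo as a stand-in for term ordering (an assumption; the class and method names are hypothetical). It checks one of the encoded terms from earlier in the thread against both ranges.

```java
public class TextRangeSketch {
    // True if term lies in [lower, upper] under plain string ordering.
    public static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String indexed = "0cvx9a8w0"; // an encoded date term from the earlier output

        // "0" <= "0cvx9a8w0" <= "20021231": inside the range, but only by accident.
        System.out.println(inRange(indexed, "0", "20021231"));    // true

        // "0cvx9a8w0" sorts before "20020101", so it is outside this range.
        System.out.println(inRange(indexed, "20020101", "null")); // false
    }
}
```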

So are you certain you're seeing the results you report?  I cannot 
explain the results you see from the code above.  Maybe I'm way off in 
my understanding of this - but I really don't see how YYYYMMDD works 
accurately.  DateFormat.parse throws an exception on strings in that 
format (using DateFormat.SHORT), so QueryParser would never think it's 
really a date and convert it properly to its string representation.

Ah, maybe it's how we are indexing our fields differently?  How are you 
indexing your my_date_field?  I'm using this syntax:

	Field.Keyword(fieldName, new Date())

Maybe you are indexing it as a String with YYYYMMDD format?  If so, 
that explains it.

	Erik

On Wednesday, January 22, 2003, at 10:55  PM, Terry Steichen wrote:
As of Lucene 1.3dev1 at least, some range query syntax changes were made.
If you use dates expressed as YYYYMMDD (without doing the dateToString
conversion), the expression my_date_field:[0 TO yyyymmdd] returns all
entries up to the date yyyymmdd, and the expression
my_date_field:[yyyymmdd TO null] returns all entries from the date
yyyymmdd forward.  (I just verified this.)

Terry






Re: Range queries

2003-01-22 Thread Erik Hatcher
I just realized how dense I've been on this thread.  All along Terry 
has been saying he's indexing date fields as YYYYMMDD String fields 
and I just wasn't getting it.  I had my brain locked into thinking the 
fields were being indexed as Date fields.

My apologies for the runaround on this thread.  Back to your regular 
programming...

On Wednesday, January 22, 2003, at 11:28  PM, Erik Hatcher wrote:
Ah, maybe it's how we are indexing our fields differently?  How are you 
indexing your my_date_field?  I'm using this syntax:

	Field.Keyword(fieldName, new Date())

Maybe you are indexing it as a String with YYYYMMDD format?  If so, 
that explains it.

	Erik






Re: Range queries

2003-01-22 Thread saif
Hi,

Thanks man you solved my problem.

Regards,
Saif...

- Original Message -
From: Terry Steichen [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Thursday, January 23, 2003 9:25 AM
Subject: Re: Range queries


 As of Lucene 1.3dev1 at least, some range query syntax changes were made.
 If you use dates expressed as YYYYMMDD (without doing the dateToString
 conversion), the expression my_date_field:[0 TO yyyymmdd] returns all
 entries up to the date yyyymmdd, and the expression
 my_date_field:[yyyymmdd TO null] returns all entries from the date
 yyyymmdd forward.  (I just verified this.)

 Terry






Range queries

2003-01-21 Thread Tatu Saloranta
My apologies if this is a FAQ (which is possible as I am new to Lucene, 
however, I tried checking the web page for the answer).

I read through the Query syntax web page first, and then checked the 
matching query classes. It seems like query syntax page is missing some 
details; the one I was wondering about was the range query. Since query 
parser seems to construct these queries, I guess they have been implemented, 
even though syntax page didn't explain them. Is that correct?

Looking at QueryParser, it seems that inclusive range query uses [ and ], and 
exclusive query { and }? Is this right? And does it expect exactly two 
arguments?
Also, am I right in assuming that range uses lexicographic ordering, so that it
basically includes all possible words (terms) between specified terms (which 
will work ok with numbers/dates as long as they have been padded with zeroes 
or such)?
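
The zero-padding point can be illustrated with plain Java strings. This is only a sketch of lexicographic ordering in general, with hypothetical names, and it makes no claim about how any particular Lucene version stores numbers.

```java
import java.util.Arrays;
import java.util.Locale;

public class PaddedNumbers {
    // Left-pad a non-negative number with zeroes to a fixed width.
    public static String pad(int value, int width) {
        return String.format(Locale.ROOT, "%0" + width + "d", value);
    }

    public static void main(String[] args) {
        // Unpadded: text order disagrees with numeric order.
        String[] raw = {"9", "30", "100"};
        Arrays.sort(raw);
        System.out.println(Arrays.toString(raw));    // [100, 30, 9]

        // Padded to a common width: text order matches numeric order.
        String[] padded = {pad(9, 3), pad(30, 3), pad(100, 3)};
        Arrays.sort(padded);
        System.out.println(Arrays.toString(padded)); // [009, 030, 100]
    }
}
```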

Another question I have is regarding wildcard search. Page mentions that there 
is a restriction that search term can not start with a wild card (as that 
would render index useless I guess... would need to full scan?). However, it 
doesn't mention if multiple wildcards are allowed? All the example cases just 
have single wild card?

Sorry for the newbie questions,

-+ Tatu +-

ps. Thanks to the developers for the neat indexing engine. I am currently 
evaluating it for use in a large-scale enterprise content management system.






RE: Do range queries work?

2001-11-01 Thread Paul Friedman

First, let me apologize for not including a self-contained test in my
original submission.  I will do so for any future issues.

In regard to the test case that you included in your response...

It looks like there is a bug (besides the StandardAnalyzer parsing 20-35 as
a single term).  The query in your example:

 search(searcher, analyzer, "FirstName:[a-k]");

is not finding the correct document.  It is finding doc2; it should find
doc1.  QueryParser is parsing the query into FirstName:[k-null] when it
should be FirstName:[a-k].  Is "a" being caught as a stop word?

Thanks for the quick response.
Paul  

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 01, 2001 11:36 AM
To: 'Lucene Users List'
Subject: RE: Do range queries work?


Can folks please try to include complete, self-contained test cases when
submitting bugs?  It's not that hard, and makes it much easier to figure out
what is going on.

For example, I have attached a complete, self-contained test case for the
bug reported below.  It only took 50 lines.  Interestingly, the results it
returns are quite different than those reported.  Not that it works as
desired!  My results are:

  age:[20-35-null]: 2 hits ( doc1 doc2 )
  age:[35-50-null]: 1 hits ( doc2 )
  FirstName:[k-null]: 1 hits ( doc2 )
  FirstName:[u-z]: 1 hits ( doc2 )
  FirstName:[adam-larry]: 1 hits ( doc1 )

(Note that I'm re-printing the parsed query to show how it was parsed.)  It
looks like StandardAnalyzer parses 20-35 as a single term, causing problems
with the age queries.  The FirstName queries work as desired for me.  I
suspect that in your case the FirstName field may not have been added as a
Field.Text, but as a case-sensitive Field.Keyword.

In any case, it is much easier to figure out what is going on if a complete
test case is provided in the first place.

So far as I can tell, RangeQuery is doing what it is supposed to.

Doug

 -Original Message-
 From: Paul Friedman [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, November 01, 2001 6:46 AM
 To: 'Lucene Users List'
 Subject: Do range queries work?
 
 
 Is range searching working in RC2?
 
 I have the following documents in my index (using StandardAnalyzer):
 
 id:doc1
 age:30
 FirstName:John
 
 id:doc2
 age:40
 FirstName:Wendy
 
 The following queries do not return the expected results (using
 QueryParser with StandardAnalyzer):
 
 age:[20-35] finds 2 documents, should find 1
 age:[35-50] finds 2 documents, should find 1
 
 FirstName:[a-k] finds 0 documents, should find 1
 FirstName:[u-z] finds 0 documents, should find 1
 FirstName:[adam-larry] finds 0 documents, should find 1
 
 Thanks.
 




RE: Do range queries work?

2001-11-01 Thread Doug Cutting

 From: Paul Friedman [mailto:[EMAIL PROTECTED]]
 
 It looks like there is a bug (besides the StandardAnalyzer 
 parsing 20-35 as a single term).  The query in your example:
 
  search(searcher, analyzer, "FirstName:[a-k]");
 
 is not finding the correct document.  It is finding doc2; it 
 should find doc1.  QueryParser is parsing the query into 
 FirstName:[k-null] when it
 should be FirstName:[a-k].  Is "a" being caught as a stop word?

It looks like it.

I think the real bug here is that QueryParser should not analyze the terms
in a range query.  I have modified QueryParser.jj to not do this and all of
your examples work much better.  The only downside that I can see is that
range queries, like prefix queries, become case sensitive.  That seems like
a good tradeoff to me.  Does anyone object to this?
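
A rough sketch of the case-sensitivity tradeoff, under two assumptions: the indexed term was lowercased at index time, and String.compareTo stands in for term ordering. The names are hypothetical. If the range bounds are no longer analyzed, the caller has to lowercase them.

```java
import java.util.Locale;

public class CaseSensitiveRange {
    // True if term lies in [lower, upper] under plain string ordering.
    public static boolean inRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String indexed = "john"; // assumed to be lowercased when indexed

        // Bounds used exactly as typed: 'j' sorts after 'K', so no match.
        System.out.println(inRange(indexed, "A", "K")); // false

        // Bounds lowercased by the caller: the term is found.
        String lower = "A".toLowerCase(Locale.ROOT);
        String upper = "K".toLowerCase(Locale.ROOT);
        System.out.println(inRange(indexed, lower, upper)); // true
    }
}
```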

Doug




 QueryParser.jj



RE: Do range queries work?

2001-11-01 Thread Brian Goetz


Can folks please try to include complete, self-contained test cases when
submitting bugs?  It's not that hard, and makes it much easier to figure out
what is going on.

In the WebMacro project, we've actually gone further than that -- no bug 
report is taken seriously without a JUnit test case that demonstrates 
it.  This has been very successful for us, and along the way, we've built a 
nice little suite of tests.  JUnit test cases are really easy to write, but 
so far, the only one in the Lucene repo is the one for the query parser.

I'd like to see the existing test programs converted into JUnit test cases 
-- I'm willing to do this if someone will tell me how they work and what 
they're supposed to output and how to invoke them.



--
Brian Goetz
Quiotix Corporation
[EMAIL PROTECTED]   Tel: 650-843-1300    Fax: 650-324-8032

http://www.quiotix.com

