RE: Starts With x and Ends With x Queries

2005-02-08 Thread Chong, Herb
i would say that matching root words in German compounds is a text
analysis application.

Herb... 

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 11:08 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

That might be true ... but our application is not a text analysis
application, and it is also not intended to be a search engine. We use
Lucene just to index our pages.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Erik Hatcher wrote:
On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
Hi Erik,
I'm not changing any functionality.  WildcardQuery will still 
support leading wildcard characters, QueryParser will still disallow 
them.  All I'm going to change is the javadoc that makes it sound 
like WildcardQuery does not support leading wildcard characters.

Erik

From what I was reading on the mailing list, there are more Lucene
users who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
composite words, created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and I
will need to patch it again.
Do you think it will be possible in the future to have a field in
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

I have no objections to that type of switch.  Please submit a patch to
QueryParser.jj that implements this as an option with the default to
disallow suffix queries, along with a test case and I'd be happy to
apply it.

I'm pleased to hear that. I'm not very skilled in writing .jj files but
I will try to do it in the next few days.

Sergiu
Erik


Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Chong, Herb wrote:
commercial text analytics tools including search engines usually
tokenize with splitting of compound words for German.
Herb
That might be true ... but our application is not a text analysis
application, and it is also not intended to be a search engine. We use
Lucene just to index our pages.

 Best,
 Sergiu

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 10:38 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

From what I was reading on the mailing list, there are more Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
composite words, created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.



RE: Starts With x and Ends With x Queries

2005-02-08 Thread Chong, Herb
commercial text analytics tools including search engines usually
tokenize with splitting of compound words for German.

Herb 

-Original Message-
From: sergiu gordea [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 08, 2005 10:38 AM
To: Lucene Users List
Subject: Re: Starts With x and Ends With x Queries

From what I was reading on the mailing list, there are more Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
composite words, created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.





Re: Starts With x and Ends With x Queries

2005-02-08 Thread Erik Hatcher
On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote:
Hi Erik,
I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  
All I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik
From what I was reading on the mailing list, there are more Lucene
users who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
composite words, created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and I
will need to patch it again.
Do you think it will be possible in the future to have a field in
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

I have no objections to that type of switch.  Please submit a patch to
QueryParser.jj that implements this as an option with the default to
disallow suffix queries, along with a test case and I'd be happy to
apply it.
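
A switch of this kind could look roughly like the sketch below. It is plain Java, not a patch to QueryParser.jj, and the class name WildcardPolicy and its methods are hypothetical names used only for illustration. The default rejects a leading "*" or "?", in line with the request that suffix queries stay disallowed unless switched on.

```java
final class WildcardPolicy {
    // Default keeps the existing behaviour: leading wildcards are rejected.
    private boolean allowSuffixQueries = false;

    void setAllowSuffixQueries(boolean allow) {
        this.allowSuffixQueries = allow;
    }

    // Returns the term text unchanged, or throws if it starts with a
    // wildcard while the switch is off.
    String check(String termText) {
        boolean leading = termText.startsWith("*") || termText.startsWith("?");
        if (leading && !allowSuffixQueries) {
            throw new IllegalArgumentException(
                "leading wildcard not allowed: " + termText);
        }
        return termText;
    }

    public static void main(String[] args) {
        WildcardPolicy policy = new WildcardPolicy();
        System.out.println(policy.check("haus*"));  // trailing wildcard is always accepted
        policy.setAllowSuffixQueries(true);
        System.out.println(policy.check("*haus"));  // accepted once the switch is on
    }
}
```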

Erik


Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Hi Erik,
I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  
All I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik
From what I was reading on the mailing list, there are more Lucene users
who would like to be able to construct suffix queries.
They are very useful for the German language, because it has many long
composite words, created by concatenating other simple words.
This is one of the requirements of our system. Therefore I needed to
patch Lucene to make QueryParser allow suffix queries.

Now I will need to update the Lucene library to the latest version, and I
will need to patch it again.
Do you think it will be possible in the future to have a field in
QueryParser, boolean ALLOW_SUFFIX_QUERIES?

Thanks for understanding,
 Sergiu



Re: Starts With x and Ends With x Queries

2005-02-07 Thread Luke Shannon
I implemented this concept for my ends with query. It works very well!

- Original Message - 
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "Lucene Users List" 
Sent: Friday, February 04, 2005 9:37 PM
Subject: Re: Starts With x and Ends With x Queries


>
> : Also keep in mind that QueryParser only allows a trailing asterisk,
> : creating a PrefixQuery.  However, if you use a WildcardQuery directly,
> : you can use an asterisk as the starting character (at the risk of
> : performance).
>
> On the issue of "ends with" wildcard queries, I wanted to throw out an
> idea that I've seen used to deal with matches like this in other systems.
> I've never actually tried this with Lucene, but I've seen it used
> effectively with other systems where the goal is to "sort" strings by the
> least significant (i.e. rightmost) characters first.  I think it could
> apply nicely to people who have compelling needs for efficient "ends with"
> queries.
>
>
>
> Imagine you have a field called name, which you can already do efficient
> prefix matching on using the PrefixQuery class.  Your docs and queries may
> look something like this...
>
>    D1> name:"Adam Smith" age:13 state:CA ...
>    D2> name:"Joe Bob" age:42 state:WA ...
>    D3> name:"John Adams" age:35 state:NV ...
>    D4> name:"Sue Smith" age:33 state:CA ...
>
> ...and your queries may look something like...
>
>    // PrefixQuery takes the bare prefix; no trailing "*" is needed.
>    Query q1 = new PrefixQuery(new Term("name","J"));
>    Query q2 = new PrefixQuery(new Term("name","Sue"));
>
> If you want to start doing suffix queries (i.e. all names ending with
> "s", or all names ending with "Smith") one approach would be to use
> WildcardQuery, which as Erik mentioned, will allow you to use a query Term
> that starts with a "*", i.e. ...
>
>    Query q3 = new WildcardQuery(new Term("name","*s"));
>    Query q4 = new WildcardQuery(new Term("name","*Smith"));
>
> (NOTE: Erik says you can do this, but the docs for WildcardQuery say you
> can't. I'll assume the docs are wrong and Erik is correct.)
>
> The problem is that this is horrendously inefficient.  In order to find
> the docs that contain Terms which match your suffix, WildcardQuery must
> first identify what all of those Terms are, by iterating over every Term
> in your index to see if they match the suffix.  This is much slower than a
> PrefixQuery, or even a WildcardQuery that has just 1 initial character
> before a "*" (i.e. "s*foobar"), because those can seek directly to the
> first Term that starts with that character, and also stop iterating as
> soon as they encounter a Term that no longer begins with that character.
>
> Which leads me to my point: if you denormalize your data so that you store
> both the Term you want, and the *reverse* of the term you want, then a
> Suffix query is just a Prefix query on a reversed field -- by sacrificing
> space, you can get all the speed efficiencies of a PrefixQuery when doing
> a SuffixQuery...
>
>    D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
>    D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
>    D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
>    D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...
>
>    // PrefixQuery takes the bare prefix; no trailing "*" is needed.
>    Query q1 = new PrefixQuery(new Term("name","J"));
>    Query q2 = new PrefixQuery(new Term("name","Sue"));
>    Query q3 = new PrefixQuery(new Term("rname","s"));
>    Query q4 = new PrefixQuery(new Term("rname","htimS"));
>
>
> (If anyone sees a flaw in my theory, please chime in)
>
>
> -Hoss



Re: Starts With x and Ends With x Queries

2005-02-07 Thread Erik Hatcher
On Feb 7, 2005, at 2:07 AM, sergiu gordea wrote:
Hi Erik,

"In order to prevent extremely slow WildcardQueries, a Wildcard term
must not start with one of the wildcards * or ?."

I don't read that as saying you cannot use an initial wildcard
character, but rather as if you use a leading wildcard character you
risk performance issues.  I'm going to change "must" to "should".

Will this change be available in the next release of Lucene? How do you
plan to implement this? Will this be available as an attribute of
QueryParser?

I'm not changing any functionality.  WildcardQuery will still support 
leading wildcard characters, QueryParser will still disallow them.  All 
I'm going to change is the javadoc that makes it sound like 
WildcardQuery does not support leading wildcard characters.

Erik


Re: Starts With x and Ends With x Queries

2005-02-06 Thread sergiu gordea
Hi Erik,

"In order to prevent extremely slow WildcardQueries, a Wildcard term
must not start with one of the wildcards * or ?."

I don't read that as saying you cannot use an initial wildcard
character, but rather as if you use a leading wildcard character you
risk performance issues.  I'm going to change "must" to "should".

Will this change be available in the next release of Lucene? How do you
plan to implement this? Will this be available as an attribute of
QueryParser?

 Best,
 Sergiu


Re: Starts With x and Ends With x Queries

2005-02-06 Thread Chris Hostetter

: book Managing Gigabytes, making "*string*" queries drastically more
: efficient for searching (though also impacting index size).  Take the
: term "cat".  It would be indexed with all rotated variations with an
: end of word marker added:
...
: The query for "*at*" would be preprocessed and rotated such that the
: wildcards are collapsed at the end to search for "at*" as a
: PrefixQuery.  A wildcard in the middle of a string like "c*t" would
: become a prefix query for "t$c*".

That's a pretty slick trick.

Considering how many Terms the index would wind up containing in order to
denormalize the data in that way, I wonder if it would be more practical
to index each of the characters as a separate term, with the word repeated
after the "end of word" character, making wildcard searches into "phrase"
searches (after doing preprocessing and rotating as you described).

Ie, index "cat" as:   c a t $ c a t
  search for "*at*" as a phrase search for "a t"
  search for "*at"  as a phrase search for "a t $"
  search for "c*t"  as a phrase search for "t $ c"

...I'm fairly certain that would keep the index size much smaller (the
number of terms would be much smaller, while the average term frequency
wouldn't really increase), but I'm not sure if it would actually be any
faster.  It depends on the algorithm/performance of PhraseQuery -- which is
something I haven't really looked into.  It could very well be
significantly slower.
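
The per-character idea can be tried out without Lucene. The sketch below is plain Java under stated assumptions: the class CharPhraseDemo and its methods are hypothetical, a list of single-character strings stands in for the indexed token positions, and a contiguous sublist search stands in for PhraseQuery. It says nothing about real PhraseQuery performance, and it does not guard against patterns longer than the indexed word.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class CharPhraseDemo {
    // Index side: "cat" becomes the token stream c a t $ c a t.
    static List<String> indexTokens(String word) {
        return chars(word + "$" + word);
    }

    // Query side: rotate the pattern so the wildcard falls off the end,
    // then treat the remaining characters as a phrase.
    static List<String> phraseFor(String pattern) {
        int first = pattern.indexOf('*');
        int last = pattern.lastIndexOf('*');
        if (first < 0) {
            throw new IllegalArgumentException("pattern needs a wildcard");
        }
        if (first == 0 && last == pattern.length() - 1 && first != last) {
            String core = pattern.substring(1, last);
            if (core.indexOf('*') >= 0) {
                throw new IllegalArgumentException("unsupported pattern: " + pattern);
            }
            return chars(core);                       // "*at*" -> a t
        }
        if (first != last) {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        // "X*Y" -> Y $ X; this also covers "*at" (a t $) and "ca*" ($ c a).
        return chars(pattern.substring(first + 1) + "$" + pattern.substring(0, first));
    }

    // Stand-in for a phrase search: the phrase must appear contiguously.
    static boolean matches(String word, String pattern) {
        return Collections.indexOfSubList(indexTokens(word), phraseFor(pattern)) >= 0;
    }

    private static List<String> chars(String s) {
        List<String> out = new ArrayList<>();
        for (char c : s.toCharArray()) {
            out.add(String.valueOf(c));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("cat"));
        System.out.println(phraseFor("c*t") + " matches cat: " + matches("cat", "c*t"));
    }
}
```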


-Hoss





Re: Starts With x and Ends With x Queries

2005-02-06 Thread Erik Hatcher
On Feb 4, 2005, at 9:37 PM, Chris Hostetter wrote:
If you want to start doing suffix queries (i.e. all names ending with
"s", or all names ending with "Smith") one approach would be to use
WildcardQuery, which as Erik mentioned, will allow you to use a query Term
that starts with a "*", i.e. ...

   Query q3 = new WildcardQuery(new Term("name","*s"));
   Query q4 = new WildcardQuery(new Term("name","*Smith"));
(NOTE: Erik says you can do this, but the docs for WildcardQuery say you
can't. I'll assume the docs are wrong and Erik is correct.)

I assume you mean this comment on WildcardQuery's javadocs:

"In order to prevent extremely slow WildcardQueries, a Wildcard term
must not start with one of the wildcards * or ?."

I don't read that as saying you cannot use an initial wildcard 
character, but rather as if you use a leading wildcard character you 
risk performance issues.  I'm going to change "must" to "should".  And 
yes, WildcardQuery itself supports a leading wildcard character exactly 
as you have shown.

Which leads me to my point: if you denormalize your data so that you store
both the Term you want, and the *reverse* of the term you want, then a
Suffix query is just a Prefix query on a reversed field -- by sacrificing
space, you can get all the speed efficiencies of a PrefixQuery when doing
a SuffixQuery...

   D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
   D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
   D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
   D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...

   // PrefixQuery takes the bare prefix; no trailing "*" is needed.
   Query q1 = new PrefixQuery(new Term("name","J"));
   Query q2 = new PrefixQuery(new Term("name","Sue"));
   Query q3 = new PrefixQuery(new Term("rname","s"));
   Query q4 = new PrefixQuery(new Term("rname","htimS"));

(If anyone sees a flaw in my theory, please chime in)

This trick has been mentioned on this list before, and is a good one.  
I'll go one step further and mention another technique I found in the 
book Managing Gigabytes, making "*string*" queries drastically more 
efficient for searching (though also impacting index size).  Take the 
term "cat".  It would be indexed with all rotated variations with an 
end of word marker added:

cat$
at$c
t$ca
$cat
The query for "*at*" would be preprocessed and rotated such that the 
wildcards are collapsed at the end to search for "at*" as a 
PrefixQuery.  A wildcard in the middle of a string like "c*t" would 
become a prefix query for "t$c*".
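
To make the rotation scheme concrete, here is a small plain-Java sketch. It assumes nothing about Lucene: the class RotatedTermDemo and its methods are hypothetical names, a list of strings stands in for the indexed terms, and only patterns with a single "*" or with one "*" at each end are handled.

```java
import java.util.ArrayList;
import java.util.List;

final class RotatedTermDemo {
    static final char END = '$';

    // Index side: "cat" -> cat$, at$c, t$ca, $cat.
    static List<String> rotations(String term) {
        String marked = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < marked.length(); i++) {
            out.add(marked.substring(i) + marked.substring(0, i));
        }
        return out;
    }

    // Query side: rotate the pattern so the wildcard ends up last, and
    // return the text in front of it (the prefix to search for).
    static String toPrefix(String pattern) {
        int first = pattern.indexOf('*');
        int last = pattern.lastIndexOf('*');
        if (first < 0) {
            return pattern + END;                     // exact term: "cat" -> "cat$"
        }
        if (first == 0 && last == pattern.length() - 1 && first != last) {
            String core = pattern.substring(1, last);
            if (core.indexOf('*') >= 0) {
                throw new IllegalArgumentException("unsupported pattern: " + pattern);
            }
            return core;                              // "*at*" -> "at"
        }
        if (first != last) {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        // "X*Y" -> "Y$X", e.g. "c*t" -> "t$c", "*at" -> "at$", "ca*" -> "$ca".
        return pattern.substring(first + 1) + END + pattern.substring(0, first);
    }

    // A term matches if any of its rotations starts with the rewritten prefix.
    static boolean matches(String term, String pattern) {
        String prefix = toPrefix(pattern);
        for (String rotation : rotations(term)) {
            if (rotation.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(rotations("cat"));
        System.out.println(toPrefix("*at*") + " " + toPrefix("c*t"));
    }
}
```

The index-size cost is visible in rotations(): a term of n characters produces n + 1 indexed entries.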

Has anyone tried this technique with Lucene?
Erik


Re: Starts With x and Ends With x Queries

2005-02-04 Thread Peter Pimley
I sent this to the wrong address.  Sorry.
Peter Pimley wrote:

Well done.



Re: Starts With x and Ends With x Queries

2005-02-04 Thread Peter Pimley

Well done.
I was so annoyed with the humiliation-for-kicks this afternoon that I
just practised my self-destruction techniques with some friends this
evening ;)

As for configuration, java.lang.System.getenv will give you access to an
environment variable.

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/System.html


Re: Starts With x and Ends With x Queries

2005-02-04 Thread Chris Hostetter

: Also keep in mind that QueryParser only allows a trailing asterisk,
: creating a PrefixQuery.  However, if you use a WildcardQuery directly,
: you can use an asterisk as the starting character (at the risk of
: performance).

On the issue of "ends with" wildcard queries, I wanted to throw out an
idea that I've seen used to deal with matches like this in other systems.
I've never actually tried this with Lucene, but I've seen it used
effectively with other systems where the goal is to "sort" strings by the
least significant (i.e. rightmost) characters first.  I think it could
apply nicely to people who have compelling needs for efficient "ends with"
queries.



Imagine you have a field called name, which you can already do efficient
prefix matching on using the PrefixQuery class.  Your docs and queries may
look something like this...

   D1> name:"Adam Smith" age:13 state:CA ...
   D2> name:"Joe Bob" age:42 state:WA ...
   D3> name:"John Adams" age:35 state:NV ...
   D4> name:"Sue Smith" age:33 state:CA ...

...and your queries may look something like...

   // PrefixQuery takes the bare prefix; no trailing "*" is needed.
   Query q1 = new PrefixQuery(new Term("name","J"));
   Query q2 = new PrefixQuery(new Term("name","Sue"));

If you want to start doing suffix queries (i.e. all names ending with
"s", or all names ending with "Smith") one approach would be to use
WildcardQuery, which as Erik mentioned, will allow you to use a query Term
that starts with a "*", i.e. ...

   Query q3 = new WildcardQuery(new Term("name","*s"));
   Query q4 = new WildcardQuery(new Term("name","*Smith"));

(NOTE: Erik says you can do this, but the docs for WildcardQuery say you
can't. I'll assume the docs are wrong and Erik is correct.)

The problem is that this is horrendously inefficient.  In order to find
the docs that contain Terms which match your suffix, WildcardQuery must
first identify what all of those Terms are, by iterating over every Term
in your index to see if they match the suffix.  This is much slower than a
PrefixQuery, or even a WildcardQuery that has just 1 initial character
before a "*" (i.e. "s*foobar"), because those can seek directly to the
first Term that starts with that character, and also stop iterating as
soon as they encounter a Term that no longer begins with that character.
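
The difference between the two scans can be illustrated outside Lucene. In the sketch below a sorted TreeSet stands in for the term dictionary; the class TermScanDemo, its Scan holder, and the sample terms are assumptions made for this example, not Lucene internals. The prefix scan seeks to the first candidate and stops early, while the suffix scan has to look at every term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class TermScanDemo {
    // Matching terms plus the number of terms the scan had to examine.
    static final class Scan {
        final List<String> hits = new ArrayList<>();
        int examined;
    }

    // Seek to the first term >= prefix, stop at the first non-matching term.
    static Scan prefixScan(TreeSet<String> terms, String prefix) {
        Scan scan = new Scan();
        for (String term : terms.tailSet(prefix)) {
            scan.examined++;
            if (!term.startsWith(prefix)) {
                break;
            }
            scan.hits.add(term);
        }
        return scan;
    }

    // No ordering helps with a suffix, so every term is examined.
    static Scan suffixScan(TreeSet<String> terms, String suffix) {
        Scan scan = new Scan();
        for (String term : terms) {
            scan.examined++;
            if (term.endsWith(suffix)) {
                scan.hits.add(term);
            }
        }
        return scan;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
            "adam", "adams", "bob", "joe", "john", "smith", "sue"));
        Scan p = prefixScan(terms, "jo");
        Scan s = suffixScan(terms, "s");
        System.out.println(p.hits + " after examining " + p.examined + " terms");
        System.out.println(s.hits + " after examining " + s.examined + " terms");
    }
}
```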

Which leads me to my point: if you denormalize your data so that you store
both the Term you want, and the *reverse* of the term you want, then a
Suffix query is just a Prefix query on a reversed field -- by sacrificing
space, you can get all the speed efficiencies of a PrefixQuery when doing
a SuffixQuery...

   D1> name:"Adam Smith" rname:"htimS madA" age:13 state:CA ...
   D2> name:"Joe Bob" rname:"boB oeJ" age:42 state:WA ...
   D3> name:"John Adams" rname:"smadA nhoJ" age:35 state:NV ...
   D4> name:"Sue Smith" rname:"htimS euS" age:33 state:CA ...

   // PrefixQuery takes the bare prefix; no trailing "*" is needed.
   Query q1 = new PrefixQuery(new Term("name","J"));
   Query q2 = new PrefixQuery(new Term("name","Sue"));
   Query q3 = new PrefixQuery(new Term("rname","s"));
   Query q4 = new PrefixQuery(new Term("rname","htimS"));


(If anyone sees a flaw in my theory, please chime in)
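
A minimal way to check the reversed-field idea without an index is sketched below in plain Java. The class ReversedFieldDemo and its methods are hypothetical, a sorted TreeSet stands in for the terms of the rname field, and the names are treated as single untokenized values, as in the D1..D4 listing above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class ReversedFieldDemo {
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Index side: keep a sorted set of reversed values (the "rname" field).
    static TreeSet<String> buildReversedIndex(List<String> names) {
        TreeSet<String> rnames = new TreeSet<>();
        for (String name : names) {
            rnames.add(reverse(name));
        }
        return rnames;
    }

    // Query side: "ends with suffix" becomes a prefix scan for reverse(suffix).
    static List<String> endsWith(TreeSet<String> rnames, String suffix) {
        String prefix = reverse(suffix);
        List<String> hits = new ArrayList<>();
        for (String rname : rnames.tailSet(prefix)) {
            if (!rname.startsWith(prefix)) {
                break;                      // sorted order: no later term can match
            }
            hits.add(reverse(rname));       // turn the value back around for display
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> rnames = buildReversedIndex(
            Arrays.asList("Adam Smith", "Joe Bob", "John Adams", "Sue Smith"));
        System.out.println(endsWith(rnames, "Smith"));
        System.out.println(endsWith(rnames, "s"));
    }
}
```

Hits come back in the sort order of the reversed values, not of the original names.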


-Hoss





Re: Starts With x and Ends With x Queries

2005-02-04 Thread Erik Hatcher
It matches both because you're tokenizing the name field.  In both 
documents, the name field has a "testing" term in it (it gets 
lowercased also).  A PrefixQuery matches terms that start with the 
prefix.  Use an untokenized field type (Field.Keyword) if you want to 
keep the entire original string as-is for searching purposes - however 
you'd have issues with case-sensitivity in your example.
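
The effect can be reproduced with a toy model. The sketch below is plain Java and does not use Lucene: the class TokenizedPrefixDemo is hypothetical, its tokenize method is only a crude stand-in for an analyzer (split on non-alphanumerics, lowercase), and the sample name values are invented because the original documents are not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizedPrefixDemo {
    // Crude analyzer stand-in: split on non-alphanumerics and lowercase.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String token : value.split("[^\\p{Alnum}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Tokenized field: the prefix is tested against every term of the value.
    static boolean matchesTokenized(String value, String prefix) {
        for (String term : tokenize(value)) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Untokenized field: the whole original string is the only term.
    static boolean matchesUntokenized(String value, String prefix) {
        return value.startsWith(prefix);
    }

    public static void main(String[] args) {
        String startsWith = "Testing the system";   // hypothetical name values
        String endsWith = "System testing";
        System.out.println(matchesTokenized(startsWith, "testing"));
        System.out.println(matchesTokenized(endsWith, "testing"));
        System.out.println(matchesUntokenized(startsWith, "testing"));
        System.out.println(matchesUntokenized(endsWith, "testing"));
    }
}
```

In this model both values match on the tokenized field, and the untokenized field only matches when the case of the stored string agrees with the prefix.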

Also keep in mind that QueryParser only allows a trailing asterisk, 
creating a PrefixQuery.  However, if you use a WildcardQuery directly, 
you can use an asterisk as the starting character (at the risk of 
performance).

Erik
On Feb 4, 2005, at 7:50 PM, Luke Shannon wrote:
Hello;
I have these two documents:
[Field listings for the two documents (Text and Keyword fields) were not preserved in the archive.]

I would like to be able to match name fields that start with "testing"
(specifically) and those that end with it.
I thought the code below would parse to a PrefixQuery that would satisfy my
"starts with" requirement (maybe I don't understand what this query is for).
But this matches both.

Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer());

Has anyone done this before? Any tips?
Thanks,
Luke



Starts With x and Ends With x Queries

2005-02-04 Thread Luke Shannon
Hello;

I have these two documents:

[Field listings for the two documents (Text and Keyword fields) were not preserved in the archive.]

I would like to be able to match name fields that start with "testing"
(specifically) and those that end with it.

I thought the code below would parse to a PrefixQuery that would satisfy my
"starts with" requirement (maybe I don't understand what this query is for).
But this matches both.

Query query = QueryParser.parse("testing*", "name", new StandardAnalyzer());

Has anyone done this before? Any tips?

Thanks,

Luke


